-
Preface: I don't think this is an OpenWPM bug, but I'm wondering if anybody has suggestions on how to work around it.
Platform: OpenWPM v0.28, Selenium 4.21.0, Geckodriver 0.34.0.
When the ext…
-
Hi all,
as previously announced in Slack, we wanted to classify the URLs, and we hope to have this solved soon. We classified over 110M different hostnames. In this issue, I want to give you an ov…
nrllh updated
3 months ago
-
## Summary
Called out in our Slack channel, but Greenwood should definitely have some support for sitemaps, which are XML files used to tell search engines about the content and pages contained wit…
-
Hi All!
I realize this should largely be about the actual 'crawling' of the sites - but given this was such a breeze with this tool I now find myself with the issue that the text that has been cra…
-
We want to do generative, per-site crawling.
The first solution, in V0:
* build a knowledge base and build a crawler that would be generative -> only reaches the generated site
L…
-
`AMIDownloadTool` is a wrapper for various ways of crawling and scraping sites. The best developed is `biorxiv`. This is complex:
* Manual search on `biorxiv` gives a hit list in HTML
* we turn this in…
-
Please deploy to a central maven repository, e.g. http://central.sonatype.org/pages/ossrh-guide.html, so we can use the 2.0-SNAPSHOT and 1.2 release from maven. It's a bit of a nightmare trying to inc…
-
The API is up, but I need some time to avoid using it. I'm trying to avoid API calls because the API data requires frequent updates. This can be removed as `p1`; just add a `backlog` tag/la…
-
To simplify crawling of hosted SCM instances owned by public administrations, we could come up with something like:
```html
```
By putting this into the homepages of administration's sites…
rasky updated
6 years ago
-
```
Version 0.1a of act.jar seems to be using Crawljax2.0.
Version 2.0 has an issue crawling some AJAX sites.
An example is
http://demo.tutorialzine.com/2009/09/simple-ajax-website-jquery/demo.html
…