-
In Advanced>Heritrix, if a crawl job has been created but not yet built (crawler-beans.cxml is the sole file), and "Rebuild crawl job" is selected from the contextual menu, the information about the j…
-
We hit occasional problems during checkpoint writing. These are mostly due to a non-checkpoint log file not being present when attempting to make a checkpoint, due to some earlier issue (not 100% cle…
-
**Description**:
In August 2021, Chrome will change its installability criteria, and a PWA will not be installable if the page is not available offline.
Currently we have offline support and an offline pa…
-
Hi,
Some days ago I created a WARC file with Heritrix. Webrecorder Player discovers around 10,000 pages; replayweb discovers 0. There certainly are pages and URLs in that WARC file. Is this a bug? Or maybe…
-
Does Heritrix 3.3 support wildcards in robots.txt disallow directives? I urge that either a "yes" or a "no" answer be added to the documentation.
From my experimentation, it appears that it does not …
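For readers unsure what "wildcards in disallow directives" means here, the following is a minimal sketch of the pattern semantics described in RFC 9309 (`*` matches any run of characters, a trailing `$` anchors the end of the path). It is not Heritrix code and says nothing about what Heritrix does; the function names `robots_pattern_to_regex` and `is_disallowed` are hypothetical, and the sketch ignores `Allow` precedence and percent-encoding.

```python
import re
from typing import List


def robots_pattern_to_regex(pattern: str) -> "re.Pattern":
    """Translate one robots.txt path pattern into a compiled regex.

    Hypothetical helper: '*' matches any sequence of characters and a
    trailing '$' anchors the match at the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then rejoin them with '.*' for each '*'.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_disallowed(path: str, disallow_patterns: List[str]) -> bool:
    """Return True if any non-empty Disallow pattern matches the path.

    re.match anchors at the start of the path, so a pattern without
    wildcards behaves as a plain prefix match. An empty Disallow value
    disallows nothing and is skipped.
    """
    return any(
        robots_pattern_to_regex(p).match(path) is not None
        for p in disallow_patterns
        if p
    )
```

A crawler without wildcard support would treat `/*.pdf$` as a literal prefix, so a path such as `/private/a.pdf` would not be blocked; with the sketch above it is.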
-
Each WARC record has three subsections: **response, request, metadata.**
How can I _access_ each individual subsection?
It seems that ArchiveSpark returns **only** the **response** subsection or am I…
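As background to this question: in the WARC format, request, response, and metadata are stored as separate records, each identified by its `WARC-Type` header. The sketch below is not ArchiveSpark code; it is a minimal standard-library reader, assuming an uncompressed WARC stream, that groups record bodies by type. The names `iter_warc_records` and `group_by_type` are hypothetical.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream.

    Assumption: the stream is a binary file-like object and every record
    carries a Content-Length header, as the WARC format requires.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def group_by_type(stream):
    """Collect record bodies into a dict keyed by their WARC-Type value."""
    groups = {}
    for headers, body in iter_warc_records(stream):
        groups.setdefault(headers.get("WARC-Type", "unknown"), []).append(body)
    return groups
```

Calling `group_by_type(open(path, "rb"))` on an uncompressed file would give separate lists under keys such as `"request"`, `"response"`, and `"metadata"`; gzip-compressed WARCs would need decompressing first.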
-
When crawling a site, links marked `nofollow` were not followed even though the site was marked as `ignoreRobots`. It turns out that using the `calculateRobotsOnly` method used in the `ignoreRobots`…
-
Occasionally, QA PyWb will have difficulty playing back an instance. The error will look like this in QA PyWb:
![image](https://user-images.githubusercontent.com/18530934/97594812-f2be2c80-19fa-11eb-…
-
The observador.pt website from 2015 has poor replay quality. This is not a robots.txt problem, and this website is being harvested daily. http://arquivo.pt/wayback/20150419012300/http://observador.pt…
-
Installed Heritrix 3.3.0 on a Linux server. (3.4.0 fails consistently when editing a configuration.) Out-of-the-box configuration, just set the seed and the operatorContactUrl.
I tell it to crawl…