-
We have all the information, so why not let SolrWayback act as a [CDX-server](https://github.com/webrecorder/pywb/wiki/CDX-Server-API), so that there is no need for a separate index? (suggested by Yve…
-
Heritrix has problems with `` support: https://github.com/internetarchive/heritrix3/pull/179/ resulting in attempted harvesting of `img1%20img2` instead of `img1` and `img2`. Surprisingly it works wit…
-
We currently have:
https://github.com/ukwa/ukwa-heritrix/blob/da5ebf02010f3e3abfed4025db5f6a3d0c6a4631/src/main/java/uk/bl/wap/modules/extractor/ExtractorHTTPWellKnownURIs.java#L25
But as per ht…
-
Hello,
I'm attempting to archive some of our agency sites and have run into this issue.
The agency sites themselves do not go through our proxy, and pages are archived fine.
However there is con…
-
Dear development team,
We would like to crawl our intranet, and have installed Heritrix 3.4 on a Linux server.
The crawling starts, but stops immediately at the authentication phase.
Our intranet…
-
I'm currently setting up the Heritrix/WCT/OWA stack and have come quite far (the first crawls are running, and I can see them in the WCT interface and do quality review).
However, in the heritrix logs I am seein…
-
Websites / departments in my organisation usually have a robots.txt with the following simple entry:
```
User-agent: *
Disallow: /*?*
Sitemap: https://www.[domain].org/sitemaps/[domain].xml
```
…
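
As an illustration of what the `Disallow: /*?*` rule above covers, here is a minimal sketch of wildcard matching as described in the Robots Exclusion Protocol (RFC 9309), where `*` matches any run of characters and a trailing `$` anchors the end of the URL. The function names `rule_to_regex` and `is_disallowed` are hypothetical helpers written for this example; this is not how Heritrix itself evaluates robots.txt.

```python
import re


def rule_to_regex(rule: str) -> re.Pattern:
    """Translate a robots.txt path rule into a compiled regex (hypothetical helper).

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = rule.endswith("$")
    if anchored:
        rule = rule[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    pattern = ".*".join(re.escape(part) for part in rule.split("*"))
    return re.compile(pattern + ("$" if anchored else ""))


def is_disallowed(path_and_query: str, rule: str = "/*?*") -> bool:
    """Return True if the rule matches from the start of the path."""
    return rule_to_regex(rule).match(path_and_query) is not None


# Example paths are assumptions chosen for illustration.
for candidate in ("/about", "/search?q=heritrix", "/sitemaps/site.xml"):
    print(candidate, is_disallowed(candidate))
```

Under these assumptions the rule blocks every URL that carries a query string, while plain paths such as the sitemap location stay crawlable.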
-
Dear devs,
I have tried to install heritrix3 on the command line on my system and have found a problem.
The output of `uname -a` is:
```
Linux lenovomlf 4.15.0-99-generic #100-Ubuntu SMP Wed Apr 2…
-
I am extending the Heritrix module `org.archive.modules.fetcher.FetchHTTP` and overriding the `innerProcess` method so that a headless browser fetches the content instead of the built-in Heritrix HTTP re…
-
Hi,
I'm writing a paper and would like to cite Heritrix. What or whom do I need to reference? I found the following:
- [Mohr, G., Stack, M., Ranitovic, I., Avery, D., & Kimpton, M. (2004, July). An…