-
Hi,
I have some WARC files created using [warcit](https://github.com/webrecorder/warcit). After indexing them (without errors or warnings), I can't find any of the pages they contain in SolrWayback. I…
-
We've set up WCT and set the operator contact URL in the profile; however, this data does not seem to propagate to the Heritrix job configuration. I've attached four screenshots. Any idea what the prob…
-
For in-company use of web archives, I'm experimenting with transforming older Heritrix crawls to WACZ (thanks to py-wacz). One of these transforms results in 48 GB and reports 325,000 pages. Using this …
-
With the latest version (20210803) and many versions before it, when a job is terminated, one CPU thread seems to stay stuck at 100% while doing nothing. This does not go away until I restart Heritrix…
-
- User agent getting blocked
- Problems with some characters in URLs
-
Use warcio.js to write rendered versions to WARCs rather than pushing to the proxy (which limits us to using warcprox).
Make sure the WARC records are the same as under the warcprox implementation.
Rota…
-
I am trying to extend Heritrix. I have configured my pom.xml like this to build a single JAR with all the Heritrix dependencies:
```
<modelVersion>4.0.0</modelVersion>
<groupId>io.test</groupId>
<artifactId>extended-heritrix</artifactId>
<version>1.0-SNA…
-
- [ ] Change Heritrix User Agent --> "Arquivo-web-crawler (compatible; heritrix/3.4.0-20200304 +https://arquivo.pt/faq-crawling)"
- [ ] Add User Agent to Arquivo Patcher
- [ ] Update http://arquivo.…
-
```
SEVERE: org.archive.crawler.framework.CrawlJob beansException Failed to start bean 'warcWriterViralOld'; nested exception is java.lang.RuntimeException: java.io.FileNotFoundException: File '/heri…
-
The error happens with the latest OpenJDK (16.0.1).
It works fine with the LTS version (OpenJDK 11.0.11).
```
Sat Jul 24 09:57:57 PM EEST 2021 Starting heritrix
Linux f34 5.13.4-200.fc34.x86_64 #1 SMP Tue Jul…