-
If the API times out or the script breaks in the middle of creating an AIP, the partially created AIP currently has to be deleted before the script runs again so that it can be finished correctly. For AIPs with a lot of…
-
The import fails with `sqlite3.IntegrityError: NOT NULL constraint failed: payloads.hash` if a WARC record does not include a WARC-Payload-Digest. This is the case for record types which are not suppo…
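
One possible workaround, sketched here under assumptions rather than taken from the importer's code, is to compute a fallback hash whenever the header is missing so that `payloads.hash` is never NULL. The function name `payload_digest` and the plain `dict` of WARC headers are hypothetical; the `sha1:` plus Base32 form follows the convention commonly seen in WARC digest headers.

```python
import base64
import hashlib


def payload_digest(warc_headers, payload):
    """Return the declared WARC-Payload-Digest, or compute one as a fallback.

    warc_headers: hypothetical dict of WARC header names to values.
    payload: the record payload as bytes.
    """
    declared = warc_headers.get("WARC-Payload-Digest")
    if declared:
        return declared
    # No digest in the record: hash the payload ourselves (SHA-1, Base32).
    sha1 = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(sha1).decode("ascii")
```

Whether a computed digest is appropriate for record types without a payload is a design decision for the importer; skipping such records is an equally valid choice.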
-
I have WARC files collected with node-warc 3.1.0 that cannot be opened in Webrecorder player (No pages found). The only distinguishing characteristic is that the files are archived from Facebook posts wi…
-
There was a suggestion that the extension include a function to be able to save the currently viewed website as a Web ARChive (WARC) file locally on the user's computer. This could be a feature for a …
-
I am attempting to index [a WARC from Archive-It](https://matkelly.com/IA/ARCHIVEIT-2349-ANNUAL-KBAWJW-20110217001046-00000-crawling113.us.archive.org-6682.warc) using ipwb from the current master bra…
-
I'm seeking feedback on a decision regarding setting a _sensible default_ for writing WARC records to the distributed web. It has implications for de-duplication between archives, and might also have …
-
Given [this whitespace-related header bug](https://github.com/commoncrawl/nutch/issues/5) that crept into the August 2018 Common Crawl crawl, it would be nice if it were somewhat difficult to create b…
-
The following properties in solrwaybackweb.properties:
export.csv.maxresults=10000000
export.warc.maxresults=1000000
export.warc.expanded.maxresults=10000
are used to prevent overly large exports. But the count …
-
[WARC](http://iipc.github.io/warc-specifications/) is a well-known format for storing crawled captures. It can store an arbitrary number of HTTP requests and responses along with other network interactions…
-
The WARC parsing sometimes results in records being truncated.
This might be due to the parser continuing to look for newlines/read one line at a time, even when parsing the content body, and might…
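
To illustrate the suspected cause, here is a minimal sketch (not the parser's actual code) that reads only the header block line by line and then reads the content block as exactly `Content-Length` bytes, so newlines inside the body cannot cut the record short. The function name `read_record` is hypothetical, and the sketch assumes an uncompressed binary stream.

```python
import io


def read_record(stream):
    """Read one WARC record from a binary stream; return None at end of input."""
    version = stream.readline().strip()
    if not version:
        return None
    headers = {}
    while True:
        line = stream.readline()
        # A blank line ends the header block.
        if line in (b"\r\n", b"\n", b""):
            break
        name, _, value = line.decode("utf-8").partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers["content-length"])
    # Read the body by byte count, not by line.
    body = stream.read(length)
    if len(body) != length:
        raise ValueError("record body shorter than Content-Length")
    stream.read(4)  # consume the two CRLFs that terminate the record
    return version.decode("utf-8"), headers, body
```

For example, a body that itself contains a newline is returned whole:

```python
raw = b"WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 11\r\n\r\nline1\nline2\r\n\r\n"
version, headers, body = read_record(io.BytesIO(raw))
print(body)
```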