-
It seems that calling `warc-indexer` with thousands of WARC files causes the `tmp` folder to fill up (possibly due to DROID temporary files). It should be possible to clean up as the run progresses.
-
## Describe the bug
A scrape is consistently producing two WARC files that cannot be loaded by replayweb.page. I was having some issues using warcat on the files produced by this scrape, too, but I h…
-
The GZIP spec includes support for one or more members (`A gzip file consists of a series of "members" (compressed data sets).`),
but this spec currently states `A gzip stream may only contain one "me…`
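The multi-member behaviour from RFC 1952 is easy to demonstrate: concatenating two independently compressed streams yields one valid gzip file, and a conforming reader decodes every member in sequence.

```python
import gzip

# Two independently compressed members, concatenated byte-for-byte,
# form a single valid gzip stream per RFC 1952.
stream = gzip.compress(b"first record ") + gzip.compress(b"second record")

# Python's gzip module decodes all members and concatenates the output.
assert gzip.decompress(stream) == b"first record second record"
```

This is exactly the layout record-at-a-time compressed WARCs rely on: each record is its own gzip member, so records can be located and decompressed independently.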
-
WARC parsing sometimes results in records being truncated.
This might be because the parser continues to read one line at a time, looking for newlines, even when parsing the content body, and might…
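A robust reader switches from line-oriented parsing to an exact byte-count read once the headers end. As a minimal sketch (the framing is simplified; real WARC records also carry a trailing CRLF CRLF after the body):

```python
import io

def read_record(stream):
    """Read one WARC-style record: line-oriented headers, then an exact
    Content-Length body read. Simplified framing for illustration."""
    version = stream.readline().rstrip(b"\r\n")  # e.g. b"WARC/1.0"
    headers = {}
    for line in iter(stream.readline, b"\r\n"):  # headers end at a blank line
        name, _, value = line.rstrip(b"\r\n").partition(b":")
        headers[name.strip().lower()] = value.strip()
    # The crucial step: read the body by byte count, never line by line,
    # so binary payloads containing newlines are not truncated.
    length = int(headers[b"content-length"])
    body = stream.read(length)
    return version, headers, body
```

Reading the body with `stream.read(length)` means embedded newlines (or anything that merely looks like a record boundary) can never cut the record short.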
-
WARC inherited line folding from HTTP, which presumably included it for compatibility with MIME messages, which have line-length limits. The newer HTTP RFCs [deprecated it](https://datatracker.ietf.org/…
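For readers that still need to accept folded headers, the classic RFC 822 rule is simple: a line beginning with a space or tab continues the previous header's value. A minimal unfolding pass might look like:

```python
def unfold_headers(raw):
    """Join folded header lines: a line starting with SP or HT continues
    the previous header value (classic RFC 822 line folding). The fold
    is collapsed into a single space."""
    unfolded = []
    for line in raw.split("\r\n"):
        if line[:1] in (" ", "\t") and unfolded:
            # Continuation line: append to the previous header.
            unfolded[-1] += " " + line.strip()
        else:
            unfolded.append(line)
    return "\r\n".join(unfolded)
```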
-
This is a real n00b question. Sorry if I'm missing something obvious.
I've pointed the Explorer at a set of WARC and ARC files and can get results back from my local Wayback machine query interface. …
-
We use pywb to serve a ~ 488 MB WARC file (https://webrecorder.io/layoutanalysis/2015_2016) to [scrapy](https://scrapy.org/)/[splash](http://splash.readthedocs.io/en/stable/), which injects a layout a…
-
Asked by Andy Jackson
> Secondly, when using the WARC writer, how does it cope with large downloads? We sometimes see > 2GB files - would it handle those?
Need to test the WARC writer in isolation t…
-
In some of the mega WARCs produced by Archive Team, extracting all the WARCs just to save a few is infeasible, as it can take at least two days to extract them all using `warcat`.
One might have already…
gwern updated 8 years ago
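An alternative to full extraction is a single indexing pass that records each record's byte offset and length, after which any one record can be pulled with a `seek()` and a bounded `read()`. The sketch below assumes a simplified, uncompressed WARC-style framing (headers, blank line, `Content-Length` body, trailing CRLF CRLF); for record-at-a-time gzipped megaWARCs the same idea applies to gzip member offsets instead.

```python
import io

def index_offsets(stream):
    """One pass over an uncompressed WARC-style stream, recording
    (offset, length) for every record so individual records can later
    be extracted with seek()+read() instead of unpacking everything."""
    offsets = []
    while True:
        start = stream.tell()
        if not stream.readline():          # version line; empty at EOF
            return offsets
        headers = {}
        while True:
            line = stream.readline().rstrip(b"\r\n")
            if not line:                   # blank line ends the headers
                break
            name, _, value = line.partition(b":")
            headers[name.strip().lower()] = value.strip()
        length = int(headers[b"content-length"])
        stream.seek(length + 4, io.SEEK_CUR)  # body + trailing CRLF CRLF
        offsets.append((start, stream.tell() - start))
```

With the index in hand, saving "just a few" records is a handful of seeks rather than a two-day extraction.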
-
Was wondering if it was possible to use this as a website-specific search, in place of the "powered by Google" search you often see. If so, what would the process of setting this up look like? I did tr…