-
This is a real n00b question. Sorry if I'm missing something obvious.
I've pointed the Explorer at a set of WARC and ARC files and can get results back from my local Wayback machine query interface. …
-
I discovered today that warcio mangles the HTTP header data when it isn't pure ASCII. Specifically, I am dealing with a server that returns ISO-8859-1 headers.
As far as I can tell, this behaviour …
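For illustration only, here is a minimal sketch of why ISO-8859-1 header bytes get damaged when they are treated as ASCII, and how a fallback decode keeps them intact. This is not warcio's actual code: the header value and the `decode_header_line` helper are hypothetical and use only the Python standard library.

```python
# Hypothetical header line, encoded the way an ISO-8859-1 server might send it.
raw = 'Content-Disposition: attachment; filename="café.pdf"\r\n'.encode("iso-8859-1")


def decode_header_line(data: bytes) -> str:
    """Decode header bytes, trying UTF-8 first and falling back to ISO-8859-1.

    ISO-8859-1 maps every byte to a code point, so the fallback never fails
    and the original bytes can always be recovered by re-encoding.
    """
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data.decode("iso-8859-1")


# Treating the bytes as ASCII replaces the non-ASCII byte and loses information.
lossy = raw.decode("ascii", errors="replace")

# The fallback decode preserves the character and round-trips to the same bytes.
decoded = decode_header_line(raw)
roundtrip = decoded.encode("iso-8859-1")
```

Comparing `lossy` with `decoded` shows the difference: the first contains a replacement character where the `é` was, while the second can be re-encoded to exactly the bytes the server sent.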
-
**Issue:** The JHOVE WARC-KB module gives different results compared to JWAT
Because the JHOVE WARC-KB module uses the JWAT-WARC library, the output results are expected to be similar.
E.g.
The WARC file [LU…
-
Some crawlers create multiple WARC files, which is important when WARC files have to be uploaded to storage with a limit on single-file size. I have a lot of archived websites split into 5-50 5GB WA…
ivbeg updated
2 years ago
-
Can we add a new WARC reader using [fastwarc](https://resiliparse.chatnoir.eu/en/latest/man/fastwarc.html)?
It is said to be much more [efficient](https://arxiv.org/abs/2112.03103) than warcio.
-
Not really an issue, but a limitation of https://github.com/eligrey/FileSaver.js .
This could be problematic when appending to existing WARCs or creating WARCs from multiple web pages at once.
-
Definition: nothing is said about the HTTP/2 protocol, which could give the impression that WARC files cannot hold documents harvested over HTTP/2.
Decision: a few sentences on the handling of the HTTP 2.X protocol sho…
-
Hi @Marlin-Na,
while searching for examples of how Common Crawl data is used, I stumbled upon this nice project and just looked at the following comments:
https://github.com/Marlin-Na/CommonCrawlDL/b…
-
I successfully (I think) captured and generated a WARC file using https://electronjs.org/docs/api/debugger.
I tried a simple site: www.drupal.org
If I capture the first load, it seems to work ni…
-
On my own site (e.g., https://matkelly.com), I reference some fonts to be included and used in the CSS of the web page, e.g.,
``````
The resource resolution procedure never fetches these, so th…
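As a rough sketch of one way such references could be discovered, the snippet below pulls `url(...)` targets out of a stylesheet and resolves them against the stylesheet's own URL. It is not the project's implementation: the sample CSS, the `example.org` URLs, and the `css_resource_urls` helper are all hypothetical, and the regular expression is a simplification that ignores escapes and `@import` strings.

```python
import re
from urllib.parse import urljoin

# Matches url(...) with optional single or double quotes around the target.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+?)['"]?\s*\)""", re.IGNORECASE)


def css_resource_urls(css_text: str, base_url: str) -> list[str]:
    """Return absolute URLs for every url(...) reference in css_text."""
    return [urljoin(base_url, target) for target in CSS_URL.findall(css_text)]


# Hypothetical stylesheet with two font references.
sample_css = """
@font-face {
  font-family: "Example";
  src: url("fonts/example.woff2") format("woff2"),
       url(/fonts/example.woff) format("woff");
}
"""

urls = css_resource_urls(sample_css, "https://example.org/css/site.css")
```

Relative references are resolved against the stylesheet URL rather than the page URL, which is how browsers treat `url()` inside external CSS.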