-
The synthesize warc command will unintentionally switch back to the original stream instead of the raw stream. The bug seems to be resolved by making deep copies of all variables from the original str…
-
EDIT: this helped with `Wrong FS`, more tickets incoming ;)
```scala
sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")
```
Hey folks, I'm trying to read some common crawl data from S3...…
-
As well as moving properly-named WARCPROX WARCs into place, `tidy_warcs` should:
- [ ] Move default-named `WARCPROX-###` WARCs into the closest matching job folder.
- [ ] Find any older WARCs that…
-
From file tinypic_20190902063634_c9c1ee22.megawarc.warc.gz on IA:
WARC-Filename: %(warc_file_base)s-deduplicated.warc.gz
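The header value above looks like a Python %-style template that was written out without being interpolated. A minimal sketch of what the expansion would presumably produce; the base name is an assumption taken from the file name in this report, and `expand` is a hypothetical helper, not a function from the actual pipeline:

```python
# Template exactly as it appears in the WARC-Filename header above.
template = "%(warc_file_base)s-deduplicated.warc.gz"

# Assumed value: derived from the file name in the report, not confirmed.
warc_file_base = "tinypic_20190902063634_c9c1ee22"


def expand(template: str, **values: str) -> str:
    """Apply Python %-style named interpolation to a filename template."""
    return template % values


expanded = expand(template, warc_file_base=warc_file_base)
print(expanded)
```

If the header is built this way, the fix would be to run the interpolation before the record is written, so the literal `%(...)s` never reaches the file.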
-
Archiveteam actively uses various WARC compression formats, for example megawarc.zst or warc.xz. What do you think: should these be implemented or not? Which formats should be supported?
https:…
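To make the question concrete, here is a rough sketch of how a reader could tell these containers apart by their leading magic bytes before picking a decompressor. The function name `sniff_compression` and the returned labels are hypothetical; only the byte signatures come from the gzip, xz and zstd format definitions:

```python
import gzip
import lzma

# Leading magic bytes of the compression formats mentioned above.
GZIP_MAGIC = b"\x1f\x8b"
XZ_MAGIC = b"\xfd7zXZ\x00"
ZSTD_MAGIC = b"\x28\xb5\x2f\xfd"  # 0xFD2FB528, stored little-endian


def sniff_compression(head: bytes) -> str:
    """Guess the compression of a WARC file from its first bytes."""
    if head.startswith(GZIP_MAGIC):
        return "gzip"
    if head.startswith(XZ_MAGIC):
        return "xz"
    if head.startswith(ZSTD_MAGIC):
        return "zstd"
    # A zstd stream may also begin with a skippable frame,
    # magic 0x184D2A50..0x184D2A5F stored little-endian.
    if len(head) >= 4 and 0x50 <= head[0] <= 0x5F and head[1:4] == b"\x2a\x4d\x18":
        return "zstd"
    if head.startswith(b"WARC/"):
        return "none"
    return "unknown"


sample = b"WARC/1.0\r\nWARC-Type: warcinfo\r\n\r\n"
print(sniff_compression(gzip.compress(sample)))
print(sniff_compression(lzma.compress(sample)))
print(sniff_compression(sample))
```

Sniffing the content rather than trusting the extension means `.megawarc.zst`, `.warc.xz` and plain `.warc` files could all go through the same entry point. Decompressing zstd itself would still need a third-party library, which is not shown here.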
-
I'm trying to upload files captured with webrecorder to conifer, as webrecorder doesn't seem to have such difficulty accessing facebook. I downloaded the .wacz and unzipped the file and tried upload…
-
One important alternate application of this library would be to export data from the WARC files, to output HTML and other metadata.
For example, the Internet Archive has the only snapshots of 4chana…
-
A .bat file containing the line `warc-peek.py www.reddit.com-inf-20180420-085636-7m9j5-00001.warc.gz 2447 1306208683` using [this warc](https://archive.org/download/archiveteam_archivebot_go_201804202…
-
Definition: nothing is said about the HTTP/2 protocol, which could give the impression that WARC files cannot hold documents harvested over HTTP/2.
Decision: a few sentences on the handling of the HTTP 2.X protocol sho…