-
CommonCrawl has the [WET files](http://commoncrawl.org/the-data/get-started/), which are WARC files in which the HTML responses have been converted to plain text (and non-HTML pages have been removed).
Is it p…
-
Let's define the format of our archives.
## Current state
A binary file that is actually just concatenated gzip blobs.
Features:
1. Extract gzip files
2. Append is trivial
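
As a minimal sketch of the two features above, assuming each record is stored as its own gzip member: appending is a plain byte append, and individual records can be recovered by decompressing one member at a time. The helper names `append_record` and `iter_records` are hypothetical and not part of any existing tool.

```python
import gzip
import zlib
from typing import Iterator


def append_record(archive: bytearray, payload: bytes) -> None:
    """Append one payload as its own gzip member (hypothetical helper)."""
    archive.extend(gzip.compress(payload))


def iter_records(archive: bytes) -> Iterator[bytes]:
    """Yield the decompressed payload of each concatenated gzip member."""
    remaining = bytes(archive)
    while remaining:
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        payload = decomp.decompress(remaining) + decomp.flush()
        yield payload
        # Bytes after the end of this member belong to the next one.
        remaining = decomp.unused_data


archive = bytearray()
append_record(archive, b"first record")
append_record(archive, b"second record")
records = list(iter_records(archive))
```

Because every member is independently compressed, a reader that knows a member's byte offset could start decompressing there without touching the rest of the file.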
## Prior art…
-
The API is not exactly pretty and it’s easy to mess things up. There are no plausibility checks and no validation. We want:
- A nice/clean API that separates WARC and its payloads. warcio mixes WAR…
-
I have a web archive with a custom directory structure (recorded in other software). Is it possible to scan this structure automatically for new WARC files without moving them to the pywb collection f…
-
It seems that there is a desire to record provenance of WARC files, e.g. in the case of concatenation. See http://ws-dl.blogspot.co.uk/2014/09/2014-09-02-warcmerge-merging-multiple.html
That proposal…
-
Compare and contrast the resulting WARC files on the `https://odu.edu/compsci` URI generated by any two of the following tools:
* [Wget](https://www.gnu.org/software/wget/manual/wget.html#index-WARC)…
-
Is it possible to get this in ZIM file format for use with https://kiwix.org/en/?
This is an offline internet project that enables the creation of ZIM files, an archive which can be browse…
-
The `some`-branch adds support for direct indexing of JSON-Lines from the Twitter API. Retrieval is handled by storing the raw JSON in the Solr field `tw_json`, but that inflates the index.
Another…
-
EDIT: this helped with `Wrong FS`, more tickets incoming ;)
```sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")```
Hey folks, I'm trying to read some common crawl data from S3...…
-
It is worth testing how much of a speedup is gained by trusting the digest recorded in the WARC header instead of recalculating the SHA-1 hash.
Note that for old ARC files, we still have to calculate the hash.
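
The choice described above could be sketched roughly as follows. This assumes the digest is available in a `WARC-Payload-Digest` header in the commonly used `sha1:<base32>` form, and that ARC records are represented by an empty header mapping. The function names and the `trust_header` flag are hypothetical, not taken from any existing codebase.

```python
import base64
import hashlib
from typing import Mapping


def sha1_base32(payload: bytes) -> str:
    """Compute a SHA-1 digest in the 'sha1:<base32>' form (assumed format)."""
    digest = hashlib.sha1(payload).digest()
    return "sha1:" + base64.b32encode(digest).decode("ascii")


def payload_digest(
    headers: Mapping[str, str], payload: bytes, trust_header: bool = True
) -> str:
    """Return the recorded digest if trusted and present, else recompute it."""
    recorded = headers.get("WARC-Payload-Digest")
    if trust_header and recorded:
        return recorded  # fast path: no hashing of the payload
    # ARC records (no such header) and untrusted headers fall through here.
    return sha1_base32(payload)
```

Timing `payload_digest` over the same set of records with `trust_header=True` and `trust_header=False` would give a first estimate of the speedup.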