warc-files Search Results

1000+ results
for warc-files


lintool/warcbase #250

use WET files from CommonCrawl

CommonCrawl has the [WET files](http://commoncrawl.org/the-data/get-started/), which are WARC files where the HTML response has been converted to plain text (and non-HTML pages have been removed). Is it p…

dportabella updated 8 years ago
7
killercup/static-filez #9

Define the archive format

Let's define the format of our archives. ## Current state A binary file that is actually just concatenated gzip blobs. Features: 1. Extract gzip files 2. Append is trivial ## Prior art…

killercup updated 4 years ago
10
PromyLOPh/crocoite #10

Replace warcio

The API is not exactly pretty and it’s easy to mess things up. There are no plausibility checks and no validation. We want: - A nice/clean API that separates WARC and its payloads. warcio mixes WAR…

PromyLOPh updated 5 years ago
2
webrecorder/pywb #370

Possible to index and replay web archive with custom archive…

I have a web archive with a custom directory structure (recorded in other software). Is it possible to scan this structure automatically for new warc files without moving them to the pywb collection f…

peterk updated 3 years ago
2
iipc/warc-specifications #1

Add guidelines to describe recording WARC-level provenance?

It seems that there is a desire to record provenance of WARC files, e.g. in the case of concatenation. See http://ws-dl.blogspot.co.uk/2014/09/2014-09-02-warcmerge-merging-multiple.html That proposal…

anjackson updated 9 years ago
3
cs531-f19/discussions #85

WARC File Comparison

Compare and contrast the resulting WARC files on the `https://odu.edu/compsci` URI generated by any two of the following tools: * [Wget](https://www.gnu.org/software/wget/manual/wget.html#index-WARC)…

ibnesayeed updated 4 years ago
4
commoncrawl/commoncrawl #17

Different formats?

Is it possible to get this in ZIM file format to use with https://kiwix.org/en/? This is an offline internet project which enables the creation of ZIM files, an archive which can be browse…

spydaz updated 3 months ago
1
netarchivesuite/webarchive-discovery #8

Support offsets for Twitter JSON-Lines

The `some`-branch adds support for direct indexing of JSON-Lines from the Twitter API. Retrieval is handled by storing the raw JSON in the Solr field `tw_json`, but that inflates the index. Another…

tokee updated 3 years ago
1
internetarchive/Sparkling #3

`s3a` URLs don't work in `WarcLoader` (`Wrong FS: s3a://...`…

EDIT: this helped with `Wrong FS`, more tickets incoming ;) ```sc.hadoopConfiguration.set("fs.defaultFS", "s3a://commoncrawl/")``` Hey folks, I'm trying to read some common crawl data from S3...…

acruise updated 7 months ago
1
netarchivesuite/solrwayback #377

Experimental test of speeding up WARC-indexer

It is worth testing how much speedup is gained by not recalculating the SHA-1 hash and trusting the WARC header instead. Note that for old ARC files, we still have to calculate the hash.

thomasegense updated 1 year ago
1
