warc-files Search Results

CorentinB/warc #43

Panic on too many open files error, should retry instead

Received the following error more than once: ``` panic: open jobs/warcs/TCPK-20240826191443122-00001-crawl918.us.archive.org.warc.gz.open: too many open files goroutine 119 [running]: github.com…

willmhowes updated 1 month ago

webrecorder/browsertrix #1588

Document new WARC fields in 1.x crawler-produced WACZ files

### Browsertrix Cloud Version v1.9.3-79a217b ### What did you expect to happen? What happened instead? I have found some new WARC fields and files in the newest WACZ from beta.browsertrix release: …

tuehlarsen updated 1 week ago

chfoo/warcat #14

Add easy way to iterate over warc records

I was surprised that example provided in documentation:

```python
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
```

Reads everythi…

sirex updated 2 weeks ago

mediacloud/story-indexer #291

migrate more backups to BackBlaze to reduce costs

Following up on #270, we want to continue migrating backups from S3 to B2. This should include: - [x] rss-fetcher postgres backups (old files migrated, new files written to B2) - [x] start writin…

rahulbot updated 1 month ago

bibanon/BASC-Archiver #3

Generating WARC files

Honestly, I feel like I should implement a command-line switch to generate WARC files while downloading threads, so I can upload them to the [Wayback Machine](http://web.archive.org/) or do whatever e…

DanielOaks updated 9 years ago

netarchivesuite/netarchivesuite #34

Compressed warc files

Hello, I would like to know if it is possible to get both WARC files compressed (not only the metadata one). Thanks.

nasry updated 6 years ago

webrecorder/warcio.js #22

Invalid Warc Files

Originated from https://github.com/webrecorder/warcio.js/issues/21#issuecomment-816835171 Files (links expire in 7 days): - Broken file from warcio.js: https://share.fromtheexchange.space/file/sp…

jlarmstrongiv updated 3 years ago

ArchiveTeam/grab-site #166

Make WARC files searchable

Sorry for asking a question not related to grab-site, but I don't really know where should I ask it. I archived a big forum that recently went down. Unfortunately without the search function it's a…

Svekla updated 9 months ago

commoncrawl/news-crawl #42

Do not use "http/2" protocol version in HTTP headers in WARC…

340 WARC files of the news crawl data set, starting from 2020-09-12 until 2020-10-04 have been captured using [HTTP/2](https://en.wikipedia.org/wiki/HTTP/2) after a [Java security upgrade](https://mai…

sebastian-nagel updated 3 months ago

ray-project/ray #45535

[Data] Add WarcDatasource for reading WARC/ARC files

### Description Add a Datasource for reading data from WARC/ARC files. ### Use case In cleaning of pre-training data for LLM, Ray Data is nearly the only distributed solution (Dask appears to be le…

ryan-minato updated 4 months ago

1000+ results for warc-files
