-
Received the following error more than once:
```
panic: open jobs/warcs/TCPK-20240826191443122-00001-crawl918.us.archive.org.warc.gz.open: too many open files
goroutine 119 [running]:
github.com…
```
-
### Browsertrix Cloud Version
v1.9.3-79a217b
### What did you expect to happen? What happened instead?
I have found some new WARC fields and files in the newest WACZ from beta.browsertrix release: …
-
I was surprised that the example provided in the documentation:
``` python
>>> import warcat.model
>>> warc = warcat.model.WARC()
>>> warc.load('example/at.warc.gz')
>>> len(warc.records)
```
Reads everythi…
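
For comparison, here is a minimal sketch of counting records without holding them all at once. It does not use warcat's API; it relies only on the Python standard library, and the names `iter_warc_records` and `count_records` are hypothetical helpers introduced for illustration. It assumes well-formed records that each carry a `Content-Length` header, and it reads each record's headers and then skips the body in chunks.

```python
import gzip


def iter_warc_records(stream):
    """Yield (headers, content_length) per record from an uncompressed WARC byte stream.

    Hypothetical helper: assumes every record has a valid Content-Length header.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read the named header fields up to the blank line that ends them.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.partition(b":")
            headers[name.strip().lower().decode("latin-1")] = (
                value.strip().decode("latin-1")
            )

        # Skip the record body in bounded chunks instead of keeping it.
        length = int(headers.get("content-length", "0"))
        remaining = length
        while remaining > 0:
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                raise EOFError("record body is shorter than Content-Length")
            remaining -= len(chunk)

        yield headers, length


def count_records(path):
    """Count the records in a gzip-compressed WARC file, one record at a time."""
    # gzip.open reads multi-member gzip files (one member per record) transparently.
    with gzip.open(path, "rb") as stream:
        return sum(1 for _ in iter_warc_records(stream))
```

Calling `count_records('example/at.warc.gz')` would then return the record count while only ever buffering one chunk of a record body.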
sirex updated
2 weeks ago
-
Following up on #270, we want to continue migrating backups from S3 to B2. This should include:
- [x] rss-fetcher postgres backups (old files migrated, new files written to B2)
- [x] start writin…
-
Honestly, I feel like I should implement a command-line switch to generate WARC files while downloading threads, so I can upload them to the [Wayback Machine](http://web.archive.org/) or do whatever e…
-
Hello,
I would like to know if it is possible to get both WARC files compressed (not only the metadata one).
Thanks
nasry updated
6 years ago
-
Originated from https://github.com/webrecorder/warcio.js/issues/21#issuecomment-816835171
Files (links expire in 7 days):
- Broken file from warcio.js: https://share.fromtheexchange.space/file/sp…
-
Sorry for asking a question not related to grab-site, but I don't really know where I should ask it.
I archived a big forum that recently went down. Unfortunately without the search function it's a…
-
340 WARC files of the news crawl data set, from 2020-09-12 to 2020-10-04, have been captured using [HTTP/2](https://en.wikipedia.org/wiki/HTTP/2) after a [Java security upgrade](https://mai…
-
### Description
Add a Datasource for reading data from WARC/ARC files.
### Use case
When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution (Dask appears to be le…