-
Opening here for you to triage; run [67615](https://farm.zimit.kiwix.org/pipeline/67615c43-0078-483f-a016-d14ed92cfc8f/debug) failed when warc2zim tried to load one of the WARCs:
```
Processing WAR…
```
-
CombineWARC seems to create WARCs from all the WARCs in the folder after one run, but is there no way to create size-limited WARCs out of a single run?
For example: if one has one crawl running daily, …
-
This is a follow-up to https://github.com/webrecorder/browsertrix-crawler/issues/451. In this issue, we (Kiwix) had asked to have information about the initiator of the request in a WARC header to dif…
-
I am using the local executor. My machine has 48 CPUs with 348 RAM. Any idea how to speed this up? Currently a single task (task=1, running for one warc.gz file of ~1 GB) takes half an hour. Thi…
-
Following up on #270, we want to continue migrating backups from S3 to B2. This should include:
- [x] rss-fetcher postgres backups (old files migrated, new files written to B2)
- [x] start writin…
-
WARC support would be great. It's used by at-scale web archives across the world as the standard file format for web archiving. More information at https://en.wikipedia.org/wiki/WARC_(file_format)
Mos…
-
### Description
Add a Datasource for reading data from WARC/ARC files.
### Use case
When cleaning pre-training data for LLMs, Ray Data is nearly the only distributed solution (Dask appears to be le…
-
### ArchiveWeb.page Version
v0.11.3
### What did you expect to happen? What happened instead?
When downloading WARC 1.1 files and ingesting them in SolrWayback via the UKWA warcindexer, I expected to have rep…
-
I was wondering if it would be possible to allow `cat` with a wildcard:
when there is a folder with some WARCs inside, a command-line usage like:
`warc cat *.warc.gz >> combine_full.warc.gz`
would be…
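
Until something like this exists in the tool, the request above can be approximated outside it. The following is only a sketch using the Python standard library, not the `warc` command's own behaviour: `concat_warcs` is a hypothetical helper name, and it assumes each input is a valid gzip file, relying on the fact that concatenated gzip members still form a valid gzip stream. Unlike the `>>` in the command above, it overwrites the output file rather than appending to it.

```python
import glob
import shutil
from pathlib import Path


def concat_warcs(pattern: str, output: str) -> list[str]:
    """Byte-concatenate every file matching `pattern` into `output`.

    Hypothetical helper for illustration; files are taken in sorted order.
    """
    out_path = Path(output).resolve()
    # Expand the wildcard before creating the output, and skip the output
    # file itself in case the pattern would also match it.
    inputs = [p for p in sorted(glob.glob(pattern))
              if Path(p).resolve() != out_path]
    with open(out_path, "wb") as dst:
        for path in inputs:
            with open(path, "rb") as src:
                # Raw copy: no decompression or WARC parsing is done here.
                shutil.copyfileobj(src, dst)
    return inputs
```

For example, `concat_warcs("*.warc.gz", "combine_full.warc.gz")` would combine every `.warc.gz` in the current folder. No WARC records are validated or deduplicated along the way.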
-
### Browsertrix Cloud Version
v1.9.3-79a217b
### What did you expect to happen? What happened instead?
I have found some new WARC fields and files in the newest WACZ from beta.browsertrix release: …