-
Hi,
When I ran the following command to download the dataset from the Hugging Face Hub, I encountered an error:
My command:
```
from datasets import load_dataset
ds = load_dataset("mlfoundation…
-
Received the following error more than once:
```
panic: open jobs/warcs/TCPK-20240826191443122-00001-crawl918.us.archive.org.warc.gz.open: too many open files
goroutine 119 [running]:
github.com…
-
Opening here for you to triage; run [67615](https://farm.zimit.kiwix.org/pipeline/67615c43-0078-483f-a016-d14ed92cfc8f/debug) failed when warc2zim tried to load one of the WARC
```
Processing WAR…
-
If I crawl a website with mostly static resources, I'm noticing there can be missing resources in the resulting WARC. The reason for that is either broken links or timeouts.
I have written tools to…
-
Following up on #270, we want to continue migrating backups from S3 to B2. This should include:
- [x] rss-fetcher postgres backups (old files migrated, new files written to B2)
- [x] start writin…
-
WARC support would be great. It's used at scale by web archives across the world as the standard file format for web archiving. More information at https://en.wikipedia.org/wiki/WARC_(file_format)
Mos…
-
Motivation:
* To allow the recording of messages using a different representation to their wire message format as
- the wire protocol may be suboptimal for the purposes of storage and replay; or
…
-
CombineWARC seems to combine all the WARCs in the folder after one run, but is there no way to create size-limited WARCs out of a single run?
For example: if one has one crawl running daily, …
-
Currently warcat gives the following error on revisit records from a deduplicated WARC:
```
Record failed validation
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-package…
-
The thing is being parsed correctly from the config file, and it's being instantiated.
But it's not preventing the same URL from being revisited inside the moratorium specified.
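
To make the expected behavior concrete, here is a minimal sketch of what I would expect a revisit moratorium to do. It is not the project's actual code: the class name `RevisitMoratorium`, the `allow` method, and the `clock` parameter are all hypothetical and only illustrate the check that seems to be missing.

```python
import time


class RevisitMoratorium:
    """Hypothetical helper: rejects a URL that was visited less than
    `seconds` ago. Names here are illustrative, not the project's API."""

    def __init__(self, seconds, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock          # injectable so the check can be tested
        self._last_visit = {}       # url -> time of the last allowed visit

    def allow(self, url):
        """Return True and record the visit if the URL is outside the
        moratorium window; return False otherwise."""
        now = self.clock()
        last = self._last_visit.get(url)
        if last is not None and now - last < self.seconds:
            return False            # still inside the moratorium
        self._last_visit[url] = now
        return True
```

With a 10-second moratorium, a second `allow()` call for the same URL 5 seconds after the first should return `False`, and a call 11 seconds after the first should return `True`. What I am seeing corresponds to the second call being allowed.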