-
It seems that there is a desire to record provenance of WARC files, e.g. in the case of concatenation. See http://ws-dl.blogspot.co.uk/2014/09/2014-09-02-warcmerge-merging-multiple.html
That proposal…
-
While validating WARCs at the National Archives of the Netherlands we encountered the hopsFromSeed field. We could not find an explanation of the values, other than on Twitter or in source code of WAR…
-
An example might be to ask the user whether they want to move/copy the WARCs to the archives folder, replay them immediately in a certain engine, recrawl the URIs in the WARCs, etc.
-
I've created a WACZ file from a warc.gz with the latest py-warcz package, 0.4.5.
Original file https://cdn1.ruarxive.org/public/webcollect2022/ngo2022/cafrussia.ru/cafrussia.ru.warc.gz (179MB)
Produced WACZ …
-
Using this [WARC](http://www.cs.odu.edu/~mkelly/semester/2017_summer/reuters2.warc) displays many URIs in the interface. This is noisy and the order is arbitrary (currently alphabetical? based on order i…
-
### Browsertrix Version
v1.11.3-12f994b
### What did you expect to happen? What happened instead?
When you download wacz files using the API you get wacz filenames like "20230225142507561-manual-20…
-
Our current workflow for generating summary catalogue records for the web archive crawls (so-called 'title-level records') works like this:
- Index everything into Solr using warc-indexer.
- Run Metad…
-
Add an Airflow DAG, based on the `rclone/rclone` Docker image, running e.g.
rclone copy --hdfs-namenode h3nn.wa.bl.uk:54310 --hdfs-username ingest --max-age 24h --no-traverse /mnt/gluster/fc/h…
-
There is a known bug where warc-extractor.py does not handle Windows paths properly.
Windows has far more restrictions on what is an appropriate path than Linux. Unfortunately, dealing with all of t…
-
The W3CDTF reference appears in the definitions of `WARC-Date` and `WARC-Refers-To-Date` but somehow vanished from the normative references list in the preparation of version 1.1.
> [W3CDTF] Date a…
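
To illustrate why the W3CDTF reference matters for `WARC-Date` and `WARC-Refers-To-Date`, here is a minimal sketch of a syntactic check. It assumes the six W3CDTF granularities (year down to fractional seconds), restricted to UTC with a trailing `Z` and 1 to 9 fractional-second digits, which is my reading of WARC 1.1 and not a quote from it. The names `W3CDTF_UTC` and `looks_like_warc_date` are hypothetical and do not come from any existing tool.

```python
import re

# Assumption: W3CDTF granularities, UTC only ("Z"), 1-9 fractional digits.
# This checks syntax only; it does not reject impossible dates such as
# 2016-02-31.
W3CDTF_UTC = re.compile(
    r"\d{4}"                               # YYYY
    r"(?:-(?:0[1-9]|1[0-2])"               # -MM
    r"(?:-(?:0[1-9]|[12]\d|3[01])"         # -DD
    r"(?:T(?:[01]\d|2[0-3]):[0-5]\d"       # Thh:mm
    r"(?::[0-5]\d(?:\.\d{1,9})?)?"         # :ss and optional .s
    r"Z)?)?)?"                             # time part must end in Z
)


def looks_like_warc_date(value: str) -> bool:
    """Return True if value is syntactically a UTC W3CDTF timestamp."""
    return W3CDTF_UTC.fullmatch(value) is not None


if __name__ == "__main__":
    for candidate in ("2016-01-11T23:24:25Z", "2016-01", "2016-01-11 23:24:25"):
        print(candidate, looks_like_warc_date(candidate))
```

A regular expression keeps the sketch dependency-free. A real validator would probably also parse the value to catch invalid calendar dates.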