-
When looking at this project, I saw that dynamically linking to an "archive" of a website via URLs, if/after I set up IPWB on a subdomain or a website, is not really possible, as IPWB sets itself up t…
-
We want to be able to obtain all web and media content associated with a specific list of pre-identified domain names.
This issue tracks domain names identified in the [**BigScience Data Cataloging Ev…
-
I had to downgrade Guava to accommodate UKWA's WARC Hadoop indexer:
https://github.com/lintool/warcbase/blob/master/pom.xml#L277
But this issue now appears to be fixed:
https://github.com/ukwa/webarch…
-
In July, @vbanos reported invalid gzip data in a WARC written by warcprox with `--writer-threads=5`.
My benchmarking suggests that 1 writer thread is optimal:
https://github.com/internetarchive/war…
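
For anyone trying to reproduce the invalid gzip report: a compressed WARC is normally a series of concatenated gzip members, so one rough way to check a file is to walk the members and let zlib verify each header and CRC trailer. The sketch below uses only the Python standard library; `count_valid_gzip_members` is a hypothetical helper name, not part of warcprox, and it reads the whole input into memory, so it is only suitable for small test files.

```python
import gzip
import zlib


def count_valid_gzip_members(data: bytes) -> int:
    """Count the gzip members in data; raise ValueError if any member is corrupt."""
    view = memoryview(data)
    members = 0
    pos = 0
    while pos < len(data):
        # wbits=31 makes zlib expect a gzip header and verify the CRC32 trailer.
        decomp = zlib.decompressobj(wbits=31)
        try:
            decomp.decompress(view[pos:])
        except zlib.error as exc:
            raise ValueError(f"invalid gzip data at byte offset {pos}: {exc}") from exc
        if not decomp.eof:
            raise ValueError(f"truncated gzip member starting at byte offset {pos}")
        # unused_data holds the bytes that follow the member that just ended.
        pos = len(data) - len(decomp.unused_data)
        members += 1
    return members


# Stand-in for a .warc.gz file: two independently compressed members.
sample = gzip.compress(b"record one") + gzip.compress(b"record two")
print(count_valid_gzip_members(sample))
```

Running this against files written with different `--writer-threads` values would show whether corruption appears only with multiple writer threads, and the reported byte offset indicates where the first bad member starts.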
-
.zip or .tgz or .warc
-
As suggested by Noah Levitt @ Internet Archive.
-
![image](https://cloud.githubusercontent.com/assets/10646050/20856432/c334272e-b8cb-11e6-8dfa-cbd0a5e92e32.png)
-
Hi,
I'm trying to read the ClueWeb09 WARC file, but no data is emitted and no error is raised.
It seems that ClueWeb's separator is different from standard warc files, I have forked this repository for [chan…
dod91 updated 8 years ago
-
Although WARCs created by Larch have support for streams, we're currently unable to use them for the server.
The issue is this:
1. We use a ReadSeeker to be able to scrub in a stream
2. If the arc…
-
Is there a config option for splitting downloaded files out into their own WARC files instead of writing them all into the same one?
This would allow easier data extraction based on individual items.