-
I am trying to remove duplicate pages as follows:
```
val r = RecordLoader.loadArchives("/directory/to/arc/file.arc.gz", sc)
.keepValidPages()
.groupBy(_.getUrl).values.map(_.head) // remove dup…
```
-
It seems that shiori depends on [warc](https://github.com/go-shiori/warc) which is currently archived. We need to find a replacement for warc. Maybe [obelisk](https://github.com/go-shiori/obelisk)?
#…
-
Hello,
Thank you for all of your great work. I am just trying to download and process the English dumps from CommonCrawl up to 2023, but I have been running into multiple errors.
It seems as if the …
-
I wanted to offer some thoughts on the /webdata endpoint in general and some possible areas of improvement for supporting other services, such as Webrecorder.
One issue that I see is the time for h…
-
I'm currently trying to decompress [this archive file](https://archive.org/download/archiveteam_youtubedislikes_20211211123803_62132b09/youtubedislikes_20211211123803_62132b09.1638107855.megawarc.warc…
-
Being able to index and re-index collections that are located on remote storage (S3) would be very helpful.
-
We use pywb to serve a ~488 MB WARC file (https://webrecorder.io/layoutanalysis/2015_2016) to [scrapy](https://scrapy.org/)/[splash](http://splash.readthedocs.io/en/stable/), which injects a layout a…
-
The request records in the CC-NEWS WARC files lack the HTTP protocol version:
```
GET /path
```
instead of
```
GET /path HTTP/1.1
```
This makes some WARC parsers fail to process the WARC fil…
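
A tolerant parser is one possible workaround on the consumer side. The following is a minimal sketch and is not taken from any existing WARC library: the names `RequestLine` and `parse_request_line` are hypothetical, and returning `None` for a missing version (rather than guessing one) is an assumption made for illustration.

```python
from typing import NamedTuple, Optional


class RequestLine(NamedTuple):
    method: str
    target: str
    version: Optional[str]  # None when the record omits the protocol version


def parse_request_line(line: str) -> RequestLine:
    """Parse an HTTP request line, tolerating a missing protocol version."""
    parts = line.strip().split()
    if len(parts) == 3 and parts[2].upper().startswith("HTTP/"):
        # Well-formed line such as "GET /path HTTP/1.1".
        return RequestLine(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # Malformed line such as "GET /path": keep method and target,
        # and leave the version unset instead of raising.
        return RequestLine(parts[0], parts[1], None)
    raise ValueError(f"unparseable request line: {line!r}")
```

Leaving the version as `None` lets the caller decide whether to skip the record, log it, or substitute a default, instead of silently inventing a protocol version that the record never stated.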
-
The [WARC/1.1](http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1-1_latestdraft.pdf) spec (Section B.8) gives an example where a response record is segmented into multiple smaller records. This c…
-
Rather than our own `webrender-api`, consider switching to https://github.com/webrecorder/browsertrix-crawler
The integration pattern is somewhat different to Browsertrix's primary use case, but it…