archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Replace Java ARC/WARC record processing library #494

Closed ruebot closed 2 years ago

ruebot commented 4 years ago

Is your feature request related to a problem? Please describe.

A number of issues have cropped up over the years with how we process ARC and WARC records before handing them off to Spark. Namely #317, #492, and #493.

Describe the solution you'd like

Write a new Scala library to handle processing ARC and WARC records. This can be part of aut or a standalone library, or we can use or build upon @helgeho's sparkling.
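To make the scope concrete, here is a minimal sketch of the core loop such a reader needs, written in Python for brevity rather than Scala. It is an illustration only: `read_warc_records` is a hypothetical name, not part of aut, sparkling, or jwarc, and it assumes an uncompressed WARC stream with one header per line. Real WARC files are usually gzipped per record, and ARC records use a different header layout; neither is handled here.

```python
import io
from typing import BinaryIO, Dict, Iterator, Tuple


def read_warc_records(stream: BinaryIO) -> Iterator[Tuple[Dict[str, str], bytes]]:
    """Yield (headers, payload) for each record in an uncompressed WARC stream.

    Hypothetical helper for illustration; header names are lowercased.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if line.strip() == b"":
            continue  # skip the blank lines that separate records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected a WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line that ends them.
        headers: Dict[str, str] = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()

        # Content-Length says exactly how many payload bytes follow.
        length = int(headers["content-length"])
        payload = stream.read(length)
        if len(payload) != length:
            raise ValueError("truncated record")
        yield headers, payload


# Two tiny hand-written records, used only to exercise the sketch.
sample = (
    b"WARC/1.0\r\n"
    b"WARC-Type: response\r\n"
    b"WARC-Target-URI: http://example.com/\r\n"
    b"Content-Length: 5\r\n"
    b"\r\n"
    b"hello"
    b"\r\n\r\n"
    b"WARC/1.0\r\n"
    b"WARC-Type: metadata\r\n"
    b"Content-Length: 0\r\n"
    b"\r\n"
    b"\r\n\r\n"
)
records = list(read_warc_records(io.BytesIO(sample)))
```

Relying on `Content-Length` to delimit payloads keeps the loop simple, but it also means a record with a wrong or missing length breaks the whole stream, so a real library would need a recovery strategy for malformed records.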

Describe alternatives you've considered

Fixing and patching what we have now, and potentially jwarc (#411).

Additional context

Implementing this as a Spark data source could also lead to addressing #371 completely. From the Spark dev list, I believe there is an example of implementing Cassandra as a data source that we could potentially build on.

lintool commented 4 years ago

FWIW, Common Crawl seems to use the ClueWeb WARC readers: https://github.com/commoncrawl/example-warc-java/tree/master/src/main/java

These are also the ones used in Anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb09Collection.java

My impression is that these readers are much more impoverished in terms of features... but may be much faster?