ruebot closed this issue 2 years ago
FWIW, Common Crawl seems to use the ClueWeb WARC readers https://github.com/commoncrawl/example-warc-java/tree/master/src/main/java
These are also the ones used in Anserini: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/collection/ClueWeb09Collection.java
My impression is that these readers are much more impoverished in terms of features... but may be much faster?
Is your feature request related to a problem? Please describe.
We have a number of issues that have crept up over the years with how we process ARC and WARC records to hand off to Spark for processing, namely #317, #492, and #493.
Describe the solution you'd like
Write a new Scala library to handle processing ARC and WARC. This can be part of aut or a standalone library, or we can use/build upon @helgeho's sparkling.
Describe alternatives you've considered
Fixing and patching what we have now, and potentially jwarc (#411).
Additional context
Implementing this as a data source could also lead to addressing #371 completely. From the Spark dev list, I believe this is an example of implementing Cassandra as a data source that we can potentially build off of.
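For a rough sense of scale, here is a minimal sketch of what the record-reading core of such a Scala library might look like. It assumes uncompressed WARC 1.0 input (real files are usually gzipped per record), ignores ARC entirely, and does not handle folded header lines. `WarcRecord` and `MiniWarcReader` are hypothetical names for illustration only; they are not part of aut, sparkling, or jwarc.

```scala
import java.io.{ByteArrayOutputStream, DataInputStream, InputStream}
import java.nio.charset.StandardCharsets

// Hypothetical record type: version line, header fields (names lowercased), raw content block.
final case class WarcRecord(version: String, headers: Map[String, String], payload: Array[Byte])

object MiniWarcReader {
  // Read one line terminated by LF (a trailing CR is stripped); None at end of stream.
  private def readLine(in: InputStream): Option[String] = {
    val first = in.read()
    if (first == -1) None
    else {
      val buf = new ByteArrayOutputStream()
      var b = first
      while (b != -1 && b != '\n') { buf.write(b); b = in.read() }
      Some(new String(buf.toByteArray, StandardCharsets.UTF_8).stripSuffix("\r"))
    }
  }

  // Read a single record, or None if the stream is exhausted.
  def readRecord(in: DataInputStream): Option[WarcRecord] = {
    // Skip the blank lines that separate consecutive records.
    var line = readLine(in)
    while (line.exists(_.isEmpty)) line = readLine(in)

    line.map { version =>
      require(version.startsWith("WARC/"), s"not a WARC record: $version")

      // Header fields run until the first empty line, which takeWhile consumes.
      val headers = Iterator.continually(readLine(in))
        .takeWhile(_.exists(_.nonEmpty))
        .collect { case Some(h) => h }
        .map { h =>
          val i = h.indexOf(':')
          require(i > 0, s"malformed header line: $h")
          h.take(i).trim.toLowerCase -> h.drop(i + 1).trim
        }
        .toMap

      // Content-Length gives the exact size of the content block in bytes.
      // toInt is a simplification; very large records would need a Long.
      val length = headers.getOrElse("content-length",
        throw new IllegalArgumentException("missing Content-Length")).toInt
      val payload = new Array[Byte](length)
      in.readFully(payload)
      WarcRecord(version, headers, payload)
    }
  }

  // Lazily iterate over every record in the stream.
  def readAll(in: InputStream): Iterator[WarcRecord] = {
    val data = new DataInputStream(in)
    Iterator.continually(readRecord(data))
      .takeWhile(_.isDefined)
      .collect { case Some(r) => r }
  }
}
```

Something like `MiniWarcReader.readAll(stream).filter(_.headers.get("warc-type").contains("response"))` would then yield only response records. Wrapping an iterator like this in a Spark data source is a separate step and is not shown here.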