WebCuratorTool / webcurator

The root of the webcurator tool project, containing all modules needed to run a fully functional webcurator tool.
Apache License 2.0

Support ARC format in visualisation tool #136

Open hannakoppelaar opened 2 months ago

hannakoppelaar commented 2 months ago

At the KB we still have a large collection of ARC files that we would like to be able to inspect using the visualisation tool. Currently, the tool only indexes files with .warc and .warc.gz suffixes.

Supporting the ARC format should be possible without too much effort. It would entail widening the file filter criteria and making IndexProcessorWarc.java more general. There is already an IndexProcessorArc.java, but it is not being used, and it is probably not necessary to differentiate between WARC and ARC at that level.
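To illustrate what widening the file filter criteria could look like, here is a minimal sketch. It does not reflect the tool's actual code: the class `ArchiveFileFilter` and the method `isIndexable` are hypothetical names, and both the `.arc` / `.arc.gz` suffixes and the case-insensitive matching are assumptions.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the webcurator code base.
public final class ArchiveFileFilter {
    // Suffixes the tool indexes today, according to this issue.
    private static final List<String> WARC_SUFFIXES = List.of(".warc", ".warc.gz");
    // Suffixes proposed here (assumed naming for ARC files).
    private static final List<String> ARC_SUFFIXES = List.of(".arc", ".arc.gz");

    private ArchiveFileFilter() {
    }

    // Returns true if the file name ends with a WARC or ARC suffix.
    // Lower-casing first makes the match case-insensitive (an assumption).
    public static boolean isIndexable(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        return hasAnySuffix(lower, WARC_SUFFIXES) || hasAnySuffix(lower, ARC_SUFFIXES);
    }

    private static boolean hasAnySuffix(String name, List<String> suffixes) {
        return suffixes.stream().anyMatch(name::endsWith);
    }
}
```

With a filter along these lines, a single generalised index processor could take both formats from the same directory scan and decide per record how to parse it.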