archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0
137 stars, 33 forks

Convert RecordLoader.loadArchives to a Spark Data Source #371

Closed ruebot closed 2 years ago

ruebot commented 4 years ago

Since we're pivoting to full DataFrame support (#223, #190), we should migrate RecordLoader.loadArchives, and any related functions, to a Spark Data Source. That way we could do things like:

spark.read.format("webArchive")
  .option("mode", "FAILFAST")
  .option("inferSchema", "true")
  .schema(someSchema)
  .load("/path/to/files")
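For reference, here is a minimal sketch of what the provider side of such a data source could look like with Spark's V1 `org.apache.spark.sql.sources` API. The package name `webarchive`, the class `WebArchiveRelation`, and the single-column schema are assumptions made for illustration and are not part of the toolkit. The scan only lists file names with `binaryFiles`, as a stand-in for the place where `RecordLoader.loadArchives` would be called.

```scala
package webarchive // hypothetical package name

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Spark resolves format("webarchive") to the class webarchive.DefaultSource.
class DefaultSource extends RelationProvider with DataSourceRegister {

  // The short name only works once this class is listed in
  // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister.
  override def shortName(): String = "webArchive"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // The argument of load("/path/to/files") arrives as the "path" option.
    val path = parameters.getOrElse(
      "path",
      throw new IllegalArgumentException("webArchive source requires a path"))
    new WebArchiveRelation(path)(sqlContext)
  }
}

// Hypothetical relation: one row per file under `path`.
class WebArchiveRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Placeholder schema; a real source would expose record fields instead.
  override def schema: StructType =
    StructType(Seq(StructField("filename", StringType, nullable = false)))

  // Stand-in scan: a real implementation would parse archive records here.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.binaryFiles(path).map { case (name, _) => Row(name) }
}
```

With this on the classpath, `spark.read.format("webarchive").load("/path/to/files")` should return a DataFrame with one `filename` column. To accept a user-supplied `.schema(someSchema)`, the provider would also need to implement `SchemaRelationProvider`.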

Then we could (since it's an open issue, #147) write WARCs that way too? :man_shrugging:

// write is reached from a DataFrame (here a hypothetical df), not from the SparkSession
df.write.format("webArchive")
  .mode("overwrite")
  .save("/path/to/files")

These are the Spark core data sources:

Community-implemented data sources:

ruebot commented 4 years ago

Some helpful links:

sepastian commented 4 years ago

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSourceV2.html
https://github.com/spirom/spark-data-sources

ruebot commented 3 years ago

Cassandra example

ruebot commented 2 years ago

I'm thinking this is out of scope for this project given the work being done on #494 now. So, I'm going to close it as won't fix.