Convert RecordLoader.loadArchives to a Spark Data Source

ruebot commented 4 years ago

Since we're pivoting to full DataFrame support (#223, #190), we should convert/migrate RecordLoader.loadArchives, and any other related functions, to a Spark Data Source. That way we could do things like:

spark.read.format("webArchive")
  .option("mode", "FAILFAST")
  .option("inferSchema", "true")
  .schema(someSchema)
  .load("/path/to/files")

Then, since it's an open issue (#147), we could write WARCs that way too? :man_shrugging:

// df stands for any DataFrame of archive records; writes start from a DataFrame, not the session
df.write.format("webArchive")
  .mode("overwrite")
  .save("/path/to/files")
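
For the read side, a minimal sketch of what such a source might look like using Spark's V1 `org.apache.spark.sql.sources` interfaces. The class names `WebArchiveSource` and `WebArchiveRelation` are hypothetical, the single-column schema is a placeholder, and the scan only lists file paths; the actual record parsing that RecordLoader.loadArchives does today would have to replace it.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Hypothetical provider class; "webArchive" is the short name proposed above.
class WebArchiveSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "webArchive"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // The argument of .load("/path/to/files") arrives as the "path" option.
    val path = parameters.getOrElse("path",
      throw new IllegalArgumentException("a path is required"))
    new WebArchiveRelation(path)(sqlContext)
  }
}

// Hypothetical relation: one row per input file, holding only its path.
class WebArchiveRelation(path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Placeholder schema (assumption); a real source would expose record fields.
  override def schema: StructType =
    StructType(Seq(StructField("path", StringType, nullable = false)))

  // Placeholder scan: lists the files under `path` without parsing them.
  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.binaryFiles(path).keys.map(p => Row(p))
}
```

Without further registration this would be loaded by class name, e.g. `spark.read.format(classOf[WebArchiveSource].getName).load("/path/to/files")`. Using the short name `"webArchive"` additionally requires listing the provider class in a `META-INF/services/org.apache.spark.sql.sources.DataSourceRegister` file on the classpath.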

These are the Spark core data sources:

CSV
JSON
Parquet
ORC
JDBC/ODBC
Plain-text
Avro

Community implemented data sources:

Cassandra
HBase
MongoDB
AWS Redshift
XML

ruebot commented 4 years ago

Some helpful links:

sepastian commented 4 years ago

https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSourceV2.html
https://github.com/spirom/spark-data-sources

ruebot commented 3 years ago

Cassandra example

ruebot commented 2 years ago

I'm thinking this is out of scope for this project given the work being done on #494 now. So, I'm going to close it as won't fix.

archivesunleashed / aut

Convert RecordLoader.loadArchives to a Spark Data Source #371