Closed: ruebot closed this issue 2 years ago
Since we're pivoting to full DataFrame support (#223, #190), we should convert/migrate `RecordLoader.loadArchives`, and any other related functions, to a Spark Data Source. That way we could do things like:
spark.read.format("webArchive").option("mode", "FAILFAST").option("inferSchema", "true").schema(someSchema).load("/path/to/files")
Then, since #147 is still an open issue, we could write WARCs that way too? :man_shrugging:
df.write.format("webArchive").mode("overwrite").save("/path/to/files")
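For reference, here is a rough sketch of what backing a `webArchive` format with Spark's Data Source V1 API might look like. Everything in it is an assumption for illustration: the `WebArchiveRelation` class name, the two-column schema, and the empty scan are hypothetical placeholders, not anything that exists in this project. A real implementation would presumably build its rows from what `RecordLoader.loadArchives` already produces.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, DataSourceRegister, RelationProvider, TableScan}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Entry point Spark looks up for format("webArchive").
// The short name only resolves if this class is listed in
// META-INF/services/org.apache.spark.sql.sources.DataSourceRegister;
// otherwise the package containing DefaultSource must be passed to format().
class DefaultSource extends RelationProvider with DataSourceRegister {
  override def shortName(): String = "webArchive"

  override def createRelation(
      sqlContext: SQLContext,
      parameters: Map[String, String]): BaseRelation = {
    // load("/path/to/files") arrives here as the "path" option.
    val path = parameters.getOrElse(
      "path", throw new IllegalArgumentException("'path' is required"))
    new WebArchiveRelation(path)(sqlContext)
  }
}

// Hypothetical relation: the schema and scan below are placeholders.
class WebArchiveRelation(val path: String)(@transient val sqlContext: SQLContext)
    extends BaseRelation with TableScan {

  // Assumed columns, chosen only to make the sketch concrete.
  override def schema: StructType = StructType(Seq(
    StructField("url", StringType, nullable = true),
    StructField("content", StringType, nullable = true)))

  // Placeholder scan that returns no rows; a real version would
  // map the archive records found under `path` to Row(url, content).
  override def buildScan(): RDD[Row] = sqlContext.sparkContext.emptyRDD[Row]
}
```

Writing WARCs through `df.write` would additionally need something like `CreatableRelationProvider` on the same class. The newer DataSourceV2 API covered in the links below is another option, but its interfaces have changed between Spark releases.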
These are the Spark core data sources:
Community implemented data sources:
Some helpful links:
- https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-DataSourceV2.html
- https://github.com/spirom/spark-data-sources
- Cassandra example
I'm thinking this is out of scope for this project given the work being done on #494 now. So, I'm going to close it as won't fix.