AbsaOSS / hyperdrive

Extensible streaming ingestion pipeline on top of Apache Spark
Apache License 2.0

Write tool that fixes paths in spark_metadata #235

Closed kevinwallimann closed 3 years ago

kevinwallimann commented 3 years ago

Background The _spark_metadata folder contains absolute paths to the parquet files. Therefore, when the folder is moved (e.g. to S3), these paths become invalid. A log file can look like this:

v1
{"path":"hdfs://a/b/c/HyperDrive/topic.name/info_date=2021-08-20/info_version=1/part-00000-723bc214-e3b5-420a-9089-a59f1fd238cd-c000.snappy.parquet","size":383248,"isDir":false,"modificationTime":1619529808057,"blockReplication":2,"blockSize":134217728,"action":"add"}

Proposed solution A tool should be written (possibly independent of Hyperdrive) that fixes the paths in the _spark_metadata folder based on the folder's own location. In the above example, if the folder sits at s3://some_bucket/folderA/_spark_metadata, then the log file should be changed to:

v1
{"path":"s3://some_bucket/folderA/info_date=2021-08-20/info_version=1/part-00000-723bc214-e3b5-420a-9089-a59f1fd238cd-c000.snappy.parquet","size":383248,"isDir":false,"modificationTime":1619529808057,"blockReplication":2,"blockSize":134217728,"action":"add"}
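The core of such a tool is a prefix rewrite on the "path" field of each JSON entry, leaving the version header and all other fields untouched. A minimal sketch of that rewrite, in Python for illustration (the `old_base`/`new_base` parameters are assumptions here; a real tool would derive the new base from where the _spark_metadata folder actually sits, as described above):

```python
import json

def fix_paths(lines, old_base, new_base):
    """Rewrite absolute 'path' entries in one _spark_metadata log file.

    lines    -- raw lines of the log file (version header + JSON entries)
    old_base -- stale prefix recorded by the original streaming query
    new_base -- prefix of the folder's current location (hypothetical
                parameter; the real tool would infer this itself)
    """
    fixed = []
    for line in lines:
        if not line.startswith("{"):
            # Keep non-JSON lines (e.g. the "v1" version header) as-is.
            fixed.append(line)
            continue
        entry = json.loads(line)
        path = entry.get("path", "")
        if path.startswith(old_base):
            entry["path"] = new_base + path[len(old_base):]
        fixed.append(json.dumps(entry))
    return fixed
```

Applied to the log line above with `old_base="hdfs://a/b/c/HyperDrive/topic.name"` and `new_base="s3://some_bucket/folderA"`, this would produce the corrected entry shown. Note this sketch handles only v1 plain-JSON logs; compacted logs and other metadata versions would need additional handling.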

Hint

Zejnilovic commented 3 years ago

Can we have it as a standalone repository and tool?

kevinwallimann commented 3 years ago

Closing this issue. See https://github.com/AbsaOSS/spark-metadata-tool