AbsaOSS / spline-spark-agent

Spline agent for Apache Spark
https://absaoss.github.io/spline/
Apache License 2.0

Can't track org.apache.hadoop.fs.rename #760

Open vinhnemo opened 1 year ago

vinhnemo commented 1 year ago

Hi Folks,

Has anyone run into trouble when a Spark job includes many write and file rename operations (org.apache.hadoop.fs.rename)? Spline does not track the renames, so the resulting lineage is incorrect. Please help if you have faced this.

Context:

My case:

write('hdfs://abc/tmp/123');
write('hdfs://xyz/tmp/123');
write('hdfs://asd/tmp/123');
rename('hdfs://abc/tmp/123','hdfs://abc/123');
rename('hdfs://xyz/tmp/123','hdfs://xyz/123');
rename('hdfs://asd/tmp/123','hdfs://asd/123');

My current approach is to implement a mapping job that uses the Hadoop audit logs (which contain the `org.apache.hadoop.fs.rename` operations) to correct Spline's `write`/`read` operators.
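
To make the mapping idea concrete, here is a minimal sketch of how such a job might look. It is not part of Spline. It assumes the audit log lines are tab-separated `key=value` records with `allowed`, `cmd`, `src` and `dst` fields, and that the paths in the log carry no `hdfs://` authority, so the caller supplies one per NameNode. The function names `parse_renames` and `resolve`, and the sample log lines, are made up for illustration.

```python
import re

# Assumption: each audit record holds tab-separated key=value fields,
# e.g. "... allowed=true<TAB>cmd=rename<TAB>src=/tmp/123<TAB>dst=/123".
FIELD = re.compile(r"(\w+)=([^\t]*)")


def parse_renames(audit_lines, authority):
    """Collect {source URI: destination URI} for allowed rename records.

    `authority` (e.g. "hdfs://abc") is prepended because the audit log is
    assumed to contain bare paths without a scheme or host.
    """
    renames = {}
    for line in audit_lines:
        fields = dict(FIELD.findall(line))
        if fields.get("cmd") != "rename" or fields.get("allowed") != "true":
            continue
        src, dst = fields.get("src"), fields.get("dst")
        if src and dst:
            renames[authority + src.strip()] = authority + dst.strip()
    return renames


def resolve(path, renames):
    """Follow rename chains until the final location of `path` is reached."""
    seen = set()
    while path in renames and path not in seen:
        seen.add(path)  # guard against rename cycles (a -> b -> a)
        path = renames[path]
    return path


# Hypothetical sample records, shaped like the case in this issue.
sample_log = [
    "2023-01-01 10:00:00,000 INFO FSNamesystem.audit: allowed=true\tugi=spark\t"
    "ip=/10.0.0.1\tcmd=create\tsrc=/tmp/123/part-0\tdst=null\tperm=spark:hadoop:rw-r--r--",
    "2023-01-01 10:00:05,000 INFO FSNamesystem.audit: allowed=true\tugi=spark\t"
    "ip=/10.0.0.1\tcmd=rename\tsrc=/tmp/123\tdst=/123\tperm=spark:hadoop:rwxr-xr-x",
]

renames = parse_renames(sample_log, "hdfs://abc")
corrected = resolve("hdfs://abc/tmp/123", renames)
```

The write target that Spline reports (`hdfs://abc/tmp/123`) is looked up in the rename map and replaced by its final location. This sketch only matches exact paths; renames of a parent directory, or ordering of renames against later writes by timestamp, would need extra handling.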