We found an issue where the Parquet filter pushdown modifies the Hadoop Configuration, which is shared among multiple threads. That modification creates a race condition between the threads and can cause the wrong filter conditions to be applied to the wrong files. When those files have different schemas, specifically INT96 vs. timestamp types in Parquet, it causes failures like:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: FilterPredicate column: clientTs's declared type (java.lang.Long) does not match the schema found in file metadata. Column clientTs is of type: INT96
Valid types for this column are: [class org.apache.parquet.io.api.Binary]
The fix is to make a copy of the Hadoop Configuration before it is modified.
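The copy-before-modify pattern can be sketched roughly as follows. This is a minimal illustration and not the actual change in this PR: a plain `HashMap` stands in for the Hadoop `Configuration` so the snippet runs without Hadoop on the classpath, and the class name, method names, and `example.filter.predicate` key are all hypothetical. With the real class, the equivalent copy would be made with the `Configuration` copy constructor, `new Configuration(conf)`.

```java
import java.util.HashMap;
import java.util.Map;

public class ConfCopySketch {
    // Hypothetical key: stands in for wherever the pushed-down predicate is stored.
    static final String FILTER_KEY = "example.filter.predicate";

    // Unsafe variant: writes the per-file filter into the shared object,
    // so another thread reading a different file can observe this filter.
    static Map<String, String> withFilterShared(Map<String, String> shared, String filter) {
        shared.put(FILTER_KEY, filter);
        return shared;
    }

    // Safe variant: copy first, then modify only the private copy.
    static Map<String, String> withFilterCopied(Map<String, String> shared, String filter) {
        Map<String, String> local = new HashMap<>(shared);
        local.put(FILTER_KEY, filter);
        return local;
    }

    public static void main(String[] args) {
        Map<String, String> shared = new HashMap<>();
        shared.put("some.setting", "value");

        // Each file gets its own copy, so the two filters cannot overwrite each other.
        Map<String, String> forInt96File = withFilterCopied(shared, "clientTs as INT96");
        Map<String, String> forInt64File = withFilterCopied(shared, "clientTs as INT64");

        System.out.println(forInt96File.get(FILTER_KEY));    // clientTs as INT96
        System.out.println(forInt64File.get(FILTER_KEY));    // clientTs as INT64
        System.out.println(shared.containsKey(FILTER_KEY));  // false
    }
}
```

The shared object is only ever read, and every write lands on a copy owned by one reader, so a filter built for one file's schema cannot be picked up while another file is being read.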
I checked ORC and CSV and didn't see the same issue there. I manually tested the customer query that was failing, and this fixed it. I also wrote a unit test that uses one file with an INT96 column and one with the same column as int64 (timestamp micros), with a filter condition on that column, which reproduces the issue and the race condition. Without the fix to create the new Configuration object, the tests fail.
fixes https://github.com/NVIDIA/spark-rapids/issues/11622