TL;DR:
When running the plugin with Spark 4+, if a Parquet file is being read with a read-schema that contains wider types than the Parquet file's schema, the read should not fail.
Details:
This is with reference to https://github.com/apache/spark/pull/44368. Spark 4 has the ability to read Parquet files where the read-schema uses wider types than the write-schema in the file.
For instance, a Parquet file with an Integer column a should be readable with a read-schema that defines a as having a type Long.
Prior to Spark 4, this would yield a `SchemaColumnConvertNotSupportedException` on both Apache Spark and the plugin. After https://github.com/apache/spark/pull/44368, if the read-schema uses a wider, compatible type, there is an implicit conversion to the wider data type during the read. An incompatible type continues to fail as before.
spark-rapids's parquet_test.py::test_parquet_check_schema_compatibility integration test currently looks as follows:
Spark 4's change in behaviour causes this test to fail thus:
"""
> with pytest.raises(Exception) as excinfo:
E Failed: DID NOT RAISE <class 'Exception'>
../../../../integration_tests/src/main/python/asserts.py:650: Failed