Closed zerodarkzone closed 9 months ago
It seems like the API InternalRow.setLong()
doesn't work in certain cases. Do you have any settings related to the parquet reader when reading from Delta 2.4? If there is a small repro case, it would help us debug the issue.
Hi, I'm going to try to set up a small repro to replicate the issue. For now, these are some of the Parquet-related configurations we are using:
spark.sql.legacy.timeParserPolicy: LEGACY
spark.sql.parquet.int96RebaseModeInRead: CORRECTED
spark.sql.parquet.int96RebaseModeInWrite: CORRECTED
spark.sql.parquet.datetimeRebaseModeInWrite: CORRECTED
spark.sql.parquet.datetimeRebaseModeInRead: CORRECTED
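For reference, a sketch of how these settings could be applied when building the session (equivalent `--conf` flags on spark-submit would work the same way; the session-builder form shown here is an assumption about how the job is launched):

```python
from pyspark.sql import SparkSession

# Session-level sketch of the reported configuration.
spark = (
    SparkSession.builder
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .config("spark.sql.parquet.int96RebaseModeInRead", "CORRECTED")
    .config("spark.sql.parquet.int96RebaseModeInWrite", "CORRECTED")
    .config("spark.sql.parquet.datetimeRebaseModeInWrite", "CORRECTED")
    .config("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
    .getOrCreate()
)
```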
@zerodarkzone do you have spark.sql.parquet.enableVectorizedReader
disabled? Also, what are the data types of columns in your table?
I don't have it disabled. We have integers, strings, timestamps, dates, some decimals, and two columns in particular that are arrays of structs containing decimals and strings.
@andreaschat-db was able to repro this issue. It happens for wide tables with more than 100 columns. When reading more than 100 columns, Spark's code generator decides (1, 2) not to use codegen. Without codegen, Spark sets options to get rows instead of columnar batches from the Parquet reader. This causes the vectorized Parquet reader to return a row abstraction over each column in the columnar batch, and that row abstraction doesn't allow modification of its contents.
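A minimal repro sketch along these lines (the path, column names, and helper names are hypothetical; the pyspark/delta-spark imports are kept inside the function so the pure-Python helper stays usable without Spark installed):

```python
def wide_column_names(n):
    # Column names c0..c(n-1); 101 columns crosses the default
    # spark.sql.codegen.maxFields threshold of 100.
    return [f"c{i}" for i in range(n)]


def write_and_read_wide_delta_table(path, num_cols=101):
    # Assumes pyspark and delta-spark are installed and that deletion
    # vectors are enabled on the table (as in the reported setup).
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder
        .config("spark.sql.extensions",
                "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )
    # Build a table wider than the codegen threshold.
    df = spark.range(10).select(
        *[F.lit(i).alias(name)
          for i, name in enumerate(wide_column_names(num_cols))]
    )
    df.write.format("delta").save(path)
    # Reading the wide table back (with deletion vectors present) is
    # where the UnsupportedOperationException was observed.
    spark.read.format("delta").load(path).show()
```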
I will be posting a fix shortly. A couple of workarounds:
1) Disable the vectorized Parquet reader: spark.sql.parquet.enableVectorizedReader=false
2) Set the codegen table-width threshold to a high number (depending upon the number of columns in your table): spark.sql.codegen.maxFields (default value is 100).
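A sketch of applying either workaround at session creation (the threshold value 200 is an example, chosen only to sit above the table's column count; in practice you would pick one of the two options, not both):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Workaround 1: fall back to the non-vectorized Parquet reader.
    .config("spark.sql.parquet.enableVectorizedReader", "false")
    # Workaround 2 (alternative): raise the codegen field threshold
    # above the table's column count so whole-stage codegen stays on.
    .config("spark.sql.codegen.maxFields", "200")
    .getOrCreate()
)
```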
Bug
Which Delta project/connector is this regarding?
Describe the problem
I have a Delta table with deletion vectors and the following features enabled, written by Databricks (Databricks-Runtime/13.3.x-photon-scala2.12)
When I try to read it using PySpark 3.4.1 with Delta Lake 2.4.0, which I think is compliant with the requested reader protocol version, I'm getting a
java.lang.UnsupportedOperationException
error.
Steps to reproduce
Observed results
When trying to do a simple "show" command, it responds with the following error:
Expected results
It should show the content of the table.
Environment information
Willingness to contribute
The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?