apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

Comet cannot read decimals with physical type BINARY #567

Open comphead opened 3 months ago

comphead commented 3 months ago

Describe the bug

A user reported that Comet crashes with the following error when reading a Parquet file:

  Column: [price], Expected: decimal(15,2), Found: BINARY.

The column metadata in the Parquet file is:

  optional binary price (DECIMAL(15,2));

Spark without Comet reads the same file without issues.
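
For context, the Parquet format allows a DECIMAL logical type to annotate the BINARY physical type, in which case the payload is the unscaled integer value in big-endian two's-complement form. A minimal Python sketch of that encoding (illustrative only, not Comet or Spark code):

```python
from decimal import Decimal

def encode_binary_decimal(value: Decimal, scale: int) -> bytes:
    """Encode a decimal as Parquet's BINARY-backed DECIMAL representation:
    the unscaled integer in big-endian two's-complement form."""
    unscaled = int(value.scaleb(scale))  # e.g. 1234.56 at scale 2 -> 123456
    # Use the smallest byte length that still leaves room for the sign bit.
    length = max(1, (unscaled.bit_length() + 8) // 8)
    return unscaled.to_bytes(length, "big", signed=True)

# A decimal(15,2) value such as 1234.56 becomes the 3-byte array 01 E2 40.
```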

Steps to reproduce

No response

Expected behavior

Comet should read the value, as Spark does without Comet.

Additional context

No response

parthchandra commented 3 months ago

I'll look into this @comphead.

comphead commented 3 months ago

Thanks @parthchandra. The issue is likely in org.apache.comet.parquet.TypeUtil.checkParquetType, when deriving the decimal type.

parthchandra commented 3 months ago

Update on this: the Spark vectorized reader also throws the same error, so users have to turn off vectorized reading to read such files. It is also nearly impossible to write a binary decimal field (as opposed to a fixed-length byte-array field) using Spark; one has to use the Parquet writer directly or some other project (Avro, for example) to write such fields. Comet, like the Spark vectorized reader, has no implementation for decoding a binary decimal field. It should be possible to implement, but I'm wondering if this is a niche case. @comphead
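
For reference, the decode step itself is mechanically simple: per the Parquet spec, the BINARY payload is the unscaled value in big-endian two's-complement, which just needs the column's scale applied. A hedged Python sketch (illustrative only, not the Comet or Spark reader code):

```python
from decimal import Decimal

def decode_binary_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet BINARY-backed DECIMAL: interpret the bytes as a
    big-endian two's-complement unscaled integer, then apply the scale."""
    unscaled = int.from_bytes(raw, "big", signed=True)
    return Decimal(unscaled).scaleb(-scale)

# For a decimal(15,2) column, the bytes 01 E2 40 decode to 1234.56.
```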

andygrove commented 1 week ago

@comphead @parthchandra can we close this issue?

comphead commented 1 week ago

Well, the issue still exists; however, it's related to older Parquet layouts where a Decimal is represented as BINARY. We should probably mention in the docs that this kind of conversion is not supported.

parthchandra commented 1 week ago

Yes, let's close this. We can revisit this if more people report it.