apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet

Incorrect behavior when reading unsigned integers from Parquet #1067

Closed: andygrove closed this issue 2 days ago

andygrove commented 1 week ago

Describe the bug

In makeParquetFileAllTypes we create some columns with unsigned types:

  optional int32                    _9(UINT_8);
  optional int32                    _10(UINT_16);
  optional int32                    _11(UINT_32);
  optional int64                    _12(UINT_64);
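For reference, vanilla Spark widens these unsigned logical types on read: UINT_8 becomes smallint, UINT_16 becomes int, UINT_32 becomes bigint, and UINT_64 becomes decimal(20,0); that mapping is where the smallint in the result diff below comes from. A minimal sketch for inspecting the read schema, assuming a file written by makeParquetFileAllTypes at a placeholder path:

  import org.apache.spark.sql.SparkSession

  // Sketch only: inspect how Spark maps the unsigned Parquet columns on read.
  // "/tmp/test.parquet" is a placeholder for a file produced by makeParquetFileAllTypes.
  val spark = SparkSession.builder().master("local[1]").getOrCreate()
  spark.read.parquet("/tmp/test.parquet")
    .select("_9", "_10", "_11", "_12")
    .printSchema()
  // Expected with vanilla Spark: _9: short, _10: integer, _11: long, _12: decimal(20,0)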

In existing tests, we have TODO comments and skip testing these columns. For example, in CometExpressionSuite we have:

  test("round") {
    ...
        withParquetTable(path.toString, "tbl") {
          for (s <- Seq(-5, -1, 0, 1, 5, -1000, 1000, -323, -308, 308, -15, 15, -16, 16, null)) {
            // array tests
            // TODO: enable test for unsigned ints (_9, _10, _11, _12)
            // TODO: enable test for floats (_6, _7, _8, _13)
            for (c <- Seq(2, 3, 4, 5, 9, 15, 16, 17)) {
              checkSparkAnswerAndOperator(s"select _${c}, round(_${c}, ${s}) FROM tbl")
            }

I tried enabling these tests and saw that they produce incorrect results.
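For example, the first mismatched rows below come from the _9 (UINT_8) column with rounding scale -5, so the generated check is along the lines of:

  // Reconstructed from the loop above for c = 9, s = -5 (the case shown in the diff below)
  checkSparkAnswerAndOperator("select _9, round(_9, -5) FROM tbl")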

== Results ==
!== Correct Answer - 356 ==                   == Spark Answer - 356 ==
 struct<_9:smallint,round(_9, -5):smallint>   struct<_9:smallint,round(_9, -5):smallint>
![-1,0]                                       [0,0]
![-100,0]                                     [0,0]
![-103,0]                                     [0,0]
![-105,0]                                     [0,0]
...

Steps to reproduce

No response

Expected behavior

No response

Additional context

No response

kazuyukitanimura commented 6 days ago

Related https://github.com/apache/datafusion-comet/blob/main/spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala#L122

  test("basic data type support") {
    Seq(true, false).foreach { dictionaryEnabled =>
      withTempDir { dir =>
        val path = new Path(dir.toURI.toString, "test.parquet")
        makeParquetFileAllTypes(path, dictionaryEnabled = dictionaryEnabled, 10000)
        withParquetTable(path.toString, "tbl") {
          // TODO: enable test for unsigned ints
          checkSparkAnswerAndOperator(
            "select _1, _2, _3, _4, _5, _6, _7, _8, _13, _14, _15, _16, _17, " +
              "_18, _19, _20 FROM tbl WHERE _2 > 100")
        }
      }
    }
  }
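Enabling the unsigned columns in that test would presumably just mean adding _9 through _12 to the projection; a sketch only, and this is the kind of change that currently produces the wrong results described above:

  // Sketch: same query with the unsigned-int columns (_9.._12) added to the projection
  checkSparkAnswerAndOperator(
    "select _1, _2, _3, _4, _5, _6, _7, _8, _9, _10, _11, _12, _13, _14, _15, _16, _17, " +
      "_18, _19, _20 FROM tbl WHERE _2 > 100")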