apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.22k stars 437 forks source link

[VL] gluten1.2 read parquet v1 file has incorrect decimal(27,2) field values #7932

Closed liuchunhua closed 5 days ago

liuchunhua commented 1 week ago

Backend

VL (Velox)

Bug description

write

使用spark 3.2.4 原生版本生成文件

create table bigdata.tmp_decimal_col(
amt_int int,
amt_bigint long,
amt_decimal8 decimal(8, 2),
amt_decimal16 decimal(16, 2),
amt_decimal24 decimal(24,2),
amt_decimal27 decimal(27, 2),
amt_decimal32 decimal(32, 2),
amt_decimal38 decimal(38, 2)
);

insert into bigdata.tmp_decimal_col values(2322, 223243243224, 
2343.33, 343423423333.12, 
3242343243243232423.00,
3242343243243232423.00,
3242343243243232423.00,
3242343243243232423.00);

read

spark3.4.4 + gluten+ velox: spark.read.parquet("/user/hive/warehouse/bigdata.db/tmp_decimal_col/part-00000-b1079bc2-7027-4b3b-99cb-83a187c30883-c000.snappy.parquet").show()

+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+ |amt_int| amt_bigint|amt_decimal8| amt_decimal16| amt_decimal24| amt_decimal27| amt_decimal32| amt_decimal38| +-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+ | 2322|223243243224| 2343.33|343423423333.12|32423432432432324...|90700136833857145...|32423432432432324...|32423432432432324...| +-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+

gluten version

Backend: Velox Backend Branch: HEAD Backend Revision: 88856e6b139c761e7876b1cd3b29e8dad236d8c7 Backend Revision Time: 2024-08-20 16:45:45 +0800 GCC Version: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) Gluten Branch: branch-1.2 Gluten Build Time: 2024-08-21T13:12:51Z Gluten Repo URL: https://github.com/apache/incubator-gluten Gluten Revision: c82af60a635fb52bf1314cd831f5915dd37d910d Gluten Revision Time: 2024-08-21 16:12:39 +0800 Gluten Version: 1.2.0 Hadoop Version: 2.7.4 Java Version: 1.8 Scala Version: 2.12.15 Spark Version: 3.4.2

Spark version

Spark-3.4.x

Spark configurations

No response

System information

No response

Relevant logs

No response

FelixYBW commented 1 week ago

Can we try Velox directly? @rui-mo

rui-mo commented 1 week ago

@liuchunhua @FelixYBW Sure. I will take a look.

liujiayi771 commented 1 week ago

I can reproduce this issue in spark 3.3.1.

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.3.1
      /_/

Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_412)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("/user/hive/warehouse/bigdata.db/tmp_decimal_col/part-00000-4bb6f91f-c039-493c-8617-10f289fc063e-c000.snappy.parquet").show()
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
|amt_int|  amt_bigint|amt_decimal8|  amt_decimal16|       amt_decimal24|       amt_decimal27|       amt_decimal32|       amt_decimal38|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
|   2322|223243243224|     2343.33|343423423333.12|32423432432432324...|90700136833857145...|32423432432432324...|32423432432432324...|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
liujiayi771 commented 1 week ago

The latest code should have fixed the issue. The reason for the error was the incorrect handling of int128_t parse logic in IntDecoder for reading the int96 timestamp. This issue would only be triggered by decimal(27,s) and decimal(28,s). cc @rui-mo @FelixYBW.

rui-mo commented 1 week ago

@liuchunhua You could pick https://github.com/facebookincubator/velox/commit/da39954c22f1dda1d4b78dae078fb1307fc2508f to replace the original INT96 timestamp reader support in OAP/Velox to fix this issue. Thanks.

liuchunhua commented 5 days ago

thanks @rui-mo the issue was not be fixed by picking https://github.com/facebookincubator/velox/commit/da39954c22f1dda1d4b78dae078fb1307fc2508f

the reason of the error was the incorret handing of INT96 in https://github.com/oap-project/velox/blob/88856e6b139c761e7876b1cd3b29e8dad236d8c7/velox/dwio/common/DirectDecoder.h#L92-L105, the "numBytes" of decimal(27,2) is same as timestamp. I fixed the issue by this code

        if ((super::numBytes == 12) && (std::is_same_v<typename Visitor::FilterType, velox::common::TimestampRange>)/* INT96 */) {
// add log
I20241117 10:26:43.812453 75535 PageReader.cpp:55] page type: DATA_PAGE
I20241117 10:26:43.812553 75535 PageReader.cpp:648] encoding: PLAIN page type: FIXED_LEN_BYTE_ARRAY
I20241117 10:26:43.812584 75535 IntDecoder.h:440] userVInts:0bigEndian: 1numBytes:12
I20241117 10:26:43.812595 75535 IntDecoder.h:359] sizeof : 16
I20241117 10:26:43.812602 75535 IntDecoder.h:382] numBytes: 12

+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
|amt_int|  amt_bigint|amt_decimal8|  amt_decimal16|       amt_decimal24|       amt_decimal27|       amt_decimal32|       amt_decimal38|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
|   2322|223243243224|     2343.33|343423423333.12|32423432432432324...|32423432432432324...|32423432432432324...|32423432432432324...|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
rui-mo commented 4 days ago

the issue was not be fixed by picking facebookincubator/velox@da39954

the reason of the error was the incorret handing of INT96 in https://github.com/oap-project/velox/blob/88856e6b139c761e7876b1cd3b29e8dad236d8c7/velox/dwio/common/DirectDecoder.h#L92-L105, the "numBytes" of decimal(27,2) is same as timestamp.

@liuchunhua Please note that the commit https://github.com/oap-project/velox/commit/07e7fd73c2eeaf9d4b25853ed8fca2d626b0675e you mentioned above is out-of-date in the latest code. As you mentioned, the reason for the incorrect result is the "numBytes" of decimal(27,2) is same as timestamp, while in the new code the timestamp decoding is moved out from DirectDecoder, so the decoding of INT96 should work for both timestamp and decimal(27, 2).