Closed liuchunhua closed 5 days ago
Can we try Velox directly? @rui-mo
@liuchunhua @FelixYBW Sure. I will take a look.
I can reproduce this issue in spark 3.3.1.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 3.3.1
/_/
Using Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 1.8.0_412)
Type in expressions to have them evaluated.
Type :help for more information.
scala> spark.read.parquet("/user/hive/warehouse/bigdata.db/tmp_decimal_col/part-00000-4bb6f91f-c039-493c-8617-10f289fc063e-c000.snappy.parquet").show()
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
|amt_int| amt_bigint|amt_decimal8| amt_decimal16| amt_decimal24| amt_decimal27| amt_decimal32| amt_decimal38|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
| 2322|223243243224| 2343.33|343423423333.12|32423432432432324...|90700136833857145...|32423432432432324...|32423432432432324...|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
The latest code should have fixed the issue. The reason for the error was the incorrect handling of int128_t
parse logic in IntDecoder
for reading the int96 timestamp. This issue would only be triggered by decimal(27,s) and decimal(28,s).
cc @rui-mo @FelixYBW.
@liuchunhua You could pick https://github.com/facebookincubator/velox/commit/da39954c22f1dda1d4b78dae078fb1307fc2508f to replace the original INT96 timestamp reader support in OAP/Velox to fix this issue. Thanks.
thanks @rui-mo the issue was not be fixed by picking https://github.com/facebookincubator/velox/commit/da39954c22f1dda1d4b78dae078fb1307fc2508f
the reason of the error was the incorret handing of INT96 in https://github.com/oap-project/velox/blob/88856e6b139c761e7876b1cd3b29e8dad236d8c7/velox/dwio/common/DirectDecoder.h#L92-L105, the "numBytes" of decimal(27,2) is same as timestamp. I fixed the issue by this code
if ((super::numBytes == 12) && (std::is_same_v<typename Visitor::FilterType, velox::common::TimestampRange>)/* INT96 */) {
// add log
I20241117 10:26:43.812453 75535 PageReader.cpp:55] page type: DATA_PAGE
I20241117 10:26:43.812553 75535 PageReader.cpp:648] encoding: PLAIN page type: FIXED_LEN_BYTE_ARRAY
I20241117 10:26:43.812584 75535 IntDecoder.h:440] userVInts:0bigEndian: 1numBytes:12
I20241117 10:26:43.812595 75535 IntDecoder.h:359] sizeof : 16
I20241117 10:26:43.812602 75535 IntDecoder.h:382] numBytes: 12
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
|amt_int| amt_bigint|amt_decimal8| amt_decimal16| amt_decimal24| amt_decimal27| amt_decimal32| amt_decimal38|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
| 2322|223243243224| 2343.33|343423423333.12|32423432432432324...|32423432432432324...|32423432432432324...|32423432432432324...|
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
the issue was not be fixed by picking facebookincubator/velox@da39954
the reason of the error was the incorret handing of INT96 in https://github.com/oap-project/velox/blob/88856e6b139c761e7876b1cd3b29e8dad236d8c7/velox/dwio/common/DirectDecoder.h#L92-L105, the "numBytes" of decimal(27,2) is same as timestamp.
@liuchunhua Please note that the commit https://github.com/oap-project/velox/commit/07e7fd73c2eeaf9d4b25853ed8fca2d626b0675e you mentioned above is out-of-date in the latest code. As you mentioned, the reason for the incorrect result is the "numBytes" of decimal(27,2) is same as timestamp
, while in the new code the timestamp decoding is moved out from DirectDecoder, so the decoding of INT96 should work for both timestamp and decimal(27, 2).
Backend
VL (Velox)
Bug description
write
使用spark 3.2.4 原生版本生成文件
read
spark3.4.4 + gluten+ velox: spark.read.parquet("/user/hive/warehouse/bigdata.db/tmp_decimal_col/part-00000-b1079bc2-7027-4b3b-99cb-83a187c30883-c000.snappy.parquet").show()
+-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+ |amt_int| amt_bigint|amt_decimal8| amt_decimal16| amt_decimal24| amt_decimal27| amt_decimal32| amt_decimal38| +-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+ | 2322|223243243224| 2343.33|343423423333.12|32423432432432324...|90700136833857145...|32423432432432324...|32423432432432324...| +-------+------------+------------+---------------+--------------------+--------------------+--------------------+--------------------+
gluten version
Backend: Velox Backend Branch: HEAD Backend Revision: 88856e6b139c761e7876b1cd3b29e8dad236d8c7 Backend Revision Time: 2024-08-20 16:45:45 +0800 GCC Version: gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2) Gluten Branch: branch-1.2 Gluten Build Time: 2024-08-21T13:12:51Z Gluten Repo URL: https://github.com/apache/incubator-gluten Gluten Revision: c82af60a635fb52bf1314cd831f5915dd37d910d Gluten Revision Time: 2024-08-21 16:12:39 +0800 Gluten Version: 1.2.0 Hadoop Version: 2.7.4 Java Version: 1.8 Scala Version: 2.12.15 Spark Version: 3.4.2
Spark version
Spark-3.4.x
Spark configurations
No response
System information
No response
Relevant logs
No response