j7nhai opened this issue 3 months ago
Can you provide some debugging suggestions? @liujiayi771
@j7nhai This is not related to the Iceberg reader; rather, Velox is unable to handle your parquet file. Which engine wrote the Iceberg table? I have encountered this issue when reading timestamps written by StarRocks, where the parquet timestamps written by SR lacked some information.
You can use duckdb to read the schema of this parquet file. It may also be a specific timestamp type that is not yet supported by Velox.
select * from parquet_schema('xx.parquet');
@liujiayi771
I wrote the Iceberg table with Spark 3.3 + Iceberg 1.3.0. Here is the code I used to create the test data.
spark-sql> CREATE TABLE timestamp_test (
> timestamp_col TIMESTAMP
> ) USING iceberg;
Time taken: 0.029 seconds
spark-sql>
> INSERT INTO timestamp_test VALUES (
> TIMESTAMP '2022-01-01 00:00:00'
> );
So the parquet file was created by Iceberg. Maybe Velox is not able to handle parquet files with timestamp columns written by Iceberg? Or is there something wrong with the substrait plan? I don't have any debugging ideas right now.
@j7nhai Can you use duckdb to read the schema information of this parquet file? I can help you take a look at it. I updated the duckdb sql in my previous response.
D select * from parquet_schema("/data/j7nhai/wh/my_test_database/timestamp_test/data/00000-13-073df184-31d4-445c-aeef-b3cc036c693e-00001.parquet");
+-----------------------------------------------------------------------------------------------------------------------------+---------------+-------+-------------+-----------------+--------------+------------------+-------+-----------+----------+-----------------------------------------------------------------------------------------------------+
| file_name | name | type | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type |
+-----------------------------------------------------------------------------------------------------------------------------+---------------+-------+-------------+-----------------+--------------+------------------+-------+-----------+----------+-----------------------------------------------------------------------------------------------------+
| /data/j7nhai/wh/my_test_database/timestamp_test/data/00000-13-073df184-31d4-445c-aeef-b3cc036c693e-00001.parquet | table | | | | 1 | | | | | |
| /data/j7nhai/wh/my_test_database/timestamp_test/data/00000-13-073df184-31d4-445c-aeef-b3cc036c693e-00001.parquet | timestamp_col | INT64 | | REQUIRED | | TIMESTAMP_MICROS | | | 1 | TimestampType(isAdjustedToUTC=1, unit=TimeUnit(MILLIS=<null>, MICROS=MicroSeconds(), NANOS=<null>)) |
+-----------------------------------------------------------------------------------------------------------------------------+---------------+-------+-------------+-----------------+--------------+------------------+-------+-----------+----------+-----------------------------------------------------------------------------------------------------+
@liujiayi771 thx!
@j7nhai For int64 timestamps, you'll need to wait for this pull request to be merged. https://github.com/facebookincubator/velox/pull/8325
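For context, the gap is in how the parquet reader decodes the physical INT64 column into Velox's timestamp representation. Below is a rough, standalone sketch of the arithmetic involved; it is illustration only, not Velox code, and assumes (as the duckdb schema above shows) that the values are TIMESTAMP_MICROS, i.e. microseconds since the Unix epoch.
// Standalone sketch, not Velox code: turning a parquet TIMESTAMP_MICROS value
// (INT64 microseconds since the Unix epoch, which is what Spark/Iceberg wrote
// here) into a seconds + nanoseconds pair, roughly the shape Velox's
// Timestamp carries internally.
#include <cstdint>
#include <cstdio>

struct SecondsNanos {
  int64_t seconds;
  uint64_t nanos;
};

SecondsNanos fromMicros(int64_t micros) {
  int64_t seconds = micros / 1'000'000;
  int64_t remainder = micros % 1'000'000;
  // Keep nanos in [0, 1e9) for pre-1970 values.
  if (remainder < 0) {
    seconds -= 1;
    remainder += 1'000'000;
  }
  return {seconds, static_cast<uint64_t>(remainder) * 1'000};
}

int main() {
  // TIMESTAMP '2022-01-01 00:00:00' UTC is 1640995200 seconds since epoch.
  SecondsNanos ts = fromMicros(1640995200LL * 1'000'000);
  std::printf("seconds=%lld nanos=%llu\n",
              static_cast<long long>(ts.seconds),
              static_cast<unsigned long long>(ts.nanos));
  return 0;
}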
@liujiayi771 Thanks, but I am confused: if the partition column is a timestamp, the file can be read by the native backend even though it is int64.
The substrait plan:
{
  "relations": [{
    "root": {
      "input": {
        "read": {
          "common": {
            "direct": {
            }
          },
          "baseSchema": {
            "names": ["col"],
            "struct": {
              "types": [{
                "timestamp": {
                  "nullability": "NULLABILITY_NULLABLE"
                }
              }]
            },
            "columnTypes": ["PARTITION_COL"]
          },
          "advancedExtension": {
            "optimization": {
              "@type": "/google.protobuf.StringValue",
              "value": "isMergeTree\u003d0\n"
            }
          }
        }
      },
      "names": ["col#1", "col#1"]
    }
  }]
}
and the schema from duckdb:
select * from parquet_schema("/data/j7nhai/wh/my_test_databases/part_ts/data/col=2021-12-31T16%3A00Z/00154-1-891d43cb-312e-49d2-b88c-880f1fe026f8-00001.parquet");
+---------------------------------------------------------------------------------------------------------------------------------------------+-------+-------+-------------+-----------------+--------------+------------------+-------+-----------+----------+-----------------------------------------------------------------------------------------------------+
| file_name | name | type | type_length | repetition_type | num_children | converted_type | scale | precision | field_id | logical_type |
+---------------------------------------------------------------------------------------------------------------------------------------------+-------+-------+-------------+-----------------+--------------+------------------+-------+-----------+----------+-----------------------------------------------------------------------------------------------------+
| /data/j7nhai/wh/my_test_databases/part_ts/data/col=2021-12-31T16%3A00Z/00154-1-891d43cb-312e-49d2-b88c-880f1fe026f8-00001.parquet | table | | | | 1 | | | | | |
| /data/j7nhai/wh/my_test_databases/part_ts/data/col=2021-12-31T16%3A00Z/00154-1-891d43cb-312e-49d2-b88c-880f1fe026f8-00001.parquet | col | INT64 | | REQUIRED | | TIMESTAMP_MICROS | | | 1 | TimestampType(isAdjustedToUTC=1, unit=TimeUnit(MILLIS=<null>, MICROS=MicroSeconds(), NANOS=<null>)) |
+---------------------------------------------------------------------------------------------------------------------------------------------+-------+-------+-------------+-----------------+--------------+------------------+-------+-----------+----------+-----------------------------------------------------------------------------------------------------+
@j7nhai For partition columns, Gluten reads the partition values and passes them to Velox; Velox does not read them from the parquet file.
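In other words, for a split like col=2021-12-31T16%3A00Z/... the timestamp comes from the partition path metadata rather than from the parquet data pages, which is why the INT64 encoding does not matter there. A hypothetical sketch of that path follows; the function names are made up for illustration, and this is not Gluten or Velox code.
// Hypothetical sketch, not Gluten/Velox code: a partition column's value is
// parsed from the split/path metadata (e.g. "col=2021-12-31T16%3A00Z") and
// emitted as a constant for every row, so the INT64 TIMESTAMP_MICROS column
// inside the file is never decoded.
#include <cstdio>
#include <ctime>
#include <string>

// Parse an already URL-decoded partition value like "2021-12-31T16:00Z" into
// epoch seconds. strptime/timegm are POSIX/GNU extensions.
long parsePartitionTimestamp(const std::string& value) {
  std::tm tm{};
  if (strptime(value.c_str(), "%Y-%m-%dT%H:%MZ", &tm) == nullptr) {
    return -1;  // Real code would report the malformed partition value.
  }
  return timegm(&tm);
}

int main() {
  long seconds = parsePartitionTimestamp("2021-12-31T16:00Z");
  // The scan would then fill the partition column with this constant instead
  // of reading it from the parquet data pages.
  std::printf("partition value = %ld seconds since epoch\n", seconds);
  return 0;
}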
Backend
VL (Velox)
Bug description
An exception is thrown when reading an Iceberg table with a single timestamp column.
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response