facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

Wrong result while Iceberg table read with Positional Delete #9856

Open agrawalreetika opened 2 weeks ago

agrawalreetika commented 2 weeks ago

Bug description

When a data file and its position delete file both contain multiple row groups, the resulting row count is wrong. For example, if I DELETE all the rows from the original table and then select the count from the resulting table, it returns a non-zero count.

Steps To Reproduce

Execution Engine - Presto
Data Format - PARQUET

presto> set session iceberg.parquet_writer_block_size='2MB';

presto> CREATE TABLE customer_v2 WITH (format = 'PARQUET') AS SELECT * FROM tpch.sf1.customer;
presto> select count(*), max(custkey), min(custkey) from  customer_v2;
 _col0  | _col1  | _col2 
--------+--------+-------
 150000 | 150000 |     1 
(1 row)

presto> DELETE FROM customer_v2 WHERE custkey >=1 AND custkey<=150000; 

Current Query Output -

presto> select count(*) from customer_v2;
 _col0 
-------
 57413 
(1 row)

Expected Query Output -

presto> select count(*) from customer_v2;
 _col0 
-------
     0 
(1 row)

System information

NA

Relevant logs

No response

Yuhta commented 2 weeks ago

@yingsu00 Can you take a look?

The count should be subtracted here. Make sure the bitmask is created correctly: https://github.com/facebookincubator/velox/blob/e8244c2e0fd0fcdb80372b126859493c57382ef0/velox/dwio/common/SelectiveStructColumnReader.cpp#L64
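
To illustrate what "subtracting the count" with a deletion bitmask means, here is a minimal standalone sketch. It is not Velox code and does not claim to identify the actual bug: `buildDeleteMask` and `countLiveRows` are hypothetical helpers, and the sketch assumes the position delete file yields sorted, file-level row positions while the reader processes the data file in batches that each start at some file-level offset. Under those assumptions, each batch's mask has to be built relative to the batch's starting position, and the live count is the batch size minus the number of set bits.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not a Velox API): builds a per-batch deletion mask.
//   deletedPositions: sorted file-level row positions from the delete file.
//   batchStart:       file-level position of the first row in this batch.
//   batchSize:        number of rows in this batch.
// Bit i is set when file-level row (batchStart + i) has been deleted.
std::vector<bool> buildDeleteMask(
    const std::vector<int64_t>& deletedPositions,
    int64_t batchStart,
    int64_t batchSize) {
  std::vector<bool> mask(static_cast<std::size_t>(batchSize), false);
  // Skip deleted positions that fall before this batch.
  auto it = std::lower_bound(
      deletedPositions.begin(), deletedPositions.end(), batchStart);
  // Mark only the positions inside [batchStart, batchStart + batchSize).
  for (; it != deletedPositions.end() && *it < batchStart + batchSize; ++it) {
    // Convert the file-level position to a batch-relative index.
    mask[static_cast<std::size_t>(*it - batchStart)] = true;
  }
  return mask;
}

// Hypothetical helper: rows that survive = batch rows minus deleted rows.
int64_t countLiveRows(const std::vector<bool>& mask) {
  const auto deleted = std::count(mask.begin(), mask.end(), true);
  return static_cast<int64_t>(mask.size()) - static_cast<int64_t>(deleted);
}
```

In the scenario from this issue every position is deleted, so summing `countLiveRows` over all batches of all row groups should give 0. If the batch offset were ignored for row groups after the first, the mask would miss deletions and the sum would be non-zero.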

yingsu00 commented 2 weeks ago

Thanks @Yuhta for looking into this. I will work on the fix.