facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
Apache License 2.0

Wrong result while Iceberg table read with Positional Delete #9856

Open agrawalreetika opened 2 weeks ago

agrawalreetika commented 2 weeks ago

Bug description

When a data file and its position delete file each contain multiple row groups, the resulting row count is wrong. For example, if I DELETE all the rows from the original table and then select the count from the resulting table, it returns a non-zero count.

Steps To Reproduce

Execution Engine - Presto
Data Format - PARQUET

presto> set session iceberg.parquet_writer_block_size='2MB';

presto> CREATE TABLE customer_v2 WITH (format = 'PARQUET') AS SELECT * FROM tpch.sf1.customer;
presto> select count(*), max(custkey), min(custkey) from customer_v2;
 _col0  | _col1  | _col2 
 150000 | 150000 |     1 
(1 row)

presto> DELETE FROM customer_v2 WHERE custkey >=1 AND custkey<=150000; 

Current Query Output -

presto> select count(*) from customer_v2;
(1 row)

Expected Query Output -

presto> select count(*) from customer_v2;
(1 row)

System information


Relevant logs

No response

Yuhta commented 2 weeks ago

@yingsu00 Can you take a look?

The deleted rows should be subtracted from the count here; make sure the bitmask is created correctly: https://github.com/facebookincubator/velox/blob/e8244c2e0fd0fcdb80372b126859493c57382ef0/velox/dwio/common/SelectiveStructColumnReader.cpp#L64
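
For illustration only, here is a minimal sketch of the accounting described above. It is not Velox's implementation: the function `countSurvivingRows`, its parameters, and the idea of one batch per row group are all assumptions made for the example. It shows how file-level delete positions could be translated into a batch-local bitmask so that the deleted rows are subtracted from each batch's count, including when the positions span several batches.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Velox: counts the rows that survive
// positional deletes when a file is read in fixed-size batches (for
// example, one batch per row group).
// Assumption: `deletedPositions` holds non-negative file-level row
// positions sorted in ascending order.
int64_t countSurvivingRows(
    int64_t totalRows,
    int64_t batchSize,
    const std::vector<int64_t>& deletedPositions) {
  int64_t surviving = 0;
  size_t next = 0; // Index of the first delete position not yet consumed.
  for (int64_t batchStart = 0; batchStart < totalRows;
       batchStart += batchSize) {
    const int64_t batchRows = std::min(batchSize, totalRows - batchStart);
    // Per-batch bitmask: bit i set means row (batchStart + i) is deleted.
    std::vector<bool> deleted(static_cast<size_t>(batchRows), false);
    while (next < deletedPositions.size() &&
           deletedPositions[next] < batchStart + batchRows) {
      // Translate the file-level position into a batch-local offset.
      deleted[static_cast<size_t>(deletedPositions[next] - batchStart)] = true;
      ++next;
    }
    const int64_t numDeleted =
        std::count(deleted.begin(), deleted.end(), true);
    // Subtract the deleted rows from this batch's contribution.
    surviving += batchRows - numDeleted;
  }
  return surviving;
}
```

With this sketch, deleting every position of a 10-row file read in batches of 4 leaves a count of zero, which is the behavior the issue expects after deleting all rows. The key step is that `next` carries over between batches, so delete positions beyond the first batch are still applied.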

yingsu00 commented 2 weeks ago

Thanks @Yuhta for looking into this. I will work on the fix.