facebookincubator / velox

A C++ vectorized database acceleration library aimed at optimizing query engines and data processing systems.
https://velox-lib.io/
Apache License 2.0

LazyVector support for Parquet Reader #9563

Closed · Yuhta closed this issue 1 month ago

Yuhta commented 6 months ago

Description

Currently the Parquet reader does not produce LazyVector, as isTopLevel() is always false for Parquet column readers, and advanceFieldReader is not implemented (this one may not be needed). This results in a large performance gap between the DWRF/ORC and Parquet readers. It would be nice to get LazyVector working for the Parquet reader.

yingsu00 commented 5 months ago

@Yuhta Thanks Jimmy! Yes, I've been thinking of doing that. Just curious, what is the perf gap between DWRF and Parquet like in your tests, and for what workload?

Yuhta commented 5 months ago

I have not measured it, but LazyVector mostly benefits you when you have a non-pushdown filter (i.e. a remaining expression) on some smaller key columns with a very high filtering rate (>99.9%), plus large lazy payload columns (nested row/array/map); then we can avoid reading the majority of the payload column content. A typical example is deterministic random sampling (e.g. where hash(id) % 1000 = 0). See the sketch below.
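To make that scenario concrete, here is a minimal, self-contained C++ sketch of the idea. It does not use the real velox::LazyVector API; LazyColumn and its Loader are hypothetical stand-ins that only illustrate how decoding the payload column can be deferred until the remaining filter has selected the surviving rows:

```cpp
// Illustrative only: a simplified stand-in for the LazyVector idea,
// not the actual Velox classes. All names here are hypothetical.
#include <cstdint>
#include <functional>
#include <iostream>
#include <string>
#include <vector>

// A column whose values are decoded on demand, only for the rows requested.
struct LazyColumn {
  using Loader =
      std::function<std::vector<std::string>(const std::vector<int32_t>& rows)>;

  explicit LazyColumn(Loader loader) : loader_(std::move(loader)) {}

  // Decode just the requested row indices (e.g. rows that passed the filter).
  std::vector<std::string> load(const std::vector<int32_t>& rows) {
    return loader_(rows);
  }

 private:
  Loader loader_;
};

int main() {
  constexpr int32_t kNumRows = 1'000'000;

  // Large "payload" column: decoding is expensive, so count how many rows
  // actually get decoded.
  int64_t decodedRows = 0;
  LazyColumn payload([&](const std::vector<int32_t>& rows) {
    decodedRows += static_cast<int64_t>(rows.size());
    return std::vector<std::string>(rows.size(), "big nested value");
  });

  // Remaining filter on a small key column with ~0.1% selectivity
  // (analogous to: where hash(id) % 1000 = 0).
  std::vector<int32_t> passingRows;
  for (int32_t i = 0; i < kNumRows; ++i) {
    if (i % 1000 == 0) {
      passingRows.push_back(i);
    }
  }

  // Only the surviving rows trigger decoding of the payload column.
  auto values = payload.load(passingRows);

  std::cout << "rows scanned: " << kNumRows
            << ", payload rows decoded: " << decodedRows << "\n";
  // Eager decoding would have materialized all 1,000,000 payload rows.
}
```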

yingsu00 commented 5 months ago

Got it, thanks @Yuhta! We plan to add it later this year.

jaystarshot commented 1 month ago

@Yuhta Is there any documentation on what LazyVector exactly is? I wanted to understand how it decreases the data read.

jaystarshot commented 1 month ago

I saw the comment here. From the comments, it looks like if there is filter pushdown in the scan, lazy loading can reduce some loading into memory. However, I am confused about how this works in the context of scans: does it mean that we read less data from the source, or that we read the same data from the source and just "load" it (convert from ORC/Parquet to the internal Velox format) lazily based on filters?

Yuhta commented 1 month ago

@jaystarshot In most cases we do the same amount of IO, but we save the decoding time and the memory used to convert the file format into Velox vectors.
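A rough sketch of that distinction, using the same hypothetical names as the example above rather than the real reader code: the encoded Parquet pages for the payload column may still be read from storage, but the expensive decode into in-memory vectors only happens for the rows a downstream operator actually asks for.

```cpp
// Illustrative only: separates "IO" (reading encoded bytes) from "decode"
// (converting the file format into in-memory vectors). Names are hypothetical.
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

struct EncodedPages {
  std::vector<uint8_t> bytes; // raw encoded pages, already read from storage
};

// IO happens eagerly: the encoded pages are fetched regardless of the filter.
EncodedPages readPayloadPages() {
  return EncodedPages{std::vector<uint8_t>(8 * 1024 * 1024)}; // pretend 8 MiB of pages
}

// Decode happens lazily: only for the row indices that survived the filter.
std::vector<std::string> decodeRows(
    const EncodedPages& pages,
    const std::vector<int32_t>& rows) {
  std::cout << "decoding " << rows.size() << " rows from "
            << pages.bytes.size() << " encoded bytes\n";
  return std::vector<std::string>(rows.size(), "decoded value");
}

int main() {
  EncodedPages pages = readPayloadPages();        // same IO either way
  std::vector<int32_t> survivors = {0, 1000, 2000}; // rows passing the remaining filter
  auto values = decodeRows(pages, survivors);     // decode cost scales with survivors only
}
```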

jaystarshot commented 1 month ago

Got it, thanks!