facebookincubator / velox

A composable and fully extensible C++ execution engine library for data management systems.
https://velox-lib.io/
Apache License 2.0
3.54k stars 1.17k forks

Parquet Array and Struct readers read and uncompress all V1 pages twice #3614

Open yingsu00 opened 1 year ago

yingsu00 commented 1 year ago

Description

These readers read the leaf-level rep/def levels up front. For V1 Parquet files this requires the whole page to be read and uncompressed, because the rep/def levels are stored in the page body.

ListColumnReader::read
    ensureRepDefs
        RepeatedColumnReader::readLeafRepDefs
            if leaf
                return PageReader::loadNextPage  // read rep/def levels
                    preloadRepDefs()
                        seekToPage(kRepDefOnly);
                            prepareDataPageV1()
                                pageData_ = readBytes(..);
                                pageData_ = uncompressData();

Later, when the actual data is read, the reader has to rewind to the first row, and seekToPage(row) is called again when the nulls are read. This reads and uncompresses all the pages a second time, and both are very expensive operations.

Furthermore, the struct reader reads all pages in a RowGroup together up front. This is sometimes unnecessary, for example when the rows to read are not on every page.
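
To illustrate the "rows are not on every page" case, here is a minimal sketch of picking only the pages that contain requested rows. `pagesToLoad` is a hypothetical helper, not a Velox function, and it assumes the per-page top-level row counts are already known by some means. For V1 pages of nested columns that is generally not the case until the rep levels have been decoded, so treat this as an illustration of the idea rather than a drop-in fix.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Velox): returns the indices of the pages
// that contain at least one requested row.
// pageRowCounts[i] is the number of top-level rows on page i (assumed known).
// rows holds the requested row numbers in ascending order.
std::vector<size_t> pagesToLoad(
    const std::vector<int64_t>& pageRowCounts,
    const std::vector<int64_t>& rows) {
  assert(std::is_sorted(rows.begin(), rows.end()));
  std::vector<size_t> result;
  int64_t pageStart = 0;
  size_t rowIdx = 0;
  for (size_t page = 0;
       page < pageRowCounts.size() && rowIdx < rows.size();
       ++page) {
    const int64_t pageEnd = pageStart + pageRowCounts[page];
    if (rows[rowIdx] < pageEnd) {
      result.push_back(page);
      // Consume every requested row that lives on this page.
      while (rowIdx < rows.size() && rows[rowIdx] < pageEnd) {
        ++rowIdx;
      }
    }
    pageStart = pageEnd;
  }
  return result;
}
```

With three pages of 100 rows each and requested rows 5 and 250, only pages 0 and 2 are selected, so the middle page would never need to be read or uncompressed.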

Since these complex types do need to read multiple pages to determine how many pages are in the current batch, we can save the pages for later use, as in my original PR: https://github.com/facebookincubator/velox/pull/3315. It would be better to avoid adding another variable and an if-else check in prepareDataPageV1().
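
The "save the pages for later use" idea can be sketched roughly as follows. This is not the code from the PR and not Velox's PageReader. `DecompressedPageCache`, `Loader`, `get`, and `releaseBefore` are hypothetical names, and the page body is modeled as a `std::string` purely for illustration. The point is that the rep/def pass and the data pass go through the same lookup, so each page is read and uncompressed once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache (not Velox's PageReader): keeps each decompressed page
// body keyed by page index so that a second pass can reuse it.
class DecompressedPageCache {
 public:
  // Stands in for "read bytes + uncompress" of one page.
  using Loader = std::function<std::string(size_t)>;

  explicit DecompressedPageCache(Loader loader) : loader_(std::move(loader)) {
    assert(loader_);
  }

  // Returns the decompressed body of page `index`. The loader runs only on
  // the first request for that page.
  const std::string& get(size_t index) {
    auto it = pages_.find(index);
    if (it == pages_.end()) {
      it = pages_.emplace(index, loader_(index)).first;
    }
    return it->second;
  }

  // Drops pages with an index below `index` once the data pass has consumed
  // them, so memory stays bounded by the pages of the current batch.
  void releaseBefore(size_t index) {
    for (auto it = pages_.begin(); it != pages_.end();) {
      if (it->first < index) {
        it = pages_.erase(it);
      } else {
        ++it;
      }
    }
  }

  size_t size() const {
    return pages_.size();
  }

 private:
  Loader loader_;
  std::unordered_map<size_t, std::string> pages_;
};
```

In this sketch the first pass calls `get(i)` while collecting rep/def levels, the second pass calls `get(i)` again for the values and nulls, and `releaseBefore` is called as the batch advances. The trade-off is memory: the uncompressed pages of the current batch stay resident between the two passes.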

yingsu00 commented 1 year ago

cc @Yuhta @frankobe @oerling