These readers read the leaf-level rep/def levels up front. For a V1 Parquet file this requires reading and decompressing the whole page, because the rep/def levels are stored in the page body.
Later, when the actual data is read, the reader has to rewind to the first row, and seekToPage(row) is called again when the nulls are read. This reads and decompresses all the pages a second time, which is very expensive.
Furthermore, the struct reader reads all pages in a RowGroup up front. This is sometimes unnecessary, because the rows to read may not be on every page.
Since these complex types do need to read multiple pages to determine how many pages are in the current batch, we can save the pages for later use, as in my original PR https://github.com/facebookincubator/velox/pull/3315
It would be better to avoid adding another variable and an if-else check in prepareDataPageV1().
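To illustrate the "save the pages for later use" idea, here is a minimal sketch, not the actual Velox code. `DecodedPage`, `PageCache`, and the `loader` callback are hypothetical names invented for this example. The loader stands in for the expensive read-and-decompress step, and the cache makes sure each page goes through it at most once, even if the rep/def pass and the data pass both ask for the same page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a decompressed V1 data page (not a Velox type).
struct DecodedPage {
  int64_t firstRow;  // Index of the first row on this page.
  int64_t numRows;   // Number of rows on this page.
  std::string body;  // Decompressed bytes: rep/def levels followed by values.
};

// Hypothetical cache keyed by page index. Pages are loaded lazily, so pages
// that no reader asks for are never read or decompressed.
class PageCache {
 public:
  using Loader = std::function<DecodedPage(size_t)>;

  explicit PageCache(Loader loader) : loader_(std::move(loader)) {
    assert(loader_ != nullptr);
  }

  // Returns the page, invoking the loader only on first access.
  const DecodedPage& get(size_t pageIndex) {
    auto it = pages_.find(pageIndex);
    if (it == pages_.end()) {
      ++numLoads_;
      it = pages_.emplace(pageIndex, loader_(pageIndex)).first;
    }
    return it->second;
  }

  // Drops every cached page whose last row is before `row`, so memory does
  // not grow with the whole RowGroup.
  void releaseBefore(int64_t row) {
    for (auto it = pages_.begin(); it != pages_.end();) {
      if (it->second.firstRow + it->second.numRows <= row) {
        it = pages_.erase(it);
      } else {
        ++it;
      }
    }
  }

  size_t numLoads() const { return numLoads_; }
  size_t numCached() const { return pages_.size(); }

 private:
  Loader loader_;
  std::unordered_map<size_t, DecodedPage> pages_;
  size_t numLoads_ = 0;
};
```

With a cache like this, the rep/def pass calls `get(i)` for the pages it needs to size the batch, the later data pass calls `get(i)` again and hits the cache, and `releaseBefore(row)` is called once a batch has been fully consumed. Whether this fits the existing page reader without an extra branch in prepareDataPageV1() is an open question that the sketch does not answer.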