Open rcaudy opened 3 years ago
Potential optimization idea (found during #4334): Currently, when dictionary encoding hits the limit on dictionary size, we discard all the work done so far and fall back to plain encoding for all the pages. A better approach would be to first add a dictionary page with the data collected so far, and then use plain encoding only for the pages that follow.
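To make the proposal concrete, here is a minimal sketch of the "keep the dictionary, switch to plain" behavior. It is not our writer code: `encode_column`, the `("dict", ...)`/`("plain", ...)` page tuples, and a limit expressed as a number of entries (rather than bytes) are all assumptions made for illustration.

```python
from typing import Hashable, List, Sequence, Tuple

# Hypothetical page representation: (encoding name, payload).
Page = Tuple[str, list]


def encode_column(
    pages: Sequence[Sequence[Hashable]], max_dict_size: int
) -> Tuple[List[Hashable], List[Page]]:
    """Dictionary-encode pages until the dictionary would overflow,
    then plain-encode the remaining pages without redoing earlier ones."""
    dictionary = {}  # value -> index, in insertion order
    encoded: List[Page] = []
    use_dict = True

    for page in pages:
        if use_dict:
            added = []    # values first seen in this page, for rollback
            indices = []
            for value in page:
                idx = dictionary.get(value)
                if idx is None:
                    if len(dictionary) >= max_dict_size:
                        # Limit hit: undo this page's additions only and
                        # stop dictionary-encoding from this page onward.
                        for v in added:
                            del dictionary[v]
                        use_dict = False
                        break
                    idx = len(dictionary)
                    dictionary[value] = idx
                    added.append(value)
                indices.append(idx)
            if use_dict:
                encoded.append(("dict", indices))
                continue
        # Either we already fell back, or this page triggered the fallback.
        encoded.append(("plain", list(page)))

    # The dictionary page would be written first, ahead of the data pages.
    return list(dictionary), encoded


dict_page, data_pages = encode_column(
    [["a", "b", "a"], ["b", "c"], ["d", "e"], ["a"]], max_dict_size=3
)
```

In this example the first two pages stay dictionary-encoded against `["a", "b", "c"]`, and only the third and fourth pages are written plain. The page that triggers the overflow is rolled back and written plain as a whole, which keeps each page in a single encoding.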
Some more optimization opportunities:
The first and third suggestions above didn't show much improvement. More details can be found in this doc.
Currently, nothing about our Parquet table writing is parallel, beyond splitting the work across multiple tables/files. We should investigate options here if it becomes a performance bottleneck for users.