duckdb / duckdb_vss

MIT License

Fully parallelize index construction #21

Closed. Maxxen closed this 3 months ago

Maxxen commented 3 months ago

Instead of constructing indexes while sinking into the create index operator, we now buffer all input first and then spawn one task per thread; the tasks construct the index in parallel with all the data available. This means we now parallelize over individual vectors instead of row groups, regardless of how much data we receive, and we no longer need to resize/reallocate the index multiple times (with extra locking) during construction.
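
For readers unfamiliar with the pattern, here is a minimal sketch of "buffer everything, then build in parallel". It is not the code from this PR: `ToyIndex`, `Add` and `BuildParallel` are hypothetical names, plain `std::thread` stands in for DuckDB's task scheduler, and the toy index just copies each vector into a preallocated slot. A real HNSW index would still need some internal synchronization when linking graph nodes.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a vector index (assumption, not the extension's types).
struct ToyIndex {
    std::size_t dim;
    std::vector<float> slots;          // capacity * dim floats, allocated exactly once
    std::atomic<std::size_t> count{0}; // number of vectors added so far

    ToyIndex(std::size_t dim_, std::size_t capacity)
        : dim(dim_), slots(capacity * dim_) {}

    // Every row id owns a distinct slot, so concurrent calls with different
    // row ids write to disjoint memory and need no lock in this toy version.
    void Add(std::size_t row_id, const float *vec) {
        std::copy(vec, vec + dim, slots.begin() + row_id * dim);
        count.fetch_add(1, std::memory_order_relaxed);
    }
};

// Phase 1 (the sink) is assumed to have appended every input vector to `buffered`.
// Phase 2: the index was sized once up front, so the workers never trigger a
// resize; each worker claims one vector at a time from a shared counter.
void BuildParallel(ToyIndex &index, const std::vector<float> &buffered,
                   std::size_t num_threads) {
    assert(index.dim > 0 && buffered.size() % index.dim == 0);
    const std::size_t total = buffered.size() / index.dim;
    assert(total * index.dim <= index.slots.size());

    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;
    const std::size_t n = std::max<std::size_t>(1, num_threads);
    for (std::size_t t = 0; t < n; ++t) {
        workers.emplace_back([&index, &buffered, &next, total] {
            for (;;) {
                // Work is handed out per vector, not per row group.
                const std::size_t row = next.fetch_add(1);
                if (row >= total) {
                    break;
                }
                index.Add(row, buffered.data() + row * index.dim);
            }
        });
    }
    for (auto &w : workers) {
        w.join();
    }
}
```

Because the total row count is known before any task starts, the index can be allocated once at its final size, and the threads stay evenly loaded even when the input arrives as only a few row groups.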

On my machine this gives an almost 10x speedup in index construction. There are still a number of smaller optimizations we can do.