Bookworm-project / BookwormDB

Tools for text tokenization and encoding
MIT License

Switch to DuckDB #146

Open bmschmidt opened 3 years ago

bmschmidt commented 3 years ago

Once this is merged, we are done with MySQL. With bigrams restored, I think it's pretty close to being ready.

organisciak commented 3 years ago

I trust you to do this merge, since you have the freshest understanding of the code. Perhaps loop in HTRC people like @borice?

How does DuckDB perform?

bmschmidt commented 3 years ago

This is not yet completely ready for review, but close enough that I want to put it in tracking.

I'm still generally finding DuckDB to be about 1.5x faster on standard queries on the Rate My Professor bookworm, and much faster on ingest. I just made a major change to the sort code, though: DuckDB now handles the word sorting (the stage that used to be index building in MySQL, which often took 6-12 hours).

DuckDB has also just added forms of numeric compression that significantly reduce disk space requirements compared to MySQL. As a rough guess, databases should be about one-third the size they were with MySQL.