This is a big one that I put here partly just to see who's still interested in this repo!
I've been playing around a bit the last couple days with DuckDB, a columnar database that's the heir to MonetDB, which I had thought about for this project but never used.
Duck is much lighter than anything out there except SQLlite, but unlike SQLlite, does a column-oriented store more appropriate for the queries here because related blocks of memory will be close together.
The purpose of MySQL for this project has always been:
Handle the details of JOIN queries
Build massive B-tree indices that allow reasonably fast access to wordid pages and put all the bookids for a given wordid contiguously on disk so that you don't have to seek to a million different places on the hard drive.
Duck DB can do the first fine; and for the second, the builtin BRIN indexes turn out to be faster than MySQL if you can handle the fairly difficult work of sorting a billion records or so before loading it into duck. Having managed to sort the rate my professor bookworm master_bookcounts data in Apache Arrow feather format (that's a whole different story--this can takes days on MySQL for a trillion words, but I think I've got a decent O(N-log(N)) multi-pass on-disk sort going.), it performs better than MySQL on some standard queries.
Plus, it doesn't have the expensive request for large in-memory tables; and the actual on-disk files seems to be a bit smaller, even though we're using 4-byte ints instead of 3-byte ints.
SELECT SUM(count), date_year FROM master_bookcounts NATURAL JOIN fastcat WHERE wordid = 9 GROUP BY date_year;
MySQL: 16 rows in set (5.246 sec)
DuckDB: 3.65 s
SELECT date_year, department, SUM(count) FROM master_bookcounts NATURAL JOIN catalog WHERE wordid = 118 GROUP BY date_year, department;
8624 rows. 1 min 2 seconds in MySQL (Not fair because we've always grouped on integer keys, not text keys for department).
3.78 seconds in DuckDB (Still grouping on text keys!!)
This is a big one that I put here partly just to see who's still interested in this repo!
I've been playing around a bit the last couple days with DuckDB, a columnar database that's the heir to MonetDB, which I had thought about for this project but never used.
Duck is much lighter than anything out there except SQLlite, but unlike SQLlite, does a column-oriented store more appropriate for the queries here because related blocks of memory will be close together.
The purpose of MySQL for this project has always been:
Duck DB can do the first fine; and for the second, the builtin BRIN indexes turn out to be faster than MySQL if you can handle the fairly difficult work of sorting a billion records or so before loading it into duck. Having managed to sort the rate my professor bookworm master_bookcounts data in Apache Arrow feather format (that's a whole different story--this can takes days on MySQL for a trillion words, but I think I've got a decent O(N-log(N)) multi-pass on-disk sort going.), it performs better than MySQL on some standard queries.
Plus, it doesn't have the expensive request for large in-memory tables; and the actual on-disk files seems to be a bit smaller, even though we're using 4-byte ints instead of 3-byte ints.
MySQL: 16 rows in set (5.246 sec) DuckDB: 3.65 s
8624 rows. 1 min 2 seconds in MySQL (Not fair because we've always grouped on integer keys, not text keys for department). 3.78 seconds in DuckDB (Still grouping on text keys!!)