duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0

Add set key to data.table operations. #90

Open ricardonovaes opened 1 month ago

ricardonovaes commented 1 month ago

setkey makes joins much faster in data.table. The code in the join benchmark does not set the keys properly, and this can affect the main results.

It is also important in other kinds of data manipulation, such as deduplication. For instance:

setkey(DT, key)
unique(DT, by = 'key')

is much faster than

unique(DT, by = 'key')

without the setkey call.

The runtime can go from 15 minutes to seconds for 100GB+ datasets.
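
A rough way to check the deduplication case on a small scale is sketched below. The table, its size, and the column names `id` and `value` are made up for illustration (a column literally named `key` would clash with the `key` argument of `data.table()`), and the timings will depend on the machine and the data, so treat this as a sanity check rather than a benchmark.

```r
library(data.table)

set.seed(1)
# Hypothetical toy table; `id` and `value` are illustrative column names.
DT <- data.table(id = sample(1e5L, 1e6L, replace = TRUE), value = runif(1e6L))

# Deduplicate without a key.
t_unkeyed <- system.time(u1 <- unique(DT, by = "id"))

# Sort a copy by `id` once (setkey works by reference), then deduplicate.
DTk <- copy(DT)
setkey(DTk, id)
t_keyed <- system.time(u2 <- unique(DTk, by = "id"))

# Both approaches keep exactly one row per distinct id.
stopifnot(nrow(u1) == nrow(u2), setequal(u1$id, u2$id))

# Elapsed seconds for each variant (the setkey call itself is not timed here).
print(c(unkeyed = t_unkeyed[["elapsed"]], keyed = t_keyed[["elapsed"]]))
```

Note that the sketch times only the unique call. Whether the cost of setkey should count toward the measured time is a separate decision for the benchmark.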

Joins work the same way:

setkey(DTA, key)
setkey(DTB, key)

DTA[DTB, on = .(key)]
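
Here is a minimal, self-contained sketch of the keyed join pattern. The two toy tables and the column names `id`, `a`, and `b` are hypothetical and only there to make the example runnable. It is not the benchmark's actual code.

```r
library(data.table)

# Hypothetical toy tables; `id`, `a`, and `b` are illustrative column names.
DTA <- data.table(id = c(3L, 1L, 2L), a = c("c", "a", "b"))
DTB <- data.table(id = c(2L, 3L, 4L), b = c(20, 30, 40))

# Sort both tables by the join column and mark it as the key (by reference).
setkey(DTA, id)
setkey(DTB, id)

# For each row of DTB, look up the matching rows in DTA.
# With keys set, DTA[DTB] joins on the key columns without `on =`.
res_key <- DTA[DTB]

# The same join written with an explicit `on =`, as in the snippet above.
res_on <- DTA[DTB, on = .(id)]

# id 4 has no match in DTA, so its `a` value is NA.
stopifnot(identical(res_key$id, c(2L, 3L, 4L)))
stopifnot(identical(res_key$a, c("b", "c", NA)))
stopifnot(identical(res_key$a, res_on$a), identical(res_key$b, res_on$b))
```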

I hope this can make the benchmark better!!