duckdblabs / db-benchmark

reproducible benchmark of database-like ops
https://duckdblabs.github.io/db-benchmark/
Mozilla Public License 2.0
136 stars 27 forks source link

Inconsistent multicore use across languages? #47

Closed grantmcdermott closed 9 months ago

grantmcdermott commented 9 months ago

Probably a question for @jangorecki, since this applies to data.table's thread retrieval function. But it will effect other R implementations relying on getDThreads, e.g. the newly proposed r-collapse PR (#33).

Here's an example from my own EC2 server.

First, basic machine characteristics.

[~]$ grep -E -w 'VERSION|NAME|PRETTY_NAME' /etc/os-release    
NAME="Amazon Linux"
VERSION="2"
PRETTY_NAME="Amazon Linux 2"

[~]$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('  
CPU(s):              96
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2

The Python call gets the 96 CPU number right.

[~]$ python3 -c 'import multiprocessing as mp; print(mp.cpu_count())'
96

But data.table only recognises half of the available cores (physical only?).

[~]$ Rscript -e "data.table::getDTthreads()" 
[1] 48
jangorecki commented 9 months ago

That is why test scripts include call to setDTthreads() at start to male it consistent

Tmonster commented 9 months ago

Hmm, this is a good point. Globally searching for setDTthreads() shows that the function is only called in datatable runs. Many other R solutions use ncores = parallel::detectCores().

Still working on publishing the results, it will most likely come with a blog post explaining the delay (and other updates).

As a preview, we will move to a click-bench model where anyone can request the results be updated if they can show faster execution times (with different settings/new solution version) on the same machine. After verification, the results will be updated. This means that if R-collapse or data.table used an incorrect number of threads, the benchmark can be updated within a week.

jangorecki commented 9 months ago

setDTthreads is only used in datatable runs because it controls number of threads only for datatable.

grantmcdermott commented 9 months ago

setDTthreads is only used in datatable runs because it controls number of threads only for datatable.

It's also used in the proposed r-collapse run. At least, getDTthteads is, without the corresponding/corrective set call earlier afaict. I'll flag in that PR and then close this issue.