Open anishathalye opened 2 years ago
I recommend making the default `num_cpus = threads // 2` on most machines, just to be safe.
On several Intel processors, I have benchmarked and found that using `num_cpus()` is slower than using `num_cpus() // 2`. For example, we should use 10 processes instead of 20 on the Intel Core i9-9820X X-Series Processor.
That's because most modern Intel CPUs support hyperthreading, and most people's machines have hyperthreading enabled. We shouldn't blindly use `multiprocessing.cpu_count() // 2`, because that would use only half the physical cores on machines that have hyperthreading disabled. Instead, we should count physical cores directly, e.g. with `psutil.cpu_count(logical=False)`. (Though we probably don't want to add another dependency for this; we could just write the code ourselves.)
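Writing it ourselves could look something like the sketch below, which counts unique `(physical id, core id)` pairs in `/proc/cpuinfo` on Linux and falls back to the logical count elsewhere. This is an illustrative sketch, not cleanlab's actual code; the function name and fallback behavior are my own choices.

```python
import multiprocessing


def physical_cpu_count():
    """Best-effort count of physical cores, without the psutil dependency.

    Illustrative sketch: on Linux, count unique (physical id, core id)
    pairs in /proc/cpuinfo so that hyperthread siblings are ignored;
    on any failure, fall back to the logical core count.
    """
    try:
        cores = set()
        physical_id = None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    physical_id = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    cores.add((physical_id, line.split(":")[1].strip()))
        if cores:
            return len(cores)
    except OSError:
        pass  # /proc/cpuinfo unavailable (e.g. macOS, Windows)
    return multiprocessing.cpu_count()  # fallback: logical cores
```

On a hyperthreaded machine this returns half of `multiprocessing.cpu_count()`; on machines with hyperthreading disabled, the two counts match.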
Update: This has been addressed for regular classification tasks since cleanlab v2.3.0 (including highly scalable label issue detection via `find_label_issues_batched()`).

For multi-label classification datasets, the multiprocessing could still use improvement. Currently the multiprocessing is done across classes in `cleanlab.filter.find_label_issues()`, but in multi-label classification, each class is treated as a separate binary problem, and there is no multiprocessing across these problems.
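To make the idea concrete, here is a minimal sketch of parallelizing across the per-class binary problems. The worker logic (flagging examples whose predicted probability disagrees with the binary label) and all names are hypothetical stand-ins, not cleanlab's implementation.

```python
from concurrent.futures import ProcessPoolExecutor


def check_binary_problem(args):
    """Hypothetical per-class worker: flag indices where the predicted
    probability disagrees with the binary label for this class."""
    labels, probs, threshold = args
    return [i for i, (y, p) in enumerate(zip(labels, probs))
            if (p >= threshold) != bool(y)]


def find_issues_per_class(binary_labels, binary_probs, threshold=0.5, n_jobs=2):
    """Sketch: map each class's binary subproblem to a worker process.

    binary_labels / binary_probs hold one list per class (the one-vs-rest
    decomposition of the multi-label task).
    """
    tasks = [(labels, probs, threshold)
             for labels, probs in zip(binary_labels, binary_probs)]
    with ProcessPoolExecutor(max_workers=n_jobs) as pool:
        # Each class is checked in its own process.
        return list(pool.map(check_binary_problem, tasks))
```

Since the binary subproblems are independent, this kind of per-class parallelism is embarrassingly parallel and should scale well with the number of (physical) cores.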
Right now, using multiprocessing in e.g. `find_label_issues` gives barely any speedup. We should be able to get near-perfect scaling.

Related: we should launch as many processes as there are physical cores, not logical cores, because for our workload hyperthreading doesn't give a speedup (and likely slows us down a bit).
Related: with small datasets, the cost of spawning new processes exceeds the gain from parallelism. A simple fix is to hard-code a dataset size under which we avoid using multiprocessing even when `n_jobs=None`.

Related to #287; should be co-designed with #297.
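A sketch of that fix, assuming a hypothetical helper and threshold (the actual cutoff would need benchmarking):

```python
import multiprocessing

# Hypothetical cutoff; the real value should come from benchmarks.
MIN_EXAMPLES_FOR_MULTIPROCESSING = 10_000


def resolve_n_jobs(n_examples, n_jobs=None):
    """Sketch of how n_jobs=None could avoid process-spawn overhead.

    All names here are illustrative, not cleanlab's actual API.
    """
    if n_jobs is not None:
        return n_jobs  # always respect an explicit user setting
    if n_examples < MIN_EXAMPLES_FOR_MULTIPROCESSING:
        return 1  # spawning processes would cost more than it saves
    return multiprocessing.cpu_count()
```

An explicit `n_jobs` still wins, so users who know their workload can always force parallelism on small datasets.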