cleanlab / cleanlab

The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
https://cleanlab.ai
GNU Affero General Public License v3.0

Improve use of multiprocessing #298

Open anishathalye opened 2 years ago

anishathalye commented 2 years ago

Right now, using multiprocessing in e.g. find_label_issues() gives barely any speedup. We should be able to get near-perfect scalability.

Related: we should launch as many processes as there are physical cores, not logical cores, because for our workload, hyperthreading doesn't give a speedup (likely slows us down a bit).

Related: with small datasets, the cost of spawning new processes exceeds the gain from parallelism. A simple fix is to hard-code a size under which we avoid using multiprocessing even when n_jobs=None.
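The heuristic described above could be sketched roughly like this (names and the exact cutoff are illustrative, not cleanlab's actual API; the threshold would need to be benchmarked):

```python
import os

# Hypothetical cutoff below which process-spawn overhead outweighs
# any gain from parallelism; the real value should come from benchmarks.
MIN_EXAMPLES_FOR_MULTIPROCESSING = 10_000

def resolve_n_jobs(num_examples, n_jobs=None):
    """Sketch of the proposed default behavior.

    - An explicit n_jobs from the caller is always respected.
    - For small datasets, fall back to a single process.
    - Otherwise use the core count (ideally physical cores, see below).
    """
    if n_jobs is not None:
        return n_jobs
    if num_examples < MIN_EXAMPLES_FOR_MULTIPROCESSING:
        return 1  # avoid spawning processes for small data
    return os.cpu_count() or 1
```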

Related to #287, should be co-designed with #297.

cgnorthcutt commented 2 years ago

I recommend making the default num_cpus = threads // 2 on most machines, just to be safe.

On several Intel processors, I have benchmarked and found that using num_cpus() is slower than using num_cpus() // 2. For example, we should use 10 instead of 20 for the Intel Core i9-9820X X-Series Processor.

anishathalye commented 2 years ago

> On several Intel processors, I have benchmarked and found that using num_cpus() is slower than using num_cpus() // 2. For example, we should use 10 instead of 20 for the Intel Core i9-9820X X-Series Processor.

That's because most modern Intel CPUs support hyperthreading, and most people's machines have hyperthreading enabled. We shouldn't blindly use multiprocessing.cpu_count() // 2, because that would only use half the physical cores on machines with hyperthreading disabled. Instead, we should count physical cores directly, e.g. with psutil.cpu_count(logical=False). (Though we probably don't want to add another dependency for this; we could just write the code ourselves.)
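Writing it ourselves might look something like the following sketch: try psutil if it happens to be installed, then parse /proc/cpuinfo on Linux, and fall back to the logical count elsewhere. This is best-effort and platform-dependent, not a complete implementation (macOS/Windows would need their own branches):

```python
import os

def physical_cpu_count():
    """Best-effort count of physical cores; falls back to logical cores.

    Order of attempts:
      1. psutil, if available (not a required dependency).
      2. /proc/cpuinfo on Linux: count unique (physical id, core id) pairs.
      3. os.cpu_count(), which counts logical cores.
    """
    try:
        import psutil
        n = psutil.cpu_count(logical=False)
        if n:
            return n
    except ImportError:
        pass
    try:
        cores = set()
        phys_id = None
        with open("/proc/cpuinfo") as f:
            for line in f:
                key, _, value = line.partition(":")
                key = key.strip()
                if key == "physical id":
                    phys_id = value.strip()
                elif key == "core id":
                    # core id alone is not unique across sockets,
                    # so pair it with the physical (socket) id.
                    cores.add((phys_id, value.strip()))
        if cores:
            return len(cores)
    except OSError:
        pass
    return os.cpu_count() or 1
```

On a hyperthreaded machine this returns roughly half of os.cpu_count(); on machines without SMT, the two agree, which is exactly the case the naive `// 2` gets wrong.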

jwmueller commented 1 year ago

Update: This has been addressed for regular classification tasks since cleanlab v2.3.0 (including highly scalable label issue detection via find_label_issues_batched()).

For multi-label classification datasets, the multiprocessing could still use improvement. Currently the multiprocessing is done across classes in cleanlab.filter.find_label_issues(), but in multi-label classification, each class is treated as a separate binary problem, and there is no multiprocessing across these problems.