biolab / orange3

🍊 :bar_chart: :bulb: Orange: Interactive data analysis
https://orangedatamining.com

Performance in Windows 10 #6842

Open mrahmadt opened 4 months ago

mrahmadt commented 4 months ago

Hello Everyone

Not sure if I'm doing something wrong or if this is the default behavior. I have a server with the following specs:

CPU: 32 x Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz (2 sockets)
Memory: 125GB
Disks: SSD

I installed Proxmox (proxmox.com) and created a Windows 10 virtual machine (32 processors and 64GB memory).

Everything is working fine in Orange, but "Test & Score" takes hours to process a 600M CSV file with "Random Forest", and, strangely, it is not utilizing the machine's full CPU/memory!

Is there anything I can do to make Orange use the full CPU/memory resources?

[Screenshot: 2024-06-27 at 05.15.38]

Orange 3.37.0 (Orange3-3.37.0-Miniconda-x86_64.exe)

thocevar commented 4 months ago

My experiments on Windows show that cross validation and random sampling in Test & Score do not run in parallel and so do not use all CPUs.

Random forest and other trivially parallelizable methods (e.g. XGBoost) could be parallelized with the n_jobs parameter even for a single training/prediction call (such as testing on test data), but currently are not.

Combining both could be problematic because it may spawn more threads than there are processors. The only exception I found is Logistic Regression, which always utilizes all CPUs, but probably at some lower level.
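To make the oversubscription concern concrete, here is a minimal sketch of how a CPU budget could be split between an outer level (e.g. cross-validation folds) and an inner level (e.g. a model's n_jobs). The helper name split_cpu_budget and the splitting rule are hypothetical and only illustrate the arithmetic; this is not how Orange does it.

```python
import os


def split_cpu_budget(n_cpus, n_outer_tasks):
    """Hypothetical helper: split CPUs between outer tasks and inner jobs.

    Returns (outer_workers, inner_jobs) such that
    outer_workers * inner_jobs never exceeds n_cpus.
    """
    # Never run more outer workers than there are CPUs or tasks.
    outer_workers = max(1, min(n_cpus, n_outer_tasks))
    # Give each outer worker an equal share of the remaining capacity.
    inner_jobs = max(1, n_cpus // outer_workers)
    return outer_workers, inner_jobs


if __name__ == "__main__":
    cpus = os.cpu_count() or 1
    outer, inner = split_cpu_budget(cpus, n_outer_tasks=10)
    print(f"{cpus} CPUs: {outer} outer workers x {inner} inner jobs")
```

Without such a cap, 10 folds running in parallel, each with a model using all 32 cores, could request up to 320 threads on a 32-core machine.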

This needs further discussion.

thocevar commented 4 months ago

Parallelization in Test & Score was intentionally removed in https://github.com/biolab/orange3/pull/2300/commits/1f8d008b84e9e7c3bd54e79a662779e914eb6443.

The easiest way of re-introducing parallelization would be on the level of individual models (e.g. random forest), where scikit-learn takes care of it (n_jobs=-1).
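For reference, a minimal sketch of what model-level parallelization looks like in plain scikit-learn (outside Orange). The synthetic dataset and the parameter values are assumptions chosen only for illustration; the folds run sequentially, and only the forest itself uses all cores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data, used here only as a stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# n_jobs=-1 asks scikit-learn to build (and predict with) the trees
# using all available cores.
forest = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)

# cross_val_score keeps its default n_jobs=None, so folds are evaluated
# one after another and parallelism happens only inside the forest.
scores = cross_val_score(forest, X, y, cv=5)
print("mean accuracy:", scores.mean())
```

Keeping the outer loop sequential avoids the nested-parallelism issue mentioned earlier in the thread, at the cost of not parallelizing learners that have no n_jobs option.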