aai-institute / pyDVL

pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
https://pydvl.org
GNU Lesser General Public License v3.0

Verification of correct usage of resources. #425

Closed kosmitive closed 1 year ago

kosmitive commented 1 year ago

We should test for this and emit a warning if the expected number of cores in use exceeds the number of physical cores, e.g. when multi-processing is combined with multi-threading.

mdbenito commented 1 year ago

I'm not sure I follow. What's the "expected number of cores"? If you mean checking whether the user is oversubscribing the system, this would be useful but seems hard to do reliably. One would need a separate thread counting child processes and comparing that count with n_jobs, but the user might legitimately want to start 10 jobs each using 4 cores, in order to use 40 vCPUs. Alternatively, one could compare the number of child processes with num_available_cpus() or whatever, but then again, this does not protect against multithreaded parallelization inside worker processes, e.g. because of linear algebra libraries. In the end it seems better to educate users about the complexity of parallelising, with better and more thorough documentation, and lots of repetition in as many places as possible.
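
To make the second alternative concrete, here is a minimal sketch of such a heuristic, using only the standard library. The helper names (`available_cpus`, `threads_per_worker`, `warn_if_oversubscribed`) are hypothetical and not part of pyDVL or joblib. It assumes that the thread count of each worker can be guessed from the usual environment variables, which is exactly the part that is hard to get right: if none of them is set, most linear algebra libraries pick their own default, and this check underestimates the load.

```python
import os
import warnings
from typing import Optional


def available_cpus() -> int:
    """CPUs usable by this process (hypothetical helper, not a pyDVL function)."""
    try:
        # Linux only: honours the CPU affinity mask of the current process.
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # Other platforms: fall back to the number of logical CPUs.
        return os.cpu_count() or 1


def threads_per_worker(default: int = 1) -> int:
    """Guess the threads per worker from common environment variables.

    Assumption: an unset variable means `default`. In practice BLAS/OpenMP
    libraries often default to all cores, so this is only a lower bound.
    """
    for var in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        value = os.environ.get(var, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return default


def warn_if_oversubscribed(n_jobs: int, n_threads: Optional[int] = None) -> bool:
    """Warn and return True if n_jobs * n_threads exceeds the usable CPUs."""
    if n_threads is None:
        n_threads = threads_per_worker()
    requested = n_jobs * n_threads
    cpus = available_cpus()
    if requested > cpus:
        warnings.warn(
            f"{n_jobs} jobs x {n_threads} threads = {requested} threads "
            f"requested, but only {cpus} CPUs are available.",
            RuntimeWarning,
        )
        return True
    return False
```

A call like `warn_if_oversubscribed(n_jobs=10, n_threads=4)` stays silent on a machine with 40 usable CPUs and warns on a smaller one. It cannot see threads started by libraries inside the workers, so it would complement better documentation rather than replace it.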

kosmitive commented 1 year ago

Exact checks might be difficult. We could still issue warnings for potential oversubscription:

  1. Watch the stale (idle) time of threads and processes.
  2. Check whether two subprocesses are assigned to the same CPU cores, assuming that each subprocess locks in its CPUs at the beginning.

mdbenito commented 1 year ago

  1. Processes can be in a waiting state for many reasons, some of them benign, like just waiting in a process pool. I don't think this would be informative. Since we don't use threads ourselves, any threads we have are created for queue management and similar tasks by lower-level libraries to which we don't have access. Also, it is unclear what we would gain from this.
  2. This is neither a sufficient nor a necessary condition for oversubscription, but it could be informative as an indicator of something being wrong, provided that we set CPU affinity on subprocesses and didn't use process pools, which by design typically spawn more processes than there are cores (AFAIK). If we somehow made this useful, it would have to be at the level of joblib, which is something we don't want to do.
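
For reference, a rough sketch of what the check in point 2 could look like. The function names are hypothetical. It assumes Linux (`os.sched_getaffinity` does not exist elsewhere) and that each worker has been pinned explicitly: without pinning, every process reports all CPUs and every pair overlaps, which is why this is at best a weak indicator.

```python
import os
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def read_affinities(pids: Iterable[int]) -> Dict[int, Set[int]]:
    """Map each PID to the set of CPUs it may run on (Linux only)."""
    return {pid: set(os.sched_getaffinity(pid)) for pid in pids}


def overlapping_pairs(
    affinities: Dict[int, Set[int]]
) -> List[Tuple[int, int, Set[int]]]:
    """Return (pid_a, pid_b, shared_cpus) for every pair sharing a CPU."""
    pairs = []
    for a, b in combinations(sorted(affinities), 2):
        shared = affinities[a] & affinities[b]
        if shared:
            pairs.append((a, b, shared))
    return pairs


# Made-up PIDs and CPU sets, standing in for read_affinities(worker_pids):
example = {101: {0, 1}, 102: {1, 2}, 103: {3}}
print(overlapping_pairs(example))  # -> [(101, 102, {1})]
```

Collecting the worker PIDs is the part that would have to happen inside joblib or the process pool, so the sketch leaves it out.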

All in all, I think we need to educate users better. Automagic problem finding feels like a rabbit hole.