DeepRegNet / DeepReg

Medical image registration using deep learning
Apache License 2.0

Issue #720: enable defining num_parallel_calls in config #724

Closed. mathpluscode closed this pull request 3 years ago.

mathpluscode commented 3 years ago

Description

Fixes #720
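For context, a rough sketch of the behaviour this enables: a config value is forwarded to the num_parallel_calls argument of the tf.data preprocessing step. The function name and defaulting logic below are illustrative, not the exact DeepReg diff:

```python
# Hypothetical sketch of forwarding a config value to tf.data (not the exact DeepReg change).
import tensorflow as tf

def build_dataset(file_paths, preprocess_fn, num_parallel_calls=None):
    # None falls back to tf.data.AUTOTUNE (-1): TensorFlow tunes the parallelism itself.
    if num_parallel_calls is None:
        num_parallel_calls = tf.data.AUTOTUNE
    ds = tf.data.Dataset.from_tensor_slices(file_paths)
    # num_parallel_calls controls how many elements are preprocessed concurrently.
    return ds.map(preprocess_fn, num_parallel_calls=num_parallel_calls)
```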

Type of change

What types of changes does your code introduce to DeepReg?

Please check the boxes that apply after submitting the pull request.

Checklist

Please check the boxes that apply after submitting the pull request.

If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

codecov[bot] commented 3 years ago

Codecov Report

Merging #724 (767447e) into main (6b1f4e0) will not change coverage. The diff coverage is 100.00%.


@@            Coverage Diff            @@
##              main      #724   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           41        41           
  Lines         2480      2484    +4     
=========================================
+ Hits          2480      2484    +4     
Impacted Files                        | Coverage Δ
deepreg/model/network.py              | 100.00% <ø> (ø)
deepreg/dataset/loader/interface.py   | 100.00% <100.00%> (ø)
deepreg/predict.py                    | 100.00% <100.00%> (ø)
deepreg/train.py                      | 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6b1f4e0...767447e.

fepegar commented 3 years ago

I know very little about parallel processing, but I have some comments.

  1. Is num_cpus descriptive? I would say CPU cores, so maybe num_cores. PyTorch uses num_workers. I'm not sure it's equivalent, though.
  2. Is it a good idea to use all of them by default? What about disabling multiprocessing by default and letting the user enable it by choosing more than one core? As a reference, PyTorch disables multiprocessing when num_workers is 0, and users have to explicitly set num_workers > 0 (see the sketch after this comment). I haven't seen memory issues on the DGX (since February), but when one or more users are training simultaneously using all cores, everything (even ls) becomes much slower. If users set e.g. 10 cores per GPU, I suspect things would run faster for everyone, and that is more likely to happen if users have to set up the multiprocessing settings manually.
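For reference, a minimal sketch of that PyTorch convention (illustrative only, not DeepReg code):

```python
# PyTorch convention: num_workers=0 keeps data loading in the main process
# (no multiprocessing); parallel workers are an explicit opt-in.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.zeros(8, 1))
serial_loader = DataLoader(dataset, batch_size=2, num_workers=0)    # default: no worker processes
parallel_loader = DataLoader(dataset, batch_size=2, num_workers=4)  # explicit opt-in to 4 workers
```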
zacbaum commented 3 years ago

I know very little about parallel processing, but I have some comments.

  1. Is num_cpus descriptive? I would say CPU cores, so maybe num_cores. PyTorch uses num_workers. I'm not sure it's equivalent, though.

In TF (at least for what we're doing here) it is the number of scheduled cores that are used - so num_cores is likely better.

  2. Is it a good idea to use all of them by default? What about disabling multiprocessing by default and letting the user enable it by choosing more than one core? As a reference, PyTorch disables multiprocessing when num_workers is 0, and users have to explicitly set num_workers > 0. I haven't seen memory issues on the DGX (since February), but when one or more users are training simultaneously using all cores, everything (even ls) becomes much slower. If users set e.g. 10 cores per GPU, I suspect things would run faster for everyone, and that is more likely to happen if users have to set up the multiprocessing settings manually.

It's definitely not a good default - especially where most of us have been using DeepReg (on shared systems), as it can lock things up if the machine has few cores / little RAM in general (as is the case with the PTs). In DeepReg, setting the cores available to the data loaders will not change the amount of multiprocessing, etc. that happens through the workers argument in model.fit(); the two can be set independently.
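A hedged sketch of the two separate knobs being discussed (toy data; the values are illustrative, not the DeepReg defaults):

```python
import tensorflow as tf

# Knob 1: parallelism of the tf.data preprocessing pipeline.
ds = tf.data.Dataset.range(64)
ds = ds.map(lambda x: (tf.reshape(tf.cast(x, tf.float32), (1,)),) * 2,
            num_parallel_calls=4)
ds = ds.batch(8)

# Knob 2: the workers / use_multiprocessing arguments of Keras' fit(); these only
# affect Python generator / Sequence inputs and are independent of the setting above.
model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
model.compile(optimizer="sgd", loss="mse")
model.fit(ds, epochs=1)  # workers= / use_multiprocessing= would be set here for generator inputs
```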

Aside: The reason ls will be slower is due to RAM being over-allocated. Metadata for these commands is cached in RAM, so when that runs out, the OS has to pull things from disk.

fepegar commented 3 years ago

Thanks, @zacbaum.

Aside: The reason ls will be slower is due to RAM being over-allocated. Metadata for these commands is cached in RAM, so when that runs out, the OS has to pull things from disk.

Are you sure? I've noticed this when all cores are being 100% used, but there's a lot of RAM still available.

zacbaum commented 3 years ago

Are you sure? I've noticed this when all cores are being 100% used, but there's a lot of RAM still available.

Well, if there's no RAM left it will be, but it could also be compounded by a lack of CPU - entirely possible! It's also likely if CPU utilization is above ~90% or so.

mathpluscode commented 3 years ago

@zacbaum @fepegar Hi, I've renamed the vars and changed the default behaviour :) please approve if it looks ok, thx!

mathpluscode commented 3 years ago

We need to be consistent throughout - things refer to parallel calls, CPUs, workers; to someone who has no idea what this means, this isn't helpful.

Can we add some more detail to the docs for this, to also say why someone might use it? (e.g. a shared resource where you don't want to take all of it ;))

I tried again :)

number of elements to process asynchronously in parallel
during preprocessing; -1 (AUTOTUNE) means the parallelism is not limited and is
tuned automatically. Heuristically, it should be set to the number of CPU cores available.

@zacbaum see if it's clear to you now, if not plz propose one lol
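For context, a small sketch of how such a value is typically consumed by tf.data (illustrative; in TF 2.x the constant tf.data.AUTOTUNE equals -1):

```python
import tensorflow as tf

print(tf.data.AUTOTUNE)  # -1: TensorFlow picks the level of parallelism dynamically

ds = tf.data.Dataset.range(100)
# Process up to 8 elements asynchronously in parallel during preprocessing; a common
# heuristic is the number of CPU cores available, or simply AUTOTUNE.
ds = ds.map(lambda x: x * 2, num_parallel_calls=8)
```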

mathpluscode commented 3 years ago

@zacbaum :0 I'm still waiting for approvals :)

fepegar commented 3 years ago

@mathpluscode By default, all CPU cores are used?

zacbaum commented 3 years ago

@zacbaum :0 I'm still waiting for approvals :)

@mathpluscode, I have left a few comments --

I (along with many others) was not working during the Easter break, and certainly not at 11:56pm on a Friday when you tagged me, so you'll have to forgive the wait...

mathpluscode commented 3 years ago

@mathpluscode By default, all CPU cores are used?

By default, CPU usage is already limited to 1 at the very beginning through the environment variables, so num_parallel_calls will be bounded by that.

I insist on not setting both to 1: if the user does not want any limit, they would have to modify two values. As it stands, you can limit the CPUs with a single value, since the other one is bounded by it.
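A heavily hedged sketch of that bounding behaviour (the exact environment variables DeepReg sets are not shown in this thread; OMP_NUM_THREADS below is only an assumed example):

```python
# Assumption: a process-wide thread limit (e.g. OMP_NUM_THREADS=1) is set at startup.
# A larger num_parallel_calls from the config is then effectively capped by that limit,
# so lifting a single value is enough to remove the restriction.
import os

os.environ.setdefault("OMP_NUM_THREADS", "1")   # assumed startup limit

cpu_limit = int(os.environ["OMP_NUM_THREADS"])
num_parallel_calls = 8                          # value from the config
effective_parallelism = min(num_parallel_calls, cpu_limit)
print(effective_parallelism)                    # 1: bounded by the startup limit
```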