Project-MONAI / tutorials

AutoRunner GPU usage #1165

Open udiram opened 1 year ago

udiram commented 1 year ago

Describe the bug
AutoRunner spikes CPU usage and the run ends with SIGKILL (signal 9). PyTorch recognizes the GPU, and other scripts utilize the GPU.

To Reproduce
Steps to reproduce the behavior:

  1. run autorunner on HPC GPU instance
  2. run data analysis
  3. run algorithm generation
  4. the failure occurs at Step 3: model training, validation, and inference

Expected behavior
The GPU is utilized and the run completes without failure.

Screenshots
See the CPU/GPU traces below.

Environment (please complete the following information):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Error:

exouser@auto3dseg:~/Documents$ python -m monai.apps.auto3dseg AutoRunner run --input='./task.yaml'
2023-01-13 15:12:06,572 - INFO - AutoRunner using work directory ./work_dir
2023-01-13 15:12:06,574 - INFO - Loading input config ./task.yaml
2023-01-13 15:12:06,604 - INFO - The output_dir is not specified. /home/exouser/Documents/work_dir/ensemble_output will be used to save ensemble predictions
2023-01-13 15:12:06,604 - INFO - Skipping data analysis...
2023-01-13 15:12:06,604 - INFO - Skipping algorithm generation...
2023-01-13 15:12:06,616 - INFO - Launching: python /home/exouser/Documents/work_dir/dints_0/scripts/search.py run --config_file='/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_infer.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_validate.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_train.yaml','/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters.yaml'
[info] number of GPUs: 1
[info] world_size: 1
train_files_w: 64
train_files_a: 64
val_files: 33
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/__main__.py", line 24, in <module>
    fire.Fire(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/auto_runner.py", line 685, in run
    self._train_algo_in_sequence(history)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/auto_runner.py", line 557, in _train_algo_in_sequence
    algo.train(self.train_params)
  File "/home/exouser/Documents/work_dir/algorithm_templates/dints/scripts/algo.py", line 398, in train
    self._run_cmd(cmd_search, devices_info)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 191, in _run_cmd
    normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/home/exouser/Documents/work_dir/dints_0/scripts/search.py', 'run', "--config_file='/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_infer.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_validate.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_train.yaml','/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters.yaml'"]' died with <Signals.SIGKILL: 9>.

CPU/GPU traces from start till crash:

[Screenshot 2023-01-13 at 10:30:17 AM]
mingxin-zheng commented 1 year ago

Hi @udiram, what CPU does your system have? My initial guess is that the DataLoader creates too many worker processes and gets stuck, and a timeout then triggers the kill signal. You can try the Python API set_training_params and set num_workers and cache_num_workers to 2 or 4 to see if the issue still exists, e.g. as in the sketch below.
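A minimal sketch of that suggestion through the Python API (assuming the same ./task.yaml and work directory used in the command above; whether a given algorithm template honors both keys may vary):

# Sketch: run AutoRunner from Python and cap the loader/cache worker counts,
# mirroring the num_workers / cache_num_workers suggestion above.
from monai.apps.auto3dseg import AutoRunner

if __name__ == "__main__":
    runner = AutoRunner(input="./task.yaml", work_dir="./work_dir")
    runner.set_training_params({"num_workers": 2, "cache_num_workers": 2})
    runner.run()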

@dongyang0122 What else do you think could have caused this?

udiram commented 1 year ago

@mingxin-zheng here's the output from lscpu. I will try the Python API and explicitly set num_workers lower to see if that helps. Thanks for the suggestion.

Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           40 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  32
On-line CPU(s) list:     0-31
Vendor ID:               AuthenticAMD
Model name:              AMD EPYC-Milan Processor
CPU family:              25
Model:                   1
Thread(s) per core:      1
Core(s) per socket:      1
Socket(s):               32
Stepping:                1
BogoMIPS:                3992.50
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nripsave umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
Virtualization features:
  Virtualization:        AMD-V
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   1 MiB (32 instances)
  L1i:                   1 MiB (32 instances)
  L2:                    16 MiB (32 instances)
  L3:                    1 GiB (32 instances)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl and seccomp
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected

mingxin-zheng commented 1 year ago

Looking at the CPU model, I don't think it is the bottleneck...

By the way, AutoRunner will create some cache files in the output working folders. It is helpful to delete the cache or change the work_dir to a new location to debug issues.
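For example, a small sketch of that clean-up before re-running (the directory names are just the ones used in this thread; "./work_dir_clean" is an arbitrary new path):

# Sketch: remove the stale cache/work directory and point AutoRunner at a
# fresh location so no previously generated files are reused.
import shutil
from pathlib import Path

from monai.apps.auto3dseg import AutoRunner

if Path("./work_dir").exists():
    shutil.rmtree("./work_dir")

runner = AutoRunner(input="./task.yaml", work_dir="./work_dir_clean")
runner.run()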

udiram commented 1 year ago

Hi @mingxin-zheng, following up on this: I changed the directory location and cleared it out completely, but no luck. Is it normal for GPU activity to be this low when model training starts? Also, is there any other debug info you need to see?

mingxin-zheng commented 1 year ago

Hi @dongyang0122 , have you run into similar issues before?

mingxin-zheng commented 1 year ago

Hi @udiram , today I ran into a similar issue as you did.

In my case, I tried Auto3DSeg on a large dataset (~1000 CT images) and my program was silently killed because my RAM (~50 GB) could not cache the whole dataset.

The issue was previously reported and I provided a temporary solution in this post. Please check it out and see if it solves your problem.

udiram commented 1 year ago

Hi @mingxin-zheng, thanks for the pointer. Would I just run the same command with the algo hash set in front, or is there a more permanent fix in place that I should be trying?

mingxin-zheng commented 1 year ago

@udiram please use the HASH for now. The permanent fix will be in a week or so.

mingxin-zheng commented 1 year ago

Forgot to mention: as noted in the post, you need to specify "cache_rate" to lower the memory usage requirement.

udiram commented 1 year ago

Got it, thanks. I will try it out and close the issue once everything's working. Thanks for the help!

udiram commented 1 year ago

@mingxin-zheng, just tried with the same algo hash and the following script

from monai.apps.auto3dseg import AutoRunner

if __name__ == '__main__':
    runner = AutoRunner(
        input={"name": "Task500_AMOS", "task": "segmentation", "modality": "CT",
               "datalist": "datalist.json", "dataroot": "data"},
        work_dir="work_dir/", analyze=True, algo_gen=True)

    runner.set_training_params({"cache_rate": 0.2})
    runner.run()

I ran it as follows: ALGO_HASH=3f56d77 python autorunner.py

The output was the same SIGKILL 9.

Please advise.

mingxin-zheng commented 1 year ago

Can you check the system memory and see if it is near its limit?
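One quick way to watch that while training runs (a sketch using psutil, which is an extra dependency not mentioned in this thread):

# Sketch: print system RAM usage every 30 s in a separate terminal while the
# AutoRunner job is running; requires `pip install psutil`.
import time

import psutil

while True:
    mem = psutil.virtual_memory()
    print(f"RAM used: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB ({mem.percent:.0f}%)")
    time.sleep(30)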

udiram commented 1 year ago

MemTotal: 128813956 kB is the output from grep MemTotal /proc/meminfo

I even tried with a cache_rate of 0.1, no luck.

dongyang0122 commented 1 year ago

Hi @udiram, could you share how large your dataset is? It is possible that the OOM is caused by data caching after pre-processing. When you adjust the cache rate, does the training process run for a while or stop immediately? Another thing you can check is the resampling spacing once the algorithm configurations are generated. You can adjust that value (in the same way as cache_rate) to a coarser spacing to make sure the program runs properly first; see the sketch below.
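A sketch of that suggestion: cache_rate is passed through set_training_params as before, and a coarser spacing can be overridden the same way. The actual key name for the spacing lives in the generated algorithm's configs/hyper_parameters.yaml; "resample_to_spacing" below is only an assumed example and should be checked against the generated config.

# Sketch only: lower cache_rate and (with an assumed key name) coarsen the
# resampling spacing via the training-params override mechanism.
from monai.apps.auto3dseg import AutoRunner

runner = AutoRunner(input="./task.yaml", work_dir="./work_dir")
runner.set_training_params({
    "cache_rate": 0.2,                       # cache only part of the dataset in RAM
    "resample_to_spacing": [3.0, 3.0, 3.0],  # assumed key; confirm in work_dir/<algo>/configs/hyper_parameters.yaml
})
runner.run()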

udiram commented 1 year ago

hi @dongyang0122, here's the breakdown of the image counts: train_files_w: 96, train_files_a: 96, val_files: 48

The data folder I read from is 13 GB and is formatted as a standard MSD dataset (imagesTr, labelsTr, imagesTs).
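For context, the datalist.json referenced above follows the usual MONAI-style layout for MSD data; a minimal sketch (file names are illustrative placeholders, paths relative to dataroot) looks like this:

# Sketch of a MONAI-style datalist for an MSD-layout dataset; the entries and
# file names below are placeholders only.
import json

datalist = {
    "training": [
        {"image": "imagesTr/case_0001.nii.gz", "label": "labelsTr/case_0001.nii.gz", "fold": 0},
        {"image": "imagesTr/case_0002.nii.gz", "label": "labelsTr/case_0002.nii.gz", "fold": 1},
    ],
    "testing": [
        {"image": "imagesTs/case_0101.nii.gz"},
    ],
}

with open("datalist.json", "w") as f:
    json.dump(datalist, f, indent=4)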

The training process stays like this:

[screenshot]

for some time, then fails. There's no GPU activity during that time, only CPU, and changing the cache rate doesn't seem to affect how long it takes to crash, but I will try to quantify this.

[Screenshot 2023-01-18 at 10:08:13 AM]

what values would you suggest for the resampling spacing?

udiram commented 1 year ago

hi @dongyang0122 @mingxin-zheng just following up to see if there are any suggestions for this.

I went into the dints_0 configs and changed a number of variables to minimize RAM usage (batch size, resampling, and so on), but still no luck.

thanks

mingxin-zheng commented 1 year ago

Hi @udiram , sorry to hear you are still experiencing the issue; I just came back from the CNY holidays. Have you tried networks other than dints? How does it look with segresnet and segresnet2d?

In the __init__ of AutoRunner, you can specify algos=['segresnet'], as described in the doc.

udiram commented 1 year ago

@mingxin-zheng just tried to run it with only segresnet, but the same issue persists; see the updated AutoRunner script below:

from monai.apps.auto3dseg import AutoRunner

if __name__ == '__main__':
    runner = AutoRunner(
        input={"name": "Task500_AMOS", "task": "segmentation", "modality": "CT",
               "datalist": "datalist.json", "dataroot": "data"},
        work_dir="work_dir/", analyze=True, algo_gen=True, algos=['segresnet'])

    runner.set_training_params({"cache_rate": 0.1})
    runner.run()

Running it using the same algo hash: ALGO_HASH=3f56d77 python autorunner.py

Please advise what you think the solution may be.

thanks

udiram commented 1 year ago

Hi @mingxin-zheng @dongyang0122, I've seen some activity on auto3dseg in the past couple of days; would any of the latest commits help with the issue here? Please let me know if there's anything you would like me to try on my end.

Thanks

mingxin-zheng commented 1 year ago

Hi @udiram , I am still not able to find the cause of your issue. The recent updates are minor, such as typo corrections.

To replicate your work, can you let me know if this dataset ("AMOS") is available somewhere? It would also be good if you can let me know any other tools you've been using for profiling.

udiram commented 1 year ago

Hi @mingxin-zheng, thanks for the message. The AMOS dataset should be publicly available here: https://zenodo.org/record/7155725#.Y0OOCOxBztM

I've been able to run this exact same dataset with nnunet and a standard MONAI U-Net, with no such issues. Please let me know if you need anything else,

Thanks

mingxin-zheng commented 1 year ago

@udiram Did you create a specific datalist, or generate one randomly?

udiram commented 1 year ago

@mingxin-zheng I used the datalist generator from my PR

udiram commented 1 year ago

@mingxin-zheng just adding on to this, would it help for me to send you a copy of my datalist so that you have the exact files that I used?

mingxin-zheng commented 1 year ago

That would be helpful, @udiram. Can you post yours here? Thank you!

udiram commented 1 year ago

Hi @mingxin-zheng here's the data I'm using:

https://github.com/udiram/Multi-Modality-Abdominal-Multi-Organ-Segmentation-Challenge-2022/tree/release/auto3dseg_data

Hope this helps!

mingxin-zheng commented 1 year ago

Hi @udiram , I am able to run the training for the datalist you provided. Here is my script:

from monai.apps.auto3dseg import AutoRunner
input_cfg = {
    "name": "Task500_AMOS",
    "task": "segmentation",
    "modality": "CT",
    "datalist": "datalist.json",
    "dataroot": "/datasets/amos/amos22/",
}

runner = AutoRunner(input=input_cfg)
runner.set_training_params({"num_epochs": 5})
runner.run()

The training is still running and I'll report back the resources it has taken when it's done.

udiram commented 1 year ago

Thanks for all your efforts, @mingxin-zheng. I will try the same script on my end and see if it helps. Additionally, are you still using the HASH when you call AutoRunner?

udiram commented 1 year ago

@mingxin-zheng as an update, I'm still running into the same issue, both with and without the ALGO_HASH, using the exact same datalist, dataroot, and config as you. Feel free to let me know how much in resources it takes when it's finished; maybe the 125 GB I have on my instance isn't sufficient for this amount of data.

On further investigation, I created a completely new dataset and datalist with roughly half the number of images. This seems to work and trains for the moment, using about 75% of my 125 GB. I do run into additional issues now:

RuntimeError: received 0 items of ancdata

This was preceded by:

no available indices of class 14 to crop, set the crop ratio of this class to zero.
no available indices of class 15 to crop, set the crop ratio of this class to zero.
no available indices of class 14 to crop, set the crop ratio of this class to zero.

and followed by:

loss_torch_epoch = loss_torch[0] / loss_torch[1]
ZeroDivisionError: float division by zero

where it ultimately crashes with the following log:

returned non-zero exit status 1.

It would be great to get some clarification on these errors and, in a larger sense, for AutoRunner to cache only as many images as fit in system RAM instead of failing with the ambiguous SIGKILL error we have been trying to solve; a rough sketch of that idea follows.
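As an illustration of that request only (not an existing AutoRunner feature), one could estimate a cache_rate from the currently available RAM before starting the run; psutil, nibabel, and the per-image size heuristic are all assumptions here:

# Sketch only: choose cache_rate so that the cached, preprocessed images are
# unlikely to exceed a fraction of the currently available RAM. The per-image
# estimate (float32 voxels scaled by a rough preprocessing factor) is a
# heuristic, not part of Auto3DSeg.
from pathlib import Path

import nibabel as nib
import numpy as np
import psutil

def estimate_cache_rate(image_dir: str, budget_fraction: float = 0.5,
                        preprocess_factor: float = 4.0) -> float:
    images = sorted(Path(image_dir).glob("*.nii.gz"))
    sizes = [np.prod(nib.load(str(p)).shape) * 4 for p in images]  # bytes as float32
    est_total = sum(sizes) * preprocess_factor
    budget = psutil.virtual_memory().available * budget_fraction
    return float(min(1.0, budget / max(est_total, 1)))

print(estimate_cache_rate("data/imagesTr"))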

mingxin-zheng commented 1 year ago

Hi @udiram , as an update, here are the training logs and the resources. I think this training is very demanding on resources; as a reference, I ran it on a system with a single A100-80G and 1 TB of RAM, and monitored the resources every 30 s. Attached: auto_runner.log, output

udiram commented 1 year ago

hi @mingxin-zheng thanks for the update, I appreciate it. It makes sense that a lot more RAM would be required in this case.

It would be great to have some kind of user-facing warning, or a note in the tutorial, so that people expect a SIGKILL in these situations going forward.

Additionally, are you able to provide any context on the errors I outlined above, the ones happening during training?

I appreciate all the support!