udiram opened this issue 1 year ago
Hi @udiram , what's the CPU in your system? My initial guess is that the DataLoader creates too many threads and gets stuck, and a timeout then triggers the kill signal. You can try the Python API set_training_params and set num_workers and cache_num_workers to 2 or 4 to see if the issue still exists.
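As a sketch of that override (the worker counts are illustrative, and the commented lines assume MONAI's Auto3DSeg AutoRunner API as described in this thread):

```python
# Hypothetical override dict for AutoRunner.set_training_params (MONAI Auto3DSeg):
# fewer DataLoader workers means fewer threads/processes competing for CPU.
training_params = {
    "num_workers": 2,        # DataLoader worker processes per loader
    "cache_num_workers": 2,  # workers used while filling the dataset cache
}

# With MONAI installed, this would be applied roughly as:
# from monai.apps.auto3dseg import AutoRunner
# runner = AutoRunner(input="./task.yaml")
# runner.set_training_params(training_params)
# runner.run()
print(training_params)
```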
@dongyang0122 What else do you think could cause this?
@mingxin-zheng here's the output from lscpu. I will try the Python API and explicitly set num_workers lower to see if that helps. Thanks for the suggestion.
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 40 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD EPYC-Milan Processor
CPU family: 25
Model: 1
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 32
Stepping: 1
BogoMIPS: 3992.50
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nripsave umip pku ospke vaes vpclmulqdq rdpid arch_capabilities
Virtualization features:
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
Caches (sum of all):
L1d: 1 MiB (32 instances)
L1i: 1 MiB (32 instances)
L2: 16 MiB (32 instances)
L3: 1 GiB (32 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
Looking at the CPU model, I don't think it is the bottleneck...
By the way, AutoRunner creates some cache files in the output working folders. It can help to delete the cache or change work_dir to a new location when debugging issues.
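For instance, a tiny sketch that picks a fresh working directory per debug run, so stale caches can be ruled out (the directory prefix is just an example):

```python
# Start each debug run from an empty work_dir so no cached AutoRunner state
# from a previous run can interfere.
import os
import tempfile

work_dir = tempfile.mkdtemp(prefix="auto3dseg_work_")
print("fresh work_dir:", work_dir)
# AutoRunner(input=..., work_dir=work_dir) would then start with no cached state.
```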
Hi @mingxin-zheng, following up on this: I changed the directory location and cleared it out completely, but no luck. Is it normal that the GPU activity is so low when model training starts? Also, is there any other debug info you need to see?
Hi @dongyang0122 , have you run into similar issues before?
Hi @udiram , today I ran into a similar issue as you did.
In my case, I tried Auto3DSeg on a large dataset (~1000 CT images) and my program got silently killed because my RAM (~50 GB) could not cache the whole dataset.
The issue was previously reported and I provided a temporary solution in this post. Please check it out and see if it solves your problem.
Hi @mingxin-zheng, thanks for the pointer, so would I just run the same command with the algo hash in front? Or is there a more permanent fix in place that I should be trying?
@udiram please use the HASH for now. The permanent fix will be in a week or so.
Forgot to mention: as said in the post, you need to specify "cache_rate" to lower the memory usage requirement.
Got it, thanks, will try it out and close the issue once everything's working, thanks for the help!
@mingxin-zheng, just tried with the same algo hash and the following script
from monai.apps.auto3dseg import AutoRunner

if __name__ == '__main__':
    runner = AutoRunner(
        input={
            "name": "Task500_AMOS",
            "task": "segmentation",
            "modality": "CT",
            "datalist": "datalist.json",
            "dataroot": "data",
        },
        work_dir="work_dir/",
        analyze=True,
        algo_gen=True,
    )
    runner.set_training_params({"cache_rate": 0.2})
    runner.run()
I ran it as follows: ALGO_HASH=3f56d77 python autorunner.py
The output was the same SIGKILL 9.
Please advise.
Can you check the system memory and see if it is near the limit of the system?
The output of grep MemTotal /proc/meminfo is MemTotal: 128813956 kB.
I even tried with a cache_rate of 0.1, no luck.
Hi @udiram, could you share how large your dataset is? It is possible that the OOM is caused by data caching after pre-processing. When you adjust the cache rate, does the training process run for a while or stop immediately? Another thing you can check is the resampling spacing once the algorithm configurations are generated. You can adjust that value (the same way you adjust cache_rate) to a coarser spacing to make sure the program runs properly first.
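To see why a coarser spacing helps with memory, a quick back-of-the-envelope sketch (the spacings below are examples, not recommended values):

```python
# Resampled volume size scales with the inverse cube of the spacing ratio:
# doubling the spacing on every axis cuts each cached volume to 1/8 its voxels.
fine = (1.0, 1.0, 1.0)    # example target spacing in mm
coarse = (2.0, 2.0, 2.0)  # example coarser spacing in mm

voxel_ratio = 1.0
for f, c in zip(fine, coarse):
    voxel_ratio *= f / c

print(f"coarse volumes use {voxel_ratio:.3f}x the voxels (and roughly that much RAM)")
```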
hi @dongyang0122, here's the breakdown of image counts: train_files_w: 96, train_files_a: 96, val_files: 48.
The data folder I read from is 13 GB, formatted as a standard MSD dataset (imagesTr, labelsTr, imagesTs).
The training process stays like this for some time, then fails. There's no GPU activity during that time, only CPU, and changing the cache rate doesn't seem to affect how long it takes to crash, but I will try to quantify this.
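One way to quantify it: a small sketch that logs free memory while training runs, to correlate cache_rate with time-to-OOM (Linux only; reads /proc/meminfo):

```python
# Log MemAvailable periodically alongside training to see how fast the cache
# eats RAM before the SIGKILL arrives.
import re
import time

def available_mem_gb() -> float:
    # MemAvailable is the kernel's estimate of memory usable without swapping.
    with open("/proc/meminfo") as f:
        text = f.read()
    kb = int(re.search(r"MemAvailable:\s+(\d+) kB", text).group(1))
    return kb / 1024 / 1024

# Example monitoring loop (run in a second terminal while training):
# while True:
#     print(f"{time.strftime('%H:%M:%S')}  avail={available_mem_gb():.1f} GiB")
#     time.sleep(30)
print(f"available now: {available_mem_gb():.1f} GiB")
```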
what values would you suggest for the resampling spacing?
hi @dongyang0122 @mingxin-zheng just following up to see if there are any suggestions for this.
I went into the dints_0 configs and changed a number of variables to minimize RAM usage (batch size, resampling, and so on), still no luck.
thanks
Hi @udiram , sorry to hear you are still experiencing the issue. I just came back from the CNY holidays. Have you tried networks other than dints? How do segresnet and segresnet2d look?
In the __init__ of AutoRunner, you can specify algos=['segresnet'], as described in the doc.
@mingxin-zheng just tried to run it with only segresnet but the same issue persists; see the updated autorunner script below:

from monai.apps.auto3dseg import AutoRunner

if __name__ == '__main__':
    runner = AutoRunner(
        input={
            "name": "Task500_AMOS",
            "task": "segmentation",
            "modality": "CT",
            "datalist": "datalist.json",
            "dataroot": "data",
        },
        work_dir="work_dir/",
        analyze=True,
        algo_gen=True,
        algos=['segresnet'],
    )
    runner.set_training_params({"cache_rate": 0.1})
    runner.run()

I ran it using the same ALGO_HASH: ALGO_HASH=3f56d77 python autorunner.py
Please advise what you think the solution may be, thanks.
Hi @mingxin-zheng @dongyang0122 I've seen some activity on auto3dseg in the past couple of days; would any of the latest commits help with this issue? Please let me know if there's anything you would like me to try on my end.
Thanks
Hi @udiram , I am still not able to find the cause of your issue. The recent updates are minor, such as typo corrections.
To replicate your work, can you let me know if this dataset ("AMOS") is available somewhere? It would also help to know any other tools you've been using for profiling.
Hi @mingxin-zheng, thanks for the message, the AMOS dataset should be publicly available here https://zenodo.org/record/7155725#.Y0OOCOxBztM
I've been able to run this exact same dataset with nnunet and a standard MONAI U-Net, with no such issues. Please let me know if you need anything else,
Thanks
@udiram Did you create a specific datalist, or generate one randomly?
@mingxin-zheng I used the datalist generator from my PR
@mingxin-zheng just adding on to this, would it help for me to send you a copy of my datalist so that you have the exact files that I used?
That would be helpful @udiram . Can you post yours here? Thank you!
Hi @mingxin-zheng here's the data I'm using:
Hope this helps!
Hi @udiram , I am able to run the training for the datalist you provided. Here is my script:
from monai.apps.auto3dseg import AutoRunner
input_cfg = {
"name": "Task500_AMOS",
"task": "segmentation",
"modality": "CT",
"datalist": "datalist.json",
"dataroot": "/datasets/amos/amos22/",
}
runner = AutoRunner(input=input_cfg)
runner.set_training_params({"num_epochs": 5})
runner.run()
The training is still running and I'll report back the resources it has taken when it's done.
Thanks for all your efforts @mingxin-zheng. I will try the same script on my end and see if it helps. Additionally, are you still using the HASH when you call AutoRunner?
@mingxin-zheng as an update, I'm still running into the same issue, both with and without the ALGO_HASH, using the exact same datalist, dataroot and config as you. Feel free to let me know how many resources it takes when it's finished; maybe the 125 GB I have on my instance isn't sufficient for this amount of data.
On further investigation: I created a completely new dataset and datalist with roughly half the number of images. This works and trains for the moment, at about 75% of my 125 GB RAM. I do run into additional issues now:
RuntimeError: received 0 items of ancdata
this was preceded by:
no available indices of class 14 to crop, set the crop ratio of this class to zero. no available indices of class 15 to crop, set the crop ratio of this class to zero. no available indices of class 14 to crop, set the crop ratio of this class to zero.
Which is also followed by loss_torch_epoch = loss_torch[0] / loss_torch[1] ZeroDivisionError: float division by zero
where it ultimately crashes with the following log:
returned non-zero exit status 1.
It would be great to get some clarification on these errors and, more broadly, for AutoRunner to cache only as many images as fit in system RAM, instead of failing with the ambiguous SIGKILL we have been trying to solve.
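On the "received 0 items of ancdata" error specifically, it typically means the PyTorch DataLoader ran out of open file descriptors while passing tensors between worker processes. A hedged sketch of the usual PyTorch-side workarounds (general PyTorch advice, not something this thread confirms for Auto3DSeg):

```python
# Workaround 1: raise the soft open-file limit toward the hard limit.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
target = 4096  # illustrative; any value up to the hard limit is allowed
if hard != resource.RLIM_INFINITY:
    target = min(target, hard)
new_soft = max(soft, target)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
print("soft fd limit:", resource.getrlimit(resource.RLIMIT_NOFILE)[0])

# Workaround 2: switch PyTorch's tensor-sharing strategy so workers do not
# hold a descriptor per shared tensor (requires torch; shown as a comment):
# import torch.multiprocessing as mp
# mp.set_sharing_strategy("file_system")
```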
Hi @udiram , as an update, here are the training logs and the resource usage. I think the training is very demanding on resources. As a reference, I ran this on a single A100-80G with a 1 TB RAM system, monitoring resources every 30 s. auto_runner.log
hi @mingxin-zheng thanks for the update, I appreciate it; it makes sense that a lot more RAM would be required in this case.
It would be great to have some kind of user-facing warning, or a note in the tutorial, so that people expect a SIGKILL in these situations going forward.
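A minimal pre-flight check along those lines might look like this sketch; the function name and the 3x expansion factor are illustrative assumptions, not MONAI API:

```python
# Estimate the cached-dataset footprint and warn before training, instead of
# letting the OOM killer deliver an unexplained SIGKILL.
import os

def warn_if_cache_wont_fit(dataroot: str, cache_rate: float, expansion: float = 3.0):
    """Return (estimated_cache_bytes, available_bytes); print a warning if tight.

    `expansion` is a guessed factor for decompression plus pre-processing copies.
    """
    on_disk = sum(
        os.path.getsize(os.path.join(dirpath, f))
        for dirpath, _, files in os.walk(dataroot)
        for f in files
    )
    est = on_disk * cache_rate * expansion
    avail = os.sysconf("SC_AVPHYS_PAGES") * os.sysconf("SC_PAGE_SIZE")
    if est > avail:
        print(f"WARNING: cache may need ~{est / 2**30:.1f} GiB but only "
              f"{avail / 2**30:.1f} GiB is free; expect SIGKILL (OOM).")
    return est, avail
```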
Additionally, are you able to provide any context with the errors I outlined above? the ones happening during training?
I appreciate all the support!
Describe the bug
AutoRunner spikes CPU usage and ends with SIGKILL 9. PyTorch recognizes the GPU, and other scripts utilize the GPU.
To Reproduce
Steps to reproduce the behavior: run AutoRunner with the task config, as shown in the error log below.
Expected behavior
GPU is utilized and the run does not fail.
Environment (please complete the following information):
OS --> ubuntu 20.04 (also tried 22.04)
Python version --> 3.8 (also tried 3.10.6)
MONAI version --> 1.1.0+21.g4b464e7b
GPU models and configuration -->
exouser@auto3dseg:~/Documents$ nvidia-smi
Fri Jan 13 19:19:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100X-40C       On  | 00000000:04:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Error:
exouser@auto3dseg:~/Documents$ python -m monai.apps.auto3dseg AutoRunner run --input='./task.yaml'
2023-01-13 15:12:06,572 - INFO - AutoRunner using work directory ./work_dir
2023-01-13 15:12:06,574 - INFO - Loading input config ./task.yaml
2023-01-13 15:12:06,604 - INFO - The output_dir is not specified. /home/exouser/Documents/work_dir/ensemble_output will be used to save ensemble predictions
2023-01-13 15:12:06,604 - INFO - Skipping data analysis...
2023-01-13 15:12:06,604 - INFO - Skipping algorithm generation...
2023-01-13 15:12:06,616 - INFO - Launching: python /home/exouser/Documents/work_dir/dints_0/scripts/search.py run --config_file='/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_infer.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_validate.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_train.yaml','/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters.yaml'
[info] number of GPUs: 1
[info] world_size: 1
train_files_w: 64 train_files_a: 64 val_files: 33
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/__main__.py", line 24, in <module>
    fire.Fire(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/exouser/.local/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/auto_runner.py", line 685, in run
    self._train_algo_in_sequence(history)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/auto_runner.py", line 557, in _train_algo_in_sequence
    algo.train(self.train_params)
  File "/home/exouser/Documents/work_dir/algorithm_templates/dints/scripts/algo.py", line 398, in train
    self._run_cmd(cmd_search, devices_info)
  File "/home/exouser/Documents/MONAI/monai/apps/auto3dseg/bundle_gen.py", line 191, in _run_cmd
    normal_out = subprocess.run(cmd.split(), env=ps_environ, check=True)
  File "/usr/lib/python3.8/subprocess.py", line 516, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['python', '/home/exouser/Documents/work_dir/dints_0/scripts/search.py', 'run', "--config_file='/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_infer.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_validate.yaml','/home/exouser/Documents/work_dir/dints_0/configs/network_search.yaml','/home/exouser/Documents/work_dir/dints_0/configs/transforms_train.yaml','/home/exouser/Documents/work_dir/dints_0/configs/hyper_parameters.yaml'"]' died with <Signals.SIGKILL: 9>.
CPU/GPU traces from start till crash: