Closed: wyli closed this issue 1 year ago
The same test might have a memory issue as well:
current epoch: 1 current mean dice: 0.3736 best mean dice: 0.3736 at epoch 1
----------
epoch 2/2
learning rate is set to 0.0125
Traceback (most recent call last):
File "/tmp/tmpox_opph9/workdir/dints_0/scripts/search.py", line 647, in <module>
fire.Fire()
File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/tmp/tmpox_opph9/workdir/dints_0/scripts/search.py", line 331, in run
scaler.step(optimizer)
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
retval = optimizer.step(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
out = func(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
ret = func(self, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 76, in step
sgd(params_with_grad,
File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 222, in sgd
func(params,
File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 291, in _multi_tensor_sgd
device_grads = torch._foreach_add(device_grads, device_params, alpha=weight_decay)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 2.02 GiB already allocated; 15.00 MiB free; 2.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 42535 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 42534) of binary: /opt/conda/bin/python
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/tmp/tmpox_opph9/workdir/dints_0/scripts/search.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-03-16_05:18:53
host : 957ff6064347
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 42534)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
EFinished test: test_ensemble (tests.test_auto3dseg_ensemble.TestEnsembleBuilder) (58.7s)
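As a hedged aside (not part of the original discussion): the out-of-memory message itself suggests tuning max_split_size_mb. One way to try that is to set the caching-allocator environment variable before any CUDA memory is allocated, for example:

```python
# Illustrative only: set the CUDA caching-allocator option mentioned in the
# error message before any CUDA allocation happens. The value 128 is an
# arbitrary example, not a recommendation from this thread.
import os

os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch  # imported after setting the variable so the allocator picks it up

x = torch.zeros(1, device="cuda" if torch.cuda.is_available() else "cpu")
```

Exporting the same variable in the shell before launching torchrun would have the same effect.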
Need to discuss with @dongyang0122.
I am inclined to update the algorithm templates. What do you think, @dongyang0122?
Which version of PyTorch should we use for development to pass the unit test?
We are using PyTorch 1.8 - 1.13. I think the PyTorch 1.8-1.12 tests are optional during the PR.
Is it viable to detect the PyTorch version in the dints Auto3DSeg templates and fall back to the previous scheduler, if that does not cause too much of a performance decrease? I'm also okay with skipping the tests for now, but it would mean PyTorch 1.12 users will not be able to use Auto3DSeg.
I may need to rewrite the scheduler logic manually to avoid the version issue.
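A hedged sketch of the kind of fallback being discussed (illustrative only, not the actual template code): use torch.optim.lr_scheduler.PolynomialLR when the installed torch provides it (1.13+), and otherwise approximate the same closed-form polynomial decay with LambdaLR, which exists in the older releases.

```python
# Sketch only: pick PolynomialLR when available, otherwise emulate its
# closed-form decay lr = base_lr * (1 - step / total_iters) ** power.
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR


def make_poly_lr_scheduler(optimizer, total_iters, power=1.0):
    if hasattr(torch.optim.lr_scheduler, "PolynomialLR"):  # torch >= 1.13
        return torch.optim.lr_scheduler.PolynomialLR(
            optimizer, total_iters=total_iters, power=power
        )
    # fallback for older torch releases
    return LambdaLR(
        optimizer,
        lr_lambda=lambda step: (1.0 - min(step, total_iters) / total_iters) ** power,
    )


# usage example
model = torch.nn.Linear(4, 2)
optimizer = SGD(model.parameters(), lr=0.025)
scheduler = make_poly_lr_scheduler(optimizer, total_iters=100, power=0.9)
for _ in range(5):
    optimizer.step()
    scheduler.step()
```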
OK, let's keep this ticket open; I'll add a workaround to skip the tests for torch <=1.12 to unblock the release candidate.
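For reference, a hedged sketch of what such a skip could look like in a unittest-based test; the decorator, helper, and version check here are illustrative, not the actual MONAI change.

```python
# Illustrative sketch: skip a test when the installed torch predates PolynomialLR.
import unittest

import torch


def _torch_at_least(major, minor):
    # parse version strings like "1.13.0" or "1.12.1+cu113" conservatively
    parts = torch.__version__.split("+")[0].split(".")
    return (int(parts[0]), int(parts[1])) >= (major, minor)


@unittest.skipUnless(_torch_at_least(1, 13), "PolynomialLR requires torch >= 1.13")
class TestEnsembleBuilder(unittest.TestCase):
    def test_ensemble(self):
        self.assertTrue(True)  # placeholder for the real ensemble test


if __name__ == "__main__":
    unittest.main()
```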
The related tutorials (https://github.com/Project-MONAI/tutorials) are also broken.
Hi @dongyang0122 @wyli, I partially followed @dongyang0122's suggestion to fix this issue in research-contributions: https://github.com/Project-MONAI/research-contributions/pull/204
The main issue there was that the scheduler class script is copied from PyTorch, so I included the full PyTorch license in the script for legal compliance.
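For illustration, a minimal vendored polynomial scheduler compatible with older PyTorch could look roughly like the sketch below; the class name and defaults are assumptions, and the actual PR copies PyTorch's own implementation (hence the license note above).

```python
# Hedged sketch of a self-contained polynomial LR scheduler for torch < 1.13.
# _LRScheduler is the (private but widely used) base class in older releases.
from torch.optim.lr_scheduler import _LRScheduler


class PolynomialLRCompat(_LRScheduler):
    """Decays each base learning rate by (1 - step / total_iters) ** power."""

    def __init__(self, optimizer, total_iters=5, power=1.0, last_epoch=-1):
        self.total_iters = total_iters
        self.power = power
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        decay = (1.0 - min(self.last_epoch, self.total_iters) / self.total_iters) ** self.power
        return [base_lr * decay for base_lr in self.base_lrs]
```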
PolynomialLR is new in torch 1.13, which breaks tests with earlier versions of PyTorch.
The issue was introduced by https://github.com/Project-MONAI/MONAI/commit/af46d7b60b71c2f8577d291c08a0f26cbe4552b0. @mingxin-zheng @dongyang0122 could you please have a look soon? Either skip the test or update the algo.