Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

test_auto3dseg_ensemble #6155

Closed · wyli closed this issue 1 year ago

wyli commented 1 year ago

PolynomialLR is new in torch 1.13, which breaks the tests with earlier versions of PyTorch.
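For reference, a minimal repro sketch of the failing instantiation (optimizer and scheduler settings taken from the config dump in the log below, the toy model is mine); on torch <= 1.12 the last line raises AttributeError, which MONAI's bundle instantiation surfaces as the ModuleNotFoundError shown:

import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.2, momentum=0.9, weight_decay=4e-05)
# Only available in torch >= 1.13; earlier versions have no PolynomialLR attribute.
scheduler = torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=3, power=0.5)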

[2023-03-16T00:35:16.877Z] current epoch: 2 current mean dice: 0.4376 best mean dice: 0.4376 at epoch 2
[2023-03-16T00:35:16.877Z] train completed, best_metric: 0.4376 at epoch: 2
[2023-03-16T00:35:16.877Z] 2023-03-16 00:35:14,819 - INFO - The keys num_warmup_epochs cannot be found in the /tmp/tmpn5i1ty3o/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key num_warmup_epochs.
[2023-03-16T00:35:16.877Z] 2023-03-16 00:35:14,819 - INFO - The keys use_pretrain cannot be found in the /tmp/tmpn5i1ty3o/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key use_pretrain.
[2023-03-16T00:35:16.877Z] 2023-03-16 00:35:14,819 - INFO - The keys pretrained_path cannot be found in the /tmp/tmpn5i1ty3o/workdir/dints_0/configs/hyper_parameters.yaml for training. Skipped overriding key pretrained_path.
[2023-03-16T00:35:16.877Z] 2023-03-16 00:35:14,820 - INFO - Launching: python /tmp/tmpn5i1ty3o/workdir/dints_0/scripts/train.py run --config_file='/tmp/tmpn5i1ty3o/workdir/dints_0/configs/transforms_validate.yaml','/tmp/tmpn5i1ty3o/workdir/dints_0/configs/transforms_train.yaml','/tmp/tmpn5i1ty3o/workdir/dints_0/configs/network.yaml','/tmp/tmpn5i1ty3o/workdir/dints_0/configs/hyper_parameters.yaml','/tmp/tmpn5i1ty3o/workdir/dints_0/configs/transforms_infer.yaml','/tmp/tmpn5i1ty3o/workdir/dints_0/configs/hyper_parameters_search.yaml','/tmp/tmpn5i1ty3o/workdir/dints_0/configs/network_search.yaml' --training#num_images_per_batch=2 --training#num_epochs=2 --training#num_epochs_per_validation=1 --training#determ=True
[2023-03-16T00:35:24.956Z] monai.transforms.io.dictionary LoadImaged.__init__:image_only: Current default value of argument `image_only=False` has been deprecated since version 1.1. It will be changed to `image_only=True` in version 1.3.
[2023-03-16T00:35:24.958Z] [info] number of GPUs: 1
[2023-03-16T00:35:24.958Z] [info] world_size: 1
[2023-03-16T00:35:24.958Z] train_files: 8
[2023-03-16T00:35:24.958Z] val_files: 4
[2023-03-16T00:35:24.958Z] 2023-03-16 00:35:19.387838 - Length of input patch is recommended to be a multiple of 32.
[2023-03-16T00:35:24.958Z] num_epochs 2
[2023-03-16T00:35:24.958Z] num_epochs_per_validation 1
[2023-03-16T00:35:24.958Z] Traceback (most recent call last):
[2023-03-16T00:35:24.958Z]   File "/home/jenkins/agent/workspace/Monai-pytorch-versions/monai/bundle/config_item.py", line 292, in instantiate
[2023-03-16T00:35:24.958Z]     return instantiate(modname, mode, **args)
[2023-03-16T00:35:24.958Z]   File "/home/jenkins/agent/workspace/Monai-pytorch-versions/monai/utils/module.py", line 246, in instantiate
[2023-03-16T00:35:24.958Z]     raise ModuleNotFoundError(f"Cannot locate class or function path: '{__path}'.")
[2023-03-16T00:35:24.958Z] ModuleNotFoundError: Cannot locate class or function path: 'torch.optim.lr_scheduler.PolynomialLR'.
[2023-03-16T00:35:24.958Z] 
[2023-03-16T00:35:24.958Z] The above exception was the direct cause of the following exception:
[2023-03-16T00:35:24.958Z] 
[2023-03-16T00:35:24.958Z] Traceback (most recent call last):
[2023-03-16T00:35:24.958Z]   File "/tmp/tmpn5i1ty3o/workdir/dints_0/scripts/train.py", line 479, in <module>
[2023-03-16T00:35:24.958Z]     fire.Fire()
[2023-03-16T00:35:24.958Z]   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
[2023-03-16T00:35:24.958Z]     component_trace = _Fire(component, args, parsed_flag_args, context, name)
[2023-03-16T00:35:24.958Z]   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
[2023-03-16T00:35:24.958Z]     component, remaining_args = _CallAndUpdateTrace(
[2023-03-16T00:35:24.958Z]   File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
[2023-03-16T00:35:24.958Z]     component = fn(*varargs, **kwargs)
[2023-03-16T00:35:24.958Z]   File "/tmp/tmpn5i1ty3o/workdir/dints_0/scripts/train.py", line 214, in run
[2023-03-16T00:35:24.958Z]     lr_scheduler = lr_scheduler_part.instantiate(optimizer=optimizer)
[2023-03-16T00:35:24.958Z]   File "/home/jenkins/agent/workspace/Monai-pytorch-versions/monai/bundle/config_item.py", line 294, in instantiate
[2023-03-16T00:35:24.958Z]     raise RuntimeError(f"Failed to instantiate {self}.") from e
[2023-03-16T00:35:24.958Z] RuntimeError: Failed to instantiate {'_target_': 'torch.optim.lr_scheduler.PolynomialLR', 'optimizer': {'_target_': 'torch.optim.SGD', 'lr': 0.2, 'momentum': 0.9, 'weight_decay': 4e-05}, 'power': 0.5, 'total_iters': 3}.
[2023-03-16T00:35:25.212Z] 
[2023-03-16T00:35:25.212Z] EFinished test: test_ensemble (tests.test_auto3dseg_ensemble.TestEnsembleBuilder) (76.7s)

Issue introduced by https://github.com/Project-MONAI/MONAI/commit/af46d7b60b71c2f8577d291c08a0f26cbe4552b0. @mingxin-zheng @dongyang0122 could you please have a look soon? We should either skip the test or update the algo.

wyli commented 1 year ago

The same test might have a memory issue as well:

current epoch: 1 current mean dice: 0.3736 best mean dice: 0.3736 at epoch 1
----------
epoch 2/2
learning rate is set to 0.0125
Traceback (most recent call last):
  File "/tmp/tmpox_opph9/workdir/dints_0/scripts/search.py", line 647, in <module>
    fire.Fire()
  File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/opt/conda/lib/python3.8/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/tmp/tmpox_opph9/workdir/dints_0/scripts/search.py", line 331, in run
    scaler.step(optimizer)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 370, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/cuda/amp/grad_scaler.py", line 290, in _maybe_opt_step
    retval = optimizer.step(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 280, in wrapper
    out = func(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/optimizer.py", line 33, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 76, in step
    sgd(params_with_grad,
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 222, in sgd
    func(params,
  File "/opt/conda/lib/python3.8/site-packages/torch/optim/sgd.py", line 291, in _multi_tensor_sgd
    device_grads = torch._foreach_add(device_grads, device_params, alpha=weight_decay)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 2.02 GiB already allocated; 15.00 MiB free; 2.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 42535 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 42534) of binary: /opt/conda/bin/python
Traceback (most recent call last):
  File "/opt/conda/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/tmp/tmpox_opph9/workdir/dints_0/scripts/search.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-16_05:18:53
  host      : 957ff6064347
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 42534)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

EFinished test: test_ensemble (tests.test_auto3dseg_ensemble.TestEnsembleBuilder) (58.7s)
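As an aside, the OOM message itself suggests trying max_split_size_mb against fragmentation; whether that helps here depends on how much memory the search step really needs, so treat the following only as a sketch. The setting must be in place before the first CUDA allocation, e.g. exported in the environment that launches torchrun, or set at the very top of the script:

import os

# Allocator hint quoted in the OOM message above; it mitigates fragmentation only and
# does not fix genuinely insufficient GPU memory. Must run before any CUDA tensor is allocated.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")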
mingxin-zheng commented 1 year ago

Need to discuss with @dongyang0122 .

I am inclined to update the algorithm templates. What do you think, @dongyang0122?

dongyang0122 commented 1 year ago

Which version of PyTorch should we develop against to pass the unit test?

mingxin-zheng commented 1 year ago

We are testing against PyTorch 1.8–1.13. I think the PyTorch 1.8–1.12 tests are optional during PR checks.

mingxin-zheng commented 1 year ago

Is it viable to detect the PyTorch version in the dints Auto3DSeg templates and fall back to the previous scheduler, if that does not cause too much of a performance decrease? I'm also okay with skipping the tests for now, but that would mean PyTorch 1.12 users will not be able to use Auto3DSeg.
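A minimal sketch of that idea (the helper name and the LambdaLR fallback are illustrative, not from the actual templates): detect whether this torch build provides PolynomialLR and otherwise emulate its closed-form decay with LambdaLR, which exists in all supported versions.

import torch
from torch.optim.lr_scheduler import LambdaLR


def make_poly_lr(optimizer, total_iters, power):
    # Hypothetical helper: use the native scheduler when present (torch >= 1.13) ...
    if hasattr(torch.optim.lr_scheduler, "PolynomialLR"):
        return torch.optim.lr_scheduler.PolynomialLR(optimizer, total_iters=total_iters, power=power)
    # ... otherwise reproduce its closed-form decay factor, lr = base_lr * (1 - t/T) ** power.
    return LambdaLR(optimizer, lambda e: (1.0 - min(e, total_iters) / total_iters) ** power)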

dongyang0122 commented 1 year ago

I may need to re-write the scheduler logic manually to avoid the version issue.
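If the manual-rewrite route is taken, a version-independent polynomial scheduler can also be written directly against the _LRScheduler base class; a sketch under a made-up class name, not the template's actual code:

from torch.optim.lr_scheduler import _LRScheduler


class PolyDecayLR(_LRScheduler):
    # Polynomial LR decay that does not depend on torch.optim.lr_scheduler.PolynomialLR.
    def __init__(self, optimizer, total_iters, power=1.0, last_epoch=-1):
        self.total_iters = total_iters
        self.power = power
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        # Same closed-form factor as PolynomialLR: lr = base_lr * (1 - t/T) ** power, clipped at t = T.
        factor = (1.0 - min(self.last_epoch, self.total_iters) / self.total_iters) ** self.power
        return [base_lr * factor for base_lr in self.base_lrs]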

wyli commented 1 year ago

OK, let's keep this ticket open and I'll make a workaround to skip the tests for torch <= 1.12 to unblock the release candidate.
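A minimal sketch of that workaround with plain unittest (the test class name comes from the log above; MONAI's test utilities also provide version-skip decorators, but the stdlib form keeps the example self-contained):

import unittest

import torch


@unittest.skipUnless(
    hasattr(torch.optim.lr_scheduler, "PolynomialLR"),
    "Auto3DSeg dints template requires PolynomialLR (torch >= 1.13)",
)
class TestEnsembleBuilder(unittest.TestCase):
    ...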

wyli commented 1 year ago

The related tutorials (https://github.com/Project-MONAI/tutorials) are also broken.

mingxin-zheng commented 1 year ago

Hi @dongyang0122 @wyli, I partially followed @dongyang0122's suggestion to fix this issue in research-contributions: https://github.com/Project-MONAI/research-contributions/pull/204

The main issue there was that the class in the script is taken from PyTorch, so I included the full PyTorch license in the script for legal reasons.