Project-MONAI / model-zoo

MONAI Model Zoo that hosts models in the MONAI Bundle format.

torchrun results in RuntimeError: Failed to evaluate ConfigExpression: "$__local_refs['network_def'].to(__local_refs['device'])" #545

Closed idinsmore1 closed 5 months ago

idinsmore1 commented 5 months ago

Describe the bug
When trying to use the wholeBody_ct_segmentation bundle for multi-GPU distributed training, the CacheDataset loads properly, but before the training epochs begin I get the error RuntimeError: Failed to evaluate ConfigExpression: "$__local_refs['network_def'].to(__local_refs['device'])".

To Reproduce
Steps to reproduce the behavior:

  1. Install MONAI and the required dependencies.
  2. Run CUDA_VISIBLE_DEVICES="0,1,2,3,4" torchrun --standalone --nnodes=1 --nproc_per_node=5 -m monai.bundle run --dataset_dir ../totalsegmentator_dataset_monai --config_file "['configs/train.json', 'configs/multi_gpu_train.json']"
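
As a quick sanity check independent of the bundle configs, a minimal script like the sketch below (a hypothetical check_dist.py, launched with the same torchrun flags as above) can confirm whether torch.distributed itself initializes and communicates across the five ranks.

import os

import torch
import torch.distributed as dist

# hypothetical helper, not part of the bundle; run with:
# torchrun --standalone --nnodes=1 --nproc_per_node=5 check_dist.py
def main():
    # torchrun exports LOCAL_RANK / RANK / WORLD_SIZE for every worker process
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # a tiny all_reduce verifies NCCL communication between the ranks
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()} ok, sum={t.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()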

Expected behavior
Each GPU should start training; instead, no training loop ever begins.

Environment

MONAI version: 1.3.0
Numpy version: 1.26.3
Pytorch version: 2.1.0.post300
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: 865972f7a791bf7b42efbcd87c8402bd865b329e
MONAI file: /home//mambaforge/envs/monai/lib/python3.9/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.13
ITK version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 5.2.0
scikit-image version: NOT INSTALLED or UNKNOWN VERSION.
scipy version: NOT INSTALLED or UNKNOWN VERSION.
Pillow version: 10.2.0
Tensorboard version: 2.15.1
gdown version: NOT INSTALLED or UNKNOWN VERSION.
TorchVision version: NOT INSTALLED or UNKNOWN VERSION.
tqdm version: 4.66.1
lmdb version: NOT INSTALLED or UNKNOWN VERSION.
psutil version: 5.9.7
pandas version: 2.1.4
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.
pynrrd version: NOT INSTALLED or UNKNOWN VERSION.
clearml version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit: https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

System: Linux
Linux version: Ubuntu 20.04.6 LTS
Platform: Linux-5.4.0-146-generic-x86_64-with-glibc2.31
Processor: x86_64
Machine: x86_64
Python version: 3.9.18
Process name: python
Command: ['python', '-c', 'import monai; monai.config.print_debug_info()']
Open files: []
Num physical CPUs: 48
Num logical CPUs: 96
Num usable CPUs: 96
CPU usage (%): [100.0, 8.3, 3.5, 3.5, 7.6, 2.1, 3.5, 0.7, 0.0, 0.7, 0.7, 1.4, 0.0, 0.7, 0.7, 2.1, 0.0, 0.0, 0.7, 0.7, 0.7, 0.7, 30.6, 10.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 8.3, 0.0, 1.4, 1.4, 15.4, 0.0, 0.0, 0.0, 100.0, 100.0, 0.0, 0.0, 100.0, 0.7, 1.4, 4.9, 6.2, 0.7, 0.7, 0.0, 25.0, 0.7, 1.4, 0.7, 0.7, 0.7, 0.0, 1.4, 0.0, 0.0, 1.4, 1.4, 2.8, 0.0, 2.1, 2.8, 1.4, 0.7, 1.4, 17.2, 0.0, 0.7, 0.0, 0.0, 0.0, 0.7, 100.0, 0.0, 15.3, 77.1, 23.6, 4.2, 0.7, 2.1, 0.7, 0.0, 0.0, 0.7, 0.0, 0.0]
CPU freq. (MHz): 3057
Load avg. in last 1, 5, 15 mins (%): [9.6, 13.7, 15.0]
Disk usage (%): 38.1
Avg. sensor temp. (Celsius): UNKNOWN for given OS
Total physical memory (GB): 1510.6
Available memory (GB): 1361.0
Used memory (GB): 121.8

Num GPUs: 16
Has CUDA: True
CUDA version: 11.2
cuDNN enabled: True
NVIDIA_TF32_OVERRIDE: None
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE: None
cuDNN version: 8800
Current device: 0
Library compiled for CUDA architectures: ['sm_35', 'sm_50', 'sm_60', 'sm_61', 'sm_70', 'sm_75', 'sm_80', 'sm_86', 'compute_86']
GPUs 0-15 (all identical): Name: Tesla V100-SXM3-32GB, Is integrated: False, Is multi GPU board: False, Multi processor count: 80, Total memory (GB): 31.7, CUDA capability (maj.min): 7.0

idinsmore1 commented 5 months ago

It's also worth noting that non-distributed training works perfectly out of the box in this environment.

idinsmore1 commented 5 months ago

Further testing using Docker also ends with the following error: RuntimeError: Failed to evaluate ConfigExpression: "$__local_refs['train::trainer'].run()"

KumoLiu commented 5 months ago

Hi @idinsmore1, I can't reproduce the error. You can try debugging each component like this to see whether network_def is parsed correctly.

from monai.bundle import ConfigParser

parser = ConfigParser()
parser.read_config(f=["/workspace/Code/model-zoo/models/wholeBody_ct_segmentation/configs/train.json"])
# parse the structured config content
parser.parse()
# instantiate the network component and print the network structure
net = parser.get_parsed_content("network_def")
print(net)
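
If the single-GPU config resolves cleanly, the same check can be extended to the multi-GPU overrides. Below is a sketch, assuming the DDP-wrapped model is exposed under an id such as "network" in multi_gpu_train.json (check the actual key in your bundle) and that the script is launched with torchrun so a process group can be created.

import os

import torch
import torch.distributed as dist

from monai.bundle import ConfigParser

# a process group must exist before a DistributedDataParallel-wrapped
# network can be instantiated from the config
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

parser = ConfigParser()
parser.read_config(
    f=[
        "configs/train.json",
        "configs/multi_gpu_train.json",  # overrides applied on top of train.json
    ]
)
parser.parse()
net = parser.get_parsed_content("network")  # assumed id for the wrapped model
print(type(net))

dist.destroy_process_group()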

Hope it helps, thanks.

idinsmore1 commented 5 months ago

Hi @KumoLiu, thanks for the quick response. For whatever reason, my slightly customized environment just does not seem to want to work properly for multi-GPU training. I solved the issue by exactly recreating the environment specified in the metadata.json file; everything seems to be working now. Thanks!
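
For anyone hitting a similar mismatch: the versions the bundle was built against can be read straight out of its metadata.json. A minimal sketch follows; field names such as "pytorch_version" and "optional_packages_version" are the usual bundle metadata keys, but verify them against the downloaded bundle.

import json

# path assumed relative to the bundle root; adjust as needed
with open("configs/metadata.json") as f:
    meta = json.load(f)

# core framework versions the bundle was developed against
for key in ("monai_version", "pytorch_version", "numpy_version"):
    print(f"{key}: {meta.get(key)}")

# pinned optional dependencies, if the bundle lists any
for pkg, ver in meta.get("optional_packages_version", {}).items():
    print(f"{pkg}: {ver}")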