aehrc / cxrmate

CXRMate: Longitudinal Data and a Semantic Similarity Reward for Chest X-Ray Report Generation
https://huggingface.co/aehrc/cxrmate
Apache License 2.0

Training consultation #8

Closed · yihp closed this issue 4 months ago

yihp commented 4 months ago

When I train with Self-Critical Sequence Training (SCST) with the CXR-BERT reward, I set `devices: 2`, `mbatch_size: 16`, and `num_workers: 32`, but encountered the following error:

```
(venv) [root@3dc54336e478 home]# dlhpcstarter -t mimic_cxr -c config/train/longitudinal_gen_prompt_cxr-bert.yaml --stages_module tools.stages --train
Seed set to 0
PTL no. devices: 2.
PTL no. nodes: 1.
/usr/local/lib/python3.8/site-packages/lightning/fabric/connector.py:571: precision=16 is supported for historical reasons but its usage is discouraged. Please set your precision to 16-mixed instead!
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
Description, Special token, Index
bos_token, [BOS], 1
eos_token, [EOS], 2
unk_token, [UNK], 0
sep_token, [SEP], 3
pad_token, [PAD], 4
cls_token, [BOS], 1
mask_token, [MASK], 5
additional_special_token, [NF], 6
additional_special_token, [NI], 7
additional_special_token, [PMT], 8
additional_special_token, [PMT-SEP], 9
additional_special_token, [NPF], 10
additional_special_token, [NPI], 11
/home/modules/transformers/longitudinal_model/modelling_longitudinal.py:155: UserWarning: The encoder-to-decoder model was not warm-started before applying low-rank approximation.
  warnings.warn('The encoder-to-decoder model was not warm-started before applying low-rank approximation.')
trainable params: 147,456 || all params: 80,916,528 || trainable%: 0.1822
/usr/local/lib/python3.8/site-packages/transformers/models/convnext/feature_extraction_convnext.py:28: FutureWarning: The class ConvNextFeatureExtractor is deprecated and will be removed in version 5 of Transformers. Please use ConvNextImageProcessor instead.
  warnings.warn(
Warm-starting using: /home/experiments/cxrmate/longitudinal_gt_prompt_tf/trial_0/epoch=19-step=78380-val_report_chexbert_f1_macro=0.371041.ckpt.
/usr/local/lib/python3.8/site-packages/dlhpcstarter/utils.py:347: UserWarning: The "last" checkpoint does not exist, starting training from epoch 0.
  warnings.warn('The "last" checkpoint does not exist, starting training from epoch 0.')
You are using a CUDA device ('Z100L') that has Tensor Cores. To properly utilize them, you should set torch.set_float32_matmul_precision('medium' | 'high') which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
[rank: 0] Seed set to 0
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
[rank: 1] Seed set to 0
PTL no. devices: 2.
PTL no. nodes: 1.
[rank 1 then prints the same special-token table, warm-starting message, and warnings as rank 0 above]
[rank: 1] Seed set to 0
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0711 11:46:15.886372 31375 ProcessGroupNCCL.cpp:686] [Rank 1] ProcessGroupNCCL initialization options: NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=226348544
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0711 11:46:15.892076 31223 ProcessGroupNCCL.cpp:686] [Rank 0] ProcessGroupNCCL initialization options: NCCL_ASYNC_ERROR_HANDLING: 1, NCCL_DESYNC_DEBUG: 0, NCCL_ENABLE_TIMING: 0, NCCL_BLOCKING_WAIT: 0, TIMEOUT(ms): 1800000, USE_HIGH_PRIORITY_STREAM: 0, TORCH_DISTRIBUTED_DEBUG: OFF, NCCL_DEBUG: OFF, ID=229697888

distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

I0711 11:46:16.570466 31223 ProcessGroupNCCL.cpp:1340] NCCL_DEBUG: N/A
/usr/local/lib/python3.8/site-packages/lightning/pytorch/callbacks/model_checkpoint.py:652: Checkpoint directory /home/experiments/mimic_cxr/longitudinal_gen_prompt_cxr-bert/trial_0 exists and is not empty.
/home/data/prompt.py:186: UserWarning: The number of examples is not divisible by the world size. Adding extra studies to account for this. This needs to be accounted for outside of the dataset.
  warnings.warn('The number of examples is not divisible by the world size. '
Traceback (most recent call last):
  File "/usr/local/bin/dlhpcstarter", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 126, in main
    submit(args, cmd_line_args, stages_fnc)
  File "/usr/local/lib/python3.8/site-packages/dlhpcstarter/main.py", line 21, in submit
    stages_fnc(args)
  File "/home/tools/stages.py", line 85, in stages
    trainer.fit(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 543, in fit
    call._call_and_handle_interrupt(
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 579, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/trainer.py", line 948, in _run
    call._call_setup_hook(self)  # allow user to set up LightningModule in accelerator environment
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 96, in _call_setup_hook
    _call_lightning_module_hook(trainer, "setup", stage=fn)
  File "/usr/local/lib/python3.8/site-packages/lightning/pytorch/trainer/call.py", line 159, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/modules/lightning_modules/longitudinal/scst/gen_prompt.py", line 66, in setup
    self.train_set = PreviousReportSubset(
  File "/home/data/prompt.py", line 73, in __init__
    self.allocate_subjects_to_rank(shuffle_subjects=False)
  File "/home/data/prompt.py", line 212, in allocate_subjects_to_rank
    assert len(set(self.examples)) == self.df.study_id.nunique() and \
AssertionError
I0711 11:46:24.351401 31223 ProcessGroupNCCL.cpp:874] [Rank 0] Destroyed 1 communicators on CUDA device 0
[rank 1 raises the same UserWarning, Traceback, and AssertionError]
I0711 11:46:25.112917 31375 ProcessGroupNCCL.cpp:874] [Rank 1] Destroyed 1 communicators on CUDA device 1
```

I want to ask how you set these parameters during training. I saw that your paper used 4×16GB NVIDIA Tesla P100 GPUs, whereas I am using 2×32GB NVIDIA V100 GPUs. With `devices: 1` and `mbatch_size: 1` training runs without error, but it is too slow. I look forward to your answer, thank you very much!
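For reference, a minimal sketch of the relevant fields from my configuration, assuming `config/train/longitudinal_gen_prompt_cxr-bert.yaml` exposes these keys at the top level (only `devices`, `mbatch_size`, and `num_workers` are values I actually set; everything else in the file is unchanged):

```yaml
# Sketch of the settings that reproduce the AssertionError above
# (assumed layout of config/train/longitudinal_gen_prompt_cxr-bert.yaml).
devices: 2        # GPUs used for distributed training
mbatch_size: 16   # mini-batch size per device; this combination fails in allocate_subjects_to_rank
num_workers: 32   # DataLoader workers
```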

yihp commented 4 months ago

I set `devices: 2` and `mbatch_size: 2`, and it works.
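For anyone hitting the same AssertionError, a sketch of the values that worked for me, under the same assumption about the YAML layout as above (the effective-batch comment is an assumption about how Lightning DDP splits the batch across devices):

```yaml
# Working settings (assumed layout of config/train/longitudinal_gen_prompt_cxr-bert.yaml).
devices: 2       # two V100 GPUs under DDP
mbatch_size: 2   # per-device mini-batch; 2 devices x 2 = effective batch of 4 per step
num_workers: 32  # unchanged from the earlier attempt (assumption)
```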