veer5551 opened this issue 2 years ago (Open)
Hi @ppwwyyxx, any chance you could look into this? It is currently preventing me from training models.
Thanks!
Hello, I'm just curious whether the LossEvalHook works on multiple GPUs. In my runs it hangs after calculating the validation loss.
I used the fix from @zensenlon. It works: the best checkpoint is saved after evaluation and training resumes, but the threads initialized for the evaluation and the best checkpointer are never released and stay locked.
I used the fix mentioned in that post, but I can no longer see my validation loss in TensorBoard. Does your implementation log the validation loss correctly?
@ShreyasSkandanS, did you manage to make it work in the end? I mean the issue of the validation loss disappearing from TensorBoard.
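For context, the pattern these comments are discussing usually looks like the sketch below. This is an assumed structure, not the exact lossEvalHook.py from this issue: the validation loss is computed only on the main process, written to the trainer's EventStorage so the TensorBoard writer picks it up, and a barrier is placed afterwards so the remaining ranks do not hang.

```python
# Minimal sketch (assumed structure): evaluate on the main process, log the
# scalar through EventStorage, then synchronize every rank.
import detectron2.utils.comm as comm
from detectron2.engine import HookBase


class LossEvalHook(HookBase):
    def __init__(self, eval_period, loss_eval_fn):
        self._period = eval_period
        self._loss_eval_fn = loss_eval_fn  # hypothetical callable returning the mean validation loss

    def after_step(self):
        next_iter = self.trainer.iter + 1
        if self._period > 0 and next_iter % self._period == 0:
            if comm.is_main_process():
                mean_loss = self._loss_eval_fn()
                # smoothing_hint=False writes the raw value, so it shows up in TensorBoard as-is
                self.trainer.storage.put_scalar("validation_loss", mean_loss, smoothing_hint=False)
            # every rank must reach this barrier, otherwise the other GPUs block forever
            comm.synchronize()
```

One caveat: if the model itself performs cross-GPU collectives during the forward pass (e.g. SyncBatchNorm), running the evaluation only on rank 0 can deadlock on its own; in that case every rank should compute the loss and the values should be averaged across ranks, for example with the helpers in detectron2.utils.comm.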
Instructions To Reproduce the Issue: (Multi-GPU training with validation and best checkpointer hook)
Code:
1.a lossEvalHook.py
1.b myTrainer.py
What exact command you run: python mytrainer.py
Full logs or other relevant observations:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/data/detectron2/aaa/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
  File "/data/detectron2/aaa/lib/python3.6/site-packages/detectron2/engine/launch.py", line 126, in _distributed_worker
  File "/data/detectron2/detectron2_training_scripts/multi-gpu_training_detectron2.py", line 253, in main
    return trainer.train()
  File "/data/detectron2/aaa/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 484, in train
  File "/data/detectron2/aaa/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 150, in train
  File "/data/detectron2/aaa/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 180, in after_step
  File "/data/detectron2/detectron2_training_scripts/lossEvalHook.py", line 70, in after_step
  File "/data/detectron2/detectron2_training_scripts/lossEvalHook.py", line 28, in _do_loss_eval
  File "/data/detectron2/aaa/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
  File "/data/detectron2/aaa/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
  File "/data/detectron2/aaa/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
  File "/data/detectron2/aaa/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 1023, in _try_get_data
RuntimeError: Too many open files. Communication with the workers is no longer possible. Please increase the limit using `ulimit -n` in the shell or change the sharing strategy by calling `torch.multiprocessing.set_sharing_strategy('file_system')` at the beginning of your code:
import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')
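For reference, a minimal sketch of the second workaround the error message suggests, with the call placed at module level near the top of the training script (assumed placement); the alternatives are raising the descriptor limit with `ulimit -n` before launching, or lowering cfg.DATALOADER.NUM_WORKERS so fewer worker processes hold file descriptors at once.

```python
# Workaround suggested by the PyTorch error message: switch the tensor sharing
# strategy from file descriptors to the file system so DataLoader workers do
# not exhaust the per-process open-file limit. Put this before any DataLoader
# or launch() call in the training script.
import torch.multiprocessing

torch.multiprocessing.set_sharing_strategy("file_system")
```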
sys.platform linux
Python 3.6.9 (default, Jan 26 2021, 15:33:00) [GCC 8.4.0]
numpy 1.19.5
detectron2 0.6 @/data/detectron2/aaa/lib/python3.6/site-packages/detectron2
Compiler GCC 7.3
CUDA compiler CUDA 11.1
detectron2 arch flags 3.7, 5.0, 5.2, 6.0, 6.1, 7.0, 7.5, 8.0, 8.6
DETECTRON2_ENV_MODULE
PyTorch 1.9.0+cu111 @/data/detectron2/aaa/lib/python3.6/site-packages/torch
PyTorch debug build False
GPU available Yes
GPU 0,1,2,3,4,5,6,7 A100-PCIE-40GB (arch=8.0)
Driver version 460.91.03
CUDA_HOME /usr/local/cuda
Pillow 8.4.0
torchvision 0.10.0+cu111 @/data/detectron2/aaa/lib/python3.6/site-packages/torchvision
torchvision arch flags 3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore 0.1.5.post20220119
iopath 0.1.9
cv2 4.5.5
PyTorch built with:
Testing NCCL connectivity ... this should not hang. NCCL succeeded.
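For anyone who wants to run such a check standalone, here is a hypothetical minimal sketch (not from this thread): it spawns one process per GPU and runs a single all_reduce, which should finish almost instantly; if it hangs, the problem is the NCCL/peer-to-peer setup rather than the hook.

```python
# Hypothetical NCCL connectivity check: one process per GPU, a single all_reduce.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def _worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    t = torch.ones(1, device="cuda")
    dist.all_reduce(t)  # should complete immediately; hanging here indicates an NCCL problem
    if rank == 0:
        print("NCCL succeeded, sum =", t.item())
    dist.destroy_process_group()


if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(_worker, args=(n,), nprocs=n)
```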