Closed Jx-Tan closed 3 years ago
Hi @55998
I tried to reproduce this issue but was not successful. What is your PyTorch version?
persistent_workers is a standard argument of the native PyTorch DataLoader and is used here. The idea is to not shut down the workers after one epoch of the dataset has been consumed. This should improve performance, but it is not a critical component. I would suggest removing it for the moment until we can pin down the issue.
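If you would rather keep the speed-up on newer PyTorch installs instead of removing the argument entirely, a minimal sketch of a version-guarded call (this uses the plain torch DataLoader and a toy dataset purely for illustration; the repository's MONAI DataLoader forwards the same keyword via **kwargs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# persistent_workers was only added to torch's DataLoader in PyTorch 1.7,
# so pass it only when the installed version supports it.
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
extra_kwargs = {"persistent_workers": True} if (major, minor) >= (1, 7) else {}

dataset = TensorDataset(torch.randn(8, 3))  # toy dataset, stands in for the real one
loader = DataLoader(dataset, batch_size=1, num_workers=2, **extra_kwargs)
```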
Thanks
Hi @ahatamiz
Thank you very much. My PyTorch version is 1.6.0. After removing this argument I am able to continue training.
But I encountered another bug. I hope you can help me solve it.
Describe the bug: RuntimeError: Unsupported data type for NCCL process group, raised in ..../torch/distributed/distributed_c10d.py, line 1185, in all_gather: work = _default_pg.allgather([tensor_list], [tensor])
To Reproduce: Steps to reproduce the behavior:
1. Go to UNETR/BTCV
2. Install monai==0.7.0 nibabel==3.1.1 tqdm==4.59.0 einops==0.3.2 tensorboardx==2.1
3. Run the command: python main.py --batch_size=1 --logdir=unetr_pretrained --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --pretrained_dir='./pretrained_models/' --pretrained_model_name='UNETR_model_best_acc.pth' --resume_ckpt --distributed

Expected behavior: Start the training correctly.
Screenshots: console output (image omitted).
Related code: trainer.py line 85; utils.py lines 56 and 65 (code screenshots omitted).
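For context, the traceback boils down to a standard all_gather call along the lines of the sketch below (illustrative only, not the repository's exact utility; it assumes the process group has already been initialized with the NCCL backend). NCCL only accepts a fixed set of numeric tensor dtypes, and one common trigger on older PyTorch versions is gathering a torch.bool tensor, which fails with exactly this "Unsupported data type for NCCL process group" error:

```python
import torch
import torch.distributed as dist

# Rough sketch of the failing pattern (not the repository's exact code).
# Assumes dist.init_process_group(backend="nccl", ...) has already been called.
def gather_flag(flag: bool, world_size: int):
    tensor = torch.tensor(flag, dtype=torch.bool, device="cuda")
    # On PyTorch 1.6 this dtype is rejected by the NCCL backend; casting to a
    # supported numeric dtype (e.g. uint8/float) or upgrading PyTorch avoids it.
    tensor_list = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(tensor_list, tensor)
    return tensor_list
```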
Environment (please complete the following information):
- OS: Ubuntu 16.04
- Python version: 3.6
- MONAI version: 0.7.0
- CUDA/cuDNN version: 10.2
- GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (11 GB)
- Additional context: PyTorch version is 1.6.0
Hi @55998
I was able to reproduce this issue.
This issue (and the first one) are both caused by the PyTorch version. I recommend using torch==1.9.1 to avoid both of these issues. I will update the requirements.txt to reflect this.
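After reinstalling, a quick sanity check before re-launching training (nothing repository-specific):

```python
import torch

# The thread above recommends torch==1.9.1; anything older than 1.7 will still
# hit the persistent_workers error, and older builds the NCCL all_gather one.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", torch.distributed.is_nccl_available())
```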
Thanks
Hi @ahatamiz
Thank you for your reply. Now I can use this code correctly.
But I ran into a new issue. I use two NVIDIA GeForce RTX 2080 Ti GPUs and torch==1.9.1, the data is the same as in the readme, and the training parameters are --feature_size=32 --batch_size=1 --logdir=unetr_test --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --distributed. I set max_epochs to 3000, but when the training finishes I get: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0. What could be the problem?
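For anyone else seeing the same "Best Accuracy: 0.0" printout: in a typical trainer the best value only changes when a rank-0 comparison against the validation accuracy succeeds, roughly as in the sketch below (an illustrative pattern, not the repository's actual trainer). If that comparison never runs, or never receives the gathered accuracy, the reported best stays at its initial 0.0 even though validation itself looks fine:

```python
import random

# Illustrative best-accuracy bookkeeping (not the repository's actual trainer).
def validate(epoch):          # stand-in for the real distributed validation step
    return random.random()    # placeholder validation accuracy

best_acc, rank, max_epochs = 0.0, 0, 3000
for epoch in range(max_epochs):
    val_acc = validate(epoch)
    # If this branch never executes (or val_acc never reaches rank 0),
    # best_acc keeps its initial 0.0, matching the printout above.
    if rank == 0 and val_acc > best_acc:
        best_acc = val_acc    # a real trainer would also save a checkpoint here
print("Training Finished! Best Accuracy:", best_acc)
```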
Hi @ahatamiz
I used the best_model checkpoint to test and got an average Dice (AVG_DICE) of 0.79. Without the pre-trained model, I trained for 3000 epochs and the final validation accuracy was 0.80. These numbers seem a little low. What could be the reason? The dataset I use is the abdominal CT raw data from the BTCV challenge.
> Hi @ahatamiz
> Thank you for your reply. Now I can use this code correctly.
> But I ran into a new issue. I use two NVIDIA GeForce RTX 2080 Ti GPUs and torch==1.9.1, the data is the same as in the readme, and the training parameters are --feature_size=32 --batch_size=1 --logdir=unetr_test --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --distributed. I set max_epochs to 3000, but when the training finishes I get: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0. What could be the problem?
How did you solve this problem?
Hi @55998
I was able to reproduce this issue.
Based on this, I have submitted a new pull request (#18) that addresses this issue. Please re-try once it's merged.
Thanks
Describe the bug: I use the UNETR/BTCV code for multi-organ segmentation, but in the DataLoader I get this error: TypeError: __init__() got an unexpected keyword argument 'persistent_workers'. It occurs in .../monai/data/dataloader.py, line 87, in __init__, **kwargs
To Reproduce: Steps to reproduce the behavior (see the minimal reproduction sketch below):
Expected behavior: Start the training correctly.
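For completeness, the error is easy to reproduce outside the repository: on PyTorch 1.6 the underlying torch DataLoader simply has no persistent_workers argument, so any wrapper that forwards it (as MONAI's DataLoader does via **kwargs) fails. A minimal illustration, not the repository code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4, 3))
# On PyTorch < 1.7 the next line raises:
#   TypeError: __init__() got an unexpected keyword argument 'persistent_workers'
loader = DataLoader(dataset, batch_size=1, num_workers=2, persistent_workers=True)
```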