Closed Jx-Tan closed 3 years ago
Hi @55998
I tried to reproduce this issue but was not successful. What is your PyTorch version?
persistent_workers is a standard argument of the native PyTorch DataLoader and is used here. The idea is to not shut down the workers after one epoch of the dataset has been consumed. This should improve performance, but it is not a critical component. I would suggest removing it for the moment until we can pin down the issue.
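If you would rather keep the speed-up on newer PyTorch installs instead of removing the argument entirely, a minimal sketch of a version-guarded call (this uses the plain torch DataLoader and a toy dataset purely for illustration; the repository's MONAI DataLoader forwards the same keyword via **kwargs):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# persistent_workers was only added to torch's DataLoader in PyTorch 1.7,
# so pass it only when the installed version supports it.
major, minor = (int(v) for v in torch.__version__.split(".")[:2])
extra_kwargs = {"persistent_workers": True} if (major, minor) >= (1, 7) else {}

dataset = TensorDataset(torch.randn(8, 3))  # toy dataset, stands in for the real one
loader = DataLoader(dataset, batch_size=1, num_workers=2, **extra_kwargs)
```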
Thanks
Hi @ahatamiz
Thank you very much. My PyTorch version is 1.6.0. After removing this argument I am able to continue training.
But I encountered another bug. I hope you can help me solve it.
Describe the bug: RuntimeError: Unsupported data type for NCCL process group, raised in ..../torch/distributed/distributed_c10d.py, line 1185, in all_gather: work = _default_pg.allgather([tensor_list], [tensor])
To Reproduce: Steps to reproduce the behavior:
1. Go to UNETR/BTCV
2. Install monai==0.7.0 nibabel==3.1.1 tqdm==4.59.0 einops==0.3.2 tensorboardx==2.1
3. Run the command: python main.py --batch_size=1 --logdir=unetr_pretrained --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --pretrained_dir='./pretrained_models/' --pretrained_model_name='UNETR_model_best_acc.pth' --resume_ckpt --distributed

Expected behavior: Start the training correctly.
Screenshots: console output (image omitted).
Related code: trainer.py line 85; utils.py lines 56 and 65 (code screenshots omitted).
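For context, the traceback boils down to a standard all_gather call along the lines of the sketch below (illustrative only, not the repository's exact utility; it assumes the process group has already been initialized with the NCCL backend). NCCL only accepts a fixed set of numeric tensor dtypes, and one common trigger on older PyTorch versions is gathering a torch.bool tensor, which fails with exactly this "Unsupported data type for NCCL process group" error:

```python
import torch
import torch.distributed as dist

# Rough sketch of the failing pattern (not the repository's exact code).
# Assumes dist.init_process_group(backend="nccl", ...) has already been called.
def gather_flag(flag: bool, world_size: int):
    tensor = torch.tensor(flag, dtype=torch.bool, device="cuda")
    # On PyTorch 1.6 this dtype is rejected by the NCCL backend; casting to a
    # supported numeric dtype (e.g. uint8/float) or upgrading PyTorch avoids it.
    tensor_list = [torch.zeros_like(tensor) for _ in range(world_size)]
    dist.all_gather(tensor_list, tensor)
    return tensor_list
```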
Environment (please complete the following information):
- OS: Ubuntu 16.04
- Python version: 3.6
- MONAI version: 0.7.0
- CUDA/cuDNN version: 10.2
- GPU models and configuration: 2x NVIDIA GeForce RTX 2080 Ti (11 GB)
- Additional context: PyTorch version is 1.6.0
Hi @55998
I was able to reproduce this issue.
This issue (and the first one) are both caused by the PyTorch version. I recommend using torch==1.9.1 to avoid both of these issues. I will update the requirements.txt to reflect this.
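After reinstalling, a quick sanity check before re-launching training (nothing repository-specific):

```python
import torch

# The thread above recommends torch==1.9.1; anything older than 1.7 will still
# hit the persistent_workers error, and older builds the NCCL all_gather one.
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", torch.distributed.is_nccl_available())
```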
Thanks
Hi @ahatamiz
Thank you for your reply. Now I can use this code correctly.
But I ran into a new issue. I use two NVIDIA GeForce RTX 2080 Ti GPUs and torch==1.9.1, the data is the same as in the readme, and the training parameters are --feature_size=32 --batch_size=1 --logdir=unetr_test --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --distributed. I set max_epochs to 3000, but when the training finishes I get: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0. What could be the problem?
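For anyone else seeing the same "Best Accuracy: 0.0" printout: in a typical trainer the best value only changes when a rank-0 comparison against the validation accuracy succeeds, roughly as in the sketch below (an illustrative pattern, not the repository's actual trainer). If that comparison never runs, or never receives the gathered accuracy, the reported best stays at its initial 0.0 even though validation itself looks fine:

```python
import random

# Illustrative best-accuracy bookkeeping (not the repository's actual trainer).
def validate(epoch):          # stand-in for the real distributed validation step
    return random.random()    # placeholder validation accuracy

best_acc, rank, max_epochs = 0.0, 0, 3000
for epoch in range(max_epochs):
    val_acc = validate(epoch)
    # If this branch never executes (or val_acc never reaches rank 0),
    # best_acc keeps its initial 0.0, matching the printout above.
    if rank == 0 and val_acc > best_acc:
        best_acc = val_acc    # a real trainer would also save a checkpoint here
print("Training Finished! Best Accuracy:", best_acc)
```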
Hi @ahatamiz
I used the best_model checkpoint to test and got an average Dice (AVG_DICE) of 0.79. Without the pre-trained model, I trained for 3000 epochs and the final validation accuracy was 0.80. These numbers seem a little low. What could be the reason? The dataset I use is the abdominal CT raw data from the BTCV challenge.
> Hi @ahatamiz
> Thank you for your reply. Now I can use this code correctly.
> But I ran into a new issue. I use two NVIDIA GeForce RTX 2080 Ti GPUs and torch==1.9.1, the data is the same as in the readme, and the training parameters are --feature_size=32 --batch_size=1 --logdir=unetr_test --optim_lr=2e-4 --lrschedule=warmup_cosine --infer_overlap=0.5 --save_checkpoint --data_dir=/dataset/dataset0/ --distributed. I set max_epochs to 3000, but when the training finishes I get: Final training 2999/2999 loss: 0.5940, Final validation 2999/2999 acc 0.8024047, Training Finished! Best Accuracy: 0.0. What could be the problem?
How did you solve this problem?
Hi @55998
I was able to reproduce this issue.
Based on this, I have submitted a new pull request (#18) that addresses this issue. Please re-try once it's merged.
Thanks
Describe the bug: I use the UNETR/BTCV code for multi-organ segmentation, but in the DataLoader I get this error: TypeError: __init__() got an unexpected keyword argument 'persistent_workers'. It occurs in .../monai/data/dataloader.py, line 87, in __init__, **kwargs
To Reproduce: Steps to reproduce the behavior (see the minimal reproduction sketch below):
Expected behavior: Start the training correctly.
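For completeness, the error is easy to reproduce outside the repository: on PyTorch 1.6 the underlying torch DataLoader simply has no persistent_workers argument, so any wrapper that forwards it (as MONAI's DataLoader does via **kwargs) fails. A minimal illustration, not the repository code:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(4, 3))
# On PyTorch < 1.7 the next line raises:
#   TypeError: __init__() got an unexpected keyword argument 'persistent_workers'
loader = DataLoader(dataset, batch_size=1, num_workers=2, persistent_workers=True)
```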