Beckschen / 3D-TransUNet

This is the official repository for the paper "3D TransUNet: Advancing Medical Image Segmentation through Vision Transformers".
Apache License 2.0

Can I train your model without distributed training? #13

Open zwb0 opened 9 months ago

zwb0 commented 9 months ago

Hi, I encountered problems with the distributed training. Can I train your model with a single GPU? Thanks a lot!

dimitar10 commented 7 months ago

Hello, yes, it is possible to run it on a single GPU. You need to edit train.sh to run a command similar to the following:

nnunet_use_progress_bar=1 CUDA_VISIBLE_DEVICES=0 \
        python3 -m torch.distributed.launch --master_port=4322 --nproc_per_node=1 \
        ./train.py --fold=${fold} --config=$CONFIG --resume='local_latest' --npz

Note the changes to CUDA_VISIBLE_DEVICES and --nproc_per_node compared to the default values. This will still use the default _DDP trainer (unless you have edited the network_trainer entry in your config file), but it will run on a single GPU. The proper way is to use a non-DDP trainer, but that requires some more modifications.
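To illustrate why the DDP code path still works with a single process: the launcher simply creates a process group with one member, so there is nothing to synchronize with. A standalone sketch (not the repository's trainer code; it uses the gloo backend so it runs even without a GPU, whereas the actual trainer presumably uses nccl):

import os
import torch.distributed as dist

# torch.distributed.launch normally sets MASTER_ADDR/MASTER_PORT (plus RANK,
# WORLD_SIZE, LOCAL_RANK) for each process; set the minimum here so this
# snippet runs standalone.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

# A "distributed" group with a single member: DDP-style code still runs,
# it just never has any peers to communicate with.
dist.init_process_group(backend="gloo", rank=0, world_size=1)
print("world size:", dist.get_world_size())  # -> 1
dist.destroy_process_group()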

Also, there might be a typo in train.py: you might need to change the --local-rank argument to --local_rank; at least, that was one of the issues in my case.

Hope this helps.

Heanhu commented 6 months ago

Hello, I changed the --local-rank argument to --local_rank, but it still reports an error:

usage: train.py [-h] [--network NETWORK] [--network_trainer NETWORK_TRAINER] [--task TASK] [--task_pretrained TASK_PRETRAINED] [--fold FOLD] [--model MODEL] [--disable_ds DISABLE_DS] [--resume RESUME] [-val] [-c] [-p P] [--use_compressed_data] [--deterministic] [--fp32] [--dbs] [--npz] [--valbest] [--vallatest] [--find_lr] [--val_folder VAL_FOLDER] [--disable_saving] [--disable_postprocessing_on_folds] [-pretrained_weights PRETRAINED_WEIGHTS] [--config FILE] [--batch_size BATCH_SIZE] [--max_num_epochs MAX_NUM_EPOCHS] [--initial_lr INITIAL_LR] [--min_lr MIN_LR] [--opt_eps EPSILON] [--opt_betas BETA [BETA ...]] [--weight_decay WEIGHT_DECAY] [--local_rank LOCAL_RANK] [--world-size WORLD_SIZE] [--rank RANK] [--total_batch_size TOTAL_BATCH_SIZE] [--hdfs_base HDFS_BASE] [--optim_name OPTIM_NAME] [--lrschedule LRSCHEDULE] [--warmup_epochs WARMUP_EPOCHS] [--val_final] [--is_ssl] [--is_spatial_aug_only] [--mask_ratio MASK_RATIO] [--loss_name LOSS_NAME] [--plan_update PLAN_UPDATE] [--crop_size CROP_SIZE [CROP_SIZE ...]] [--reclip RECLIP [RECLIP ...]] [--pretrained] [--disable_decoder] [--model_params MODEL_PARAMS] [--layer_decay LAYER_DECAY] [--drop_path PCT] [--find_zero_weight_decay] [--n_class N_CLASS] [--deep_supervision_scales DEEP_SUPERVISION_SCALES [DEEP_SUPERVISION_SCALES ...]] [--fix_ds_net_numpool] [--skip_grad_nan] [--merge_femur] [--is_sigmoid] [--max_loss_cal MAX_LOSS_CAL]
train.py: error: unrecognized arguments: --local-rank=0

Could you help me? Thank you.
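The usage message above still lists --local_rank while the launcher is injecting --local-rank=0, which suggests either that the edit did not take effect or that this PyTorch version's torch.distributed.launch passes the hyphenated spelling (recent versions do). Assuming that is the cause, one workaround is to register both spellings for the same destination; a minimal sketch (the real train.py defines many more options, omitted here):

import argparse

parser = argparse.ArgumentParser()

# Accept both spellings so the script works whether the launcher passes
# --local_rank=0 or --local-rank=0; the value is stored under args.local_rank.
parser.add_argument("--local-rank", "--local_rank", dest="local_rank",
                    type=int, default=0)

args, unknown = parser.parse_known_args()
print("local rank:", args.local_rank)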

2DangFilthy commented 5 months ago

Hello, I'm facing the same problem. @Heanhu, have you solved it?

dimitar10 commented 5 months ago

@Heanhu @2DangFilthy

The change to --local_rank in train.py (https://github.com/Beckschen/3D-TransUNet/blob/190fe40735b2a5f688264db8bc0c93d19b0b98ec/train.py#L109) that I suggested is apparently not necessary: according to argparse's docs, internal hyphens in argument names are automatically converted to underscores. Perhaps try deleting any __pycache__ dirs you might have; sometimes these cause issues. If you are running the train.sh script, it should work.
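For reference, the conversion the argparse documentation describes applies to the attribute name the parsed value is stored under; a standalone example (not tied to train.py):

import argparse

parser = argparse.ArgumentParser()
# A long option defined with a hyphen is stored under an underscored
# attribute name (dest "local_rank").
parser.add_argument("--local-rank", type=int, default=0)

args = parser.parse_args(["--local-rank=3"])
print(args.local_rank)  # -> 3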

mariem-m11 commented 4 weeks ago

I'm having the same issue! Any suggestions on how to train without distributed training besides these? Thank you.