zwb0 opened this issue 11 months ago
Hello, yes it is possible to run it on a single GPU. You need to edit train.sh
to run a command similar to the following:
nnunet_use_progress_bar=1 CUDA_VISIBLE_DEVICES=0 \
python3 -m torch.distributed.launch --master_port=4322 --nproc_per_node=1 \
./train.py --fold=${fold} --config=$CONFIG --resume='local_latest' --npz
Note the changes in CUDA_VISIBLE_DEVICES and --nproc_per_node compared to the default values. This will still use the default _DDP trainer if you haven't edited the network_trainer entry in your config file, but it will run on a single GPU. The proper way is to use a non-DDP trainer, but that requires some more modifications.
Also, there might be a typo in train.py: you might need to change the --local-rank argument to --local_rank, at least that was one of the issues in my case.
Hope this helps.
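For reference, here is a minimal sketch of an argument definition that tolerates both spellings. This is not the repo's actual code, and the LOCAL_RANK environment-variable fallback is my own assumption (torchrun-style launchers export that variable instead of passing a flag):

```python
import argparse
import os

parser = argparse.ArgumentParser()
# Register both spellings so the script works whether the launcher passes
# --local_rank (older torch.distributed.launch) or --local-rank (newer
# PyTorch versions); both end up in args.local_rank.
parser.add_argument(
    "--local_rank", "--local-rank",
    dest="local_rank",
    type=int,
    # Assumed fallback: torchrun-style launchers export LOCAL_RANK.
    default=int(os.environ.get("LOCAL_RANK", 0)),
)
args, unknown = parser.parse_known_args()
print("local rank:", args.local_rank)
```

With something like this, the script should parse whichever form the launcher passes.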
Hello, I changed the --local-rank argument to --local_rank, but it still reports this error:
usage: train.py [-h] [--network NETWORK] [--network_trainer NETWORK_TRAINER] [--task TASK] [--task_pretrained TASK_PRETRAINED] [--fold FOLD]
[--model MODEL] [--disable_ds DISABLE_DS] [--resume RESUME] [-val] [-c] [-p P] [--use_compressed_data] [--deterministic]
[--fp32] [--dbs] [--npz] [--valbest] [--vallatest] [--find_lr] [--val_folder VAL_FOLDER] [--disable_saving]
[--disable_postprocessing_on_folds] [-pretrained_weights PRETRAINED_WEIGHTS] [--config FILE] [--batch_size BATCH_SIZE]
[--max_num_epochs MAX_NUM_EPOCHS] [--initial_lr INITIAL_LR] [--min_lr MIN_LR] [--opt_eps EPSILON] [--opt_betas BETA [BETA ...]]
[--weight_decay WEIGHT_DECAY] [--local_rank LOCAL_RANK] [--world-size WORLD_SIZE] [--rank RANK]
[--total_batch_size TOTAL_BATCH_SIZE] [--hdfs_base HDFS_BASE] [--optim_name OPTIM_NAME] [--lrschedule LRSCHEDULE]
[--warmup_epochs WARMUP_EPOCHS] [--val_final] [--is_ssl] [--is_spatial_aug_only] [--mask_ratio MASK_RATIO]
[--loss_name LOSS_NAME] [--plan_update PLAN_UPDATE] [--crop_size CROP_SIZE [CROP_SIZE ...]] [--reclip RECLIP [RECLIP ...]]
[--pretrained] [--disable_decoder] [--model_params MODEL_PARAMS] [--layer_decay LAYER_DECAY] [--drop_path PCT]
[--find_zero_weight_decay] [--n_class N_CLASS]
[--deep_supervision_scales DEEP_SUPERVISION_SCALES [DEEP_SUPERVISION_SCALES ...]] [--fix_ds_net_numpool] [--skip_grad_nan]
[--merge_femur] [--is_sigmoid] [--max_loss_cal MAX_LOSS_CAL]
train.py: error: unrecognized arguments: --local-rank=0
Could you help me?
Thank you.
Hello, I'm facing the same problem. Have you solved it?
@Heanhu @2DangFilthy
The change from --local-rank to --local_rank in train.py (https://github.com/Beckschen/3D-TransUNet/blob/190fe40735b2a5f688264db8bc0c93d19b0b98ec/train.py#L109) that I suggested is apparently not necessary; according to argparse's docs, internal hyphens in argument names are automatically converted to underscores. Perhaps try deleting any __pycache__ directories you might have, as these can sometimes cause issues. If you are running the train.sh script, it should work.
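If in doubt about how a given Python version handles the two spellings, a tiny standalone check (independent of this repo; the option name is only for illustration) can settle it:

```python
import argparse

# Define the option with an underscore, as in train.py, then try parsing
# the hyphenated form that the launcher passes.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)

for argv in (["--local_rank=0"], ["--local-rank=0"]):
    try:
        print(argv, "->", parser.parse_args(argv))
    except SystemExit:
        # argparse prints a usage message and calls sys.exit() on
        # unrecognized arguments.
        print(argv, "-> unrecognized")
```

Running it with the same interpreter used for training shows directly whether the argument definition or something else (e.g. stale caches) is the culprit.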
I'm having the same issue! Any suggestions on how to train without distributed training, besides these? Thank you.
Hi, I encountered problems with the distributed training. Can I train your model with a single GPU? Thanks a lot!