Closed shenlou11 closed 1 month ago
When training reaches epoch=2, the memory of both GPUs is 40G+.
40G memory usage is normal, because each GPU and each worker consumes memory.
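As a hedged illustration of the reply above: each torchrun rank is a separate process with its own CUDA context and model copy, and each of its dataloader workers is another process, so per-card usage grows with the number of processes. A minimal sketch (assuming nvidia-smi is available inside the container) to see which processes hold memory on each card:
# List every process that currently holds GPU memory, grouped by GPU UUID.
# With 2 DDP ranks and num_workers=2 per rank, more entries show up per card
# than in the single-GPU run (exact counts depend on whether workers touch CUDA).
nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name,used_memory --format=csv
# Map each UUID back to a GPU index and show total usage per card.
nvidia-smi --query-gpu=index,uuid,memory.used,memory.total --format=csv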
Why, in the same batchsize and same num_workers situation, does 1 GPU use 20G+ of memory, while with 2 GPUs each card uses 40G+?
Normally it should be that with 2 GPUs, each card's memory is also 20G+, right?
Describe the bug
During single-machine multi-GPU training, GPU memory usage changes abnormally. Training is run in 2 docker containers, bound to 1 GPU and 2 GPUs respectively.
batchsize=2, single-machine training with 1 GPU: GPU memory is 26483 MiB.
batchsize=2, single-machine training with 2 GPUs: GPU 0 memory is 38335 MB, GPU 1 memory is 45733 MB.
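For reference, a minimal sketch of how numbers like these can be captured while a run is in progress (assuming nvidia-smi is available in each container; the log file name is arbitrary):
# Sample per-GPU memory every 10 seconds and keep a log, so the 1-GPU and
# 2-GPU runs can be compared (e.g. 26483 MiB vs 38335 MB / 45733 MB above).
nvidia-smi --query-gpu=timestamp,index,memory.used --format=csv -l 10 | tee gpu_mem.log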
The default settings are used: cosyvoice.fromscratch.yaml, batch_type: 'static', batchsize=2.
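As a quick sanity check (a sketch only; the exact key names and the config path are assumptions based on the train.sh below, e.g. batch_size vs. batchsize), the effective batching settings can be confirmed with:
grep -nE "batch_type|batch_size" examples/libritts-test/cosyvoice/conf/cosyvoice.fromscratch.yaml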
train.sh:
stage=5
stop_stage=5
pretrained_model_dir=./pretrained_models/CosyVoice-300M

export CUDA_VISIBLE_DEVICES="2"
num_gpus=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
num_gpus=2
job_id=1986
dist_backend="nccl"
num_workers=2
prefetch=100
train_engine=torch_ddp
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
  echo "Run train. We only support llm traning for now. If your want to train from scratch, please use conf/cosyvoice.fromscratch.yaml"
  if [ $train_engine == 'deepspeed' ]; then
    echo "Notice deepspeed has its own optimizer config. Modify conf/ds_stage2.json if necessary"
  fi
  cat examples/libritts/cosyvoice/data/{train-clean-100,train-clean-360,train-other-500}/parquet/data.list > examples/libritts/cosyvoice/data/train.data.list
  cat examples/libritts/cosyvoice/data/{dev-clean,dev-other}/parquet/data.list > examples/libritts/cosyvoice/data/dev.data.list
  for model in flow; do
    torchrun --nnodes=1 --nproc_per_node=$num_gpus \
      --rdzv_id=$job_id --rdzv_backend="c10d" --rdzv_endpoint="localhost:0" \
      train.py \
      --train_engine $train_engine \
      --config examples/libritts-test/cosyvoice/conf/cosyvoice.fromscratch.yaml \
      --train_data examples/libritts-test/cosyvoice/data/train.data.list \
      --cv_data examples/libritts-test/cosyvoice/data/dev.data.list \
      --model $model \
      --model_dir `pwd`/exp/cosyvoice/$model/$train_engine \
      --tensorboard_dir `pwd`/tensorboard/cosyvoice/$model/$train_engine \
      --ddp.dist_backend $dist_backend \
      --num_workers ${num_workers} \
      --prefetch ${prefetch} \
      --pin_memory \
      --deepspeed_config examples/libritts/cosyvoice/conf/ds_stage2.json \
      --deepspeed.save_states model+optimizer
  done
fi

Why does memory get much larger when training on multiple GPUs?
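A side note on the script above: num_gpus is first derived by counting the comma-separated entries in CUDA_VISIBLE_DEVICES and is then hard-coded to 2, so torchrun always launches two ranks regardless of how many devices are actually visible. A minimal sketch of that derivation (the "2,3" value is only an assumption for the 2-GPU container):
# With a single visible device the awk count is 1 ...
export CUDA_VISIBLE_DEVICES="2"
echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}'   # prints 1
# ... and with two visible devices it is 2.
export CUDA_VISIBLE_DEVICES="2,3"
echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}'   # prints 2
# The later num_gpus=2 assignment overrides this count, so it should match
# the number of devices the container actually exposes.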