Ashbajawed opened this issue 1 month ago
Did you solve this? I'm facing the same problem.
Unfortunately, no.
I reduced --preprocessing_num_workers from 100 down to 4 and --gradient_accumulation_steps from 16 down to 1. With that it just barely runs, but the training results are poor because the effective batch size ends up too small. I'm using 4x A100 80G.
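For context, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps × number of GPUs, so cutting accumulation from 16 to 1 shrinks it drastically (for example, with a per-device batch of 2 on 4 GPUs: 2 × 16 × 4 = 128 down to 2 × 1 × 4 = 8), which matches the observation that training quality drops. A minimal sketch of the workaround flags, with everything else in the launch command left unchanged:

```sh
# Workaround described above (sketch only; all other flags unchanged):
# fewer parallel workers for dataset preprocessing, and no gradient
# accumulation, at the cost of a much smaller effective batch.
  --preprocessing_num_workers 4 \
  --gradient_accumulation_steps 1 \
```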
I have been using 4 GPUs (on ml.p4d.24xlarge / ml.p4de.24xlarge instances) on AWS, but I am still getting the error:
torchrun \
    --nnode 1 \
    --nproc_per_node 4 \
    --node_rank 0 \
    --master_addr "localhost" \
    --master_port 12345 \
    speechgpt/src/train/ma_pretrain.py \
    --bf16 True \
    --block_size 1024 \
    --model_name_or_path "${METAROOT}" \
    --train_file ${DATAROOT}/train.txt \
    --validation_file ${DATAROOT}/dev.txt \
    --do_train \
    --do_eval \
    --output_dir "${OUTROOT}" \
    --preprocessing_num_workers 100 \
    --overwrite_output_dir \
    --per_device_eval_batch_size 2 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 16 \
    --num_train_epochs 3 \
    --log_level debug \
    --logging_steps 1 \
    --save_steps 300 \
    --cache_dir ${CACHEROOT} \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
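If the failure here is a CUDA out-of-memory error, two options that usually reduce memory pressure without shrinking the effective batch are gradient checkpointing and FSDP CPU offload. This is a sketch, not a verified fix: it assumes ma_pretrain.py forwards standard HuggingFace TrainingArguments (the flags above suggest it does), so these options would be recognized. Keep the rest of the launch command unchanged and adjust/add the flags below; with per-device batch 1 and accumulation 32 on 4 GPUs the effective batch stays at 128.

```sh
# Sketch only, assuming standard HuggingFace TrainingArguments are accepted:
#   - per_device_train_batch_size 1 plus higher accumulation keeps the effective batch at 1 * 32 * 4 = 128
#   - gradient_checkpointing recomputes activations in the backward pass, cutting activation memory
#   - the extra "offload" FSDP option moves sharded parameters/optimizer state to CPU
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 32 \
  --gradient_checkpointing True \
  --fsdp "full_shard auto_wrap offload" \
```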