THUDM / ChatGLM2-6B

ChatGLM2-6B: An Open Bilingual Chat LLM | 开源双语对话语言模型

[BUG/Help] <Fine-tuning bug> #463

Open maozixi1 opened 1 year ago

maozixi1 commented 1 year ago

### Is there an existing issue for this?

### Current Behavior

Deployment went smoothly, but fine-tuning fails with the following error:

```
Traceback (most recent call last):
  File "/glm2/code/ChatGLM2-6B-main/ptuning/main.py", line 411, in <module>
    main()
  File "/glm2/code/ChatGLM2-6B-main/ptuning/main.py", line 98, in main
    raw_datasets = load_dataset(
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/datasets/builder.py", line 810, in download_and_prepare
    raise OSError(
OSError: Not enough disk space. Needed: Unknown size (download: Unknown size, generated: Unknown size, post-processed: Unknown size)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 28466) of binary: /root/miniconda3/envs/glm2/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/glm2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/glm2/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

main.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2023-08-16_06:17:56
  host       : 8e4d2e193950
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 28466)
  error_file:
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
```

### Expected Behavior

_No response_

### Steps To Reproduce

```shell
PRE_SEQ_LEN=128
LR=2e-2
NUM_GPUS=1

torchrun --standalone --nnodes=1 --nproc_per_node=$NUM_GPUS main.py \
    --do_train \
    --train_file AdvertiseGen/train.json \
    --validation_file AdvertiseGen/dev.json \
    --preprocessing_num_workers 10 \
    --prompt_column content \
    --response_column summary \
    --overwrite_cache \
    --model_name_or_path /glm2/code/ChatGLM2-6B-main/model1 \
    --output_dir output/adgen-chatglm2-6b-pt-$PRE_SEQ_LEN-$LR \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 128 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --predict_with_generate \
    --max_steps 3000 \
    --logging_steps 10 \
    --save_steps 1000 \
    --learning_rate $LR \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 4
```

Run with `bash train.sh`.

### Environment

```markdown
- OS: Ubuntu
- Python: 3.10.10
- Transformers: 4.27.1
- PyTorch: 2.0.1
- CUDA Support (`python -c "import torch; print(torch.cuda.is_available())"`): true
```

### Anything else?

_No response_
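For anyone hitting the same `OSError`: before building a dataset, the `datasets` library checks free space at the cache directory (via `shutil.disk_usage` in the versions I have looked at) and aborts if it looks insufficient; on some container or overlay filesystems the reported free space can be wrong. A minimal sketch to inspect what that check sees — the `cache_dir` path below is only an example, substitute your actual `HF_DATASETS_CACHE` location:

```python
import os
import shutil

# Example path; the real default cache lives under ~/.cache/huggingface/datasets
# unless HF_DATASETS_CACHE is set. Fall back to "/" if the directory is absent.
cache_dir = os.environ.get("HF_DATASETS_CACHE", os.path.expanduser("~/.cache/huggingface/datasets"))
if not os.path.isdir(cache_dir):
    cache_dir = "/"

# This is the same kind of measurement the disk-space check relies on.
usage = shutil.disk_usage(cache_dir)
print(f"checked: {cache_dir}")
print(f"total: {usage.total / 1e9:.1f} GB, free: {usage.free / 1e9:.1f} GB")
```

If the reported free space is near zero despite the disk having room (common inside Docker with overlayfs), pointing `HF_DATASETS_CACHE` at a normally mounted volume is one way to work around the check.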
ZCQ0628 commented 1 year ago

+1, same here. Has anyone solved this?