We-IOT / chatglm3_6b_finetune

LoRA-based fine-tuning of the chatglm3-6b model
GNU General Public License v3.0

Help: LoRA fine-tuning gets stuck at the "Running training" step, progress stays at 0% #5

Closed xixia03 closed 8 months ago

xixia03 commented 8 months ago

(fine-tuning) root@cg:/home/jovyan/work/chatglm/ChatGLM3-main/finetune_demo# python finetune_hf.py data/ /home/jovyan/work/chatglm/chatglm3-6b configs/lora.yaml
[4pdvGPU Warn(38231:140083313671040:libvgpu.c:124)]: recursive dlsym : ompt_start_tool

/usr/local/anaconda3/envs/fine-tuning/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
/usr/local/anaconda3/envs/fine-tuning/lib/python3.11/site-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Using the WANDB_DISABLED environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
Setting eos_token is not supported, use the default one.
Setting pad_token is not supported, use the default one.
Setting unk_token is not supported, use the default one.
Loading checkpoint shards: 100%|██████████| 7/7 [00:09<00:00, 1.34s/it]
CUDA extension not installed.
CUDA extension not installed.
trainable params: 1,949,696 || all params: 6,245,533,696 || trainable%: 0.031217444255383614
--> Model

--> model has 1.949696M params

train_dataset: Dataset({
    features: ['input_ids', 'labels'],
    num_rows: 84
})
val_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 60
})
test_dataset: Dataset({
    features: ['input_ids', 'output_ids'],
    num_rows: 60
})
max_steps is given, it will override any value given in num_train_epochs
Running training
  Num examples = 84
  Num Epochs = 6
  Instantaneous batch size per device = 1
  Total train batch size (w. parallel, distributed & accumulation) = 1
  Gradient Accumulation steps = 1
  Total optimization steps = 500
  Number of trainable parameters = 1,949,696
  0%|          | 0/500 [00:00<?, ?it/s]

I was following the official tutorial and it gets stuck at this step. What could be causing it?
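
For context, the "max_steps is given, it will override any value given in num_train_epochs" line in the log is standard Hugging Face Trainer behaviour: once max_steps is positive, training runs for exactly that many optimizer steps and num_train_epochs is ignored. A minimal sketch with TrainingArguments, where the values mirror the log above and whether finetune_hf.py uses this exact class is an assumption:

```python
# Minimal sketch: max_steps takes precedence over num_train_epochs
# (values mirror the training log above; output_dir is hypothetical).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=1,   # "Instantaneous batch size per device = 1"
    gradient_accumulation_steps=1,   # "Gradient Accumulation steps = 1"
    num_train_epochs=6,              # ignored because max_steps is set
    max_steps=500,                   # -> "Total optimization steps = 500"
)
print(args.max_steps)  # 500
```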

We-IOT commented 8 months ago

This step can be a bit slow, but if you wait a while longer it should get past it; it doesn't look like anything has gone wrong.

xixia03 commented 8 months ago

[Crying] I've been waiting at least an hour. There are no logs under the output directory either, the dataset only has 80-odd entries, GPU utilization stays at 0, and it just sits there with nothing to go on.

(fine-tuning) root@cg:/home/jovyan/work/chatglm/ChatGLM3-main/finetune_demo/output/runs/Mar04_10-10-11_cg# ls -l
total 7
-rw-r--r--. 1 3000 3000 7136 Mar 4 10:10 events.out.tfevents.1709547023.cg.39432.0
(fine-tuning) root@cg:/home/jovyan/work/chatglm/ChatGLM3-main/finetune_demo/output/runs/Mar04_10-10-11_cg#
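
Given the bitsandbytes "compiled without GPU support" warning above and the 0% GPU utilization, one quick sanity check is whether CUDA is visible from Python in that environment at all. A minimal sketch, assuming PyTorch is installed in the same conda environment:

```python
# Minimal CUDA visibility check for the fine-tuning environment
# (assumes PyTorch is installed there; bitsandbytes' own GPU check is separate).
import torch

print("torch version :", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device count  :", torch.cuda.device_count())
    print("device 0      :", torch.cuda.get_device_name(0))
```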

xixia03 commented 8 months ago

Maybe the transformers version was too new. I set up a fresh environment with transformers 4.37.0 and now it runs normally.
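
For reference, a quick way to confirm which versions the active environment actually resolves (4.37.0 is the transformers version reported to work here; that peft is installed for the LoRA setup is an assumption):

```python
# Print the library versions the LoRA fine-tuning run will actually use
# (4.37.0 is the transformers version reported to work in this thread).
import transformers
import peft

print("transformers:", transformers.__version__)
print("peft        :", peft.__version__)
```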

We-IOT commented 8 months ago

Glad that sorted it. I usually create a conda environment and install exactly what requirements.txt specifies; that way I rarely run into unexpected issues.
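
A small sketch along the same lines: compare the versions installed in the current environment against the pins in requirements.txt (the file name and the simple name==version pin format are assumptions):

```python
# Compare installed package versions against simple "name==version" pins
# in requirements.txt (file name and pin format are assumptions).
from importlib.metadata import version, PackageNotFoundError
from pathlib import Path

for line in Path("requirements.txt").read_text().splitlines():
    line = line.strip()
    if not line or line.startswith("#") or "==" not in line:
        continue  # this sketch only handles exact pins
    name, wanted = line.split("==", 1)
    try:
        installed = version(name)
    except PackageNotFoundError:
        installed = "not installed"
    status = "OK" if installed == wanted else "MISMATCH"
    print(f"{name}: wanted {wanted}, installed {installed} [{status}]")
```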