axolotl-ai-cloud / axolotl
https://axolotl-ai-cloud.github.io/axolotl/

RuntimeError: PytorchStreamReader failed reading file data/0: invalid header or archive is corrupted #1156

Open vip-china opened 9 months ago

vip-china commented 9 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

The correct LoRA adapter is generated after training completes.

Current behaviour

Running the following command to merge the model produces the `PytorchStreamReader` error in the issue title:

```shell
python3 -m axolotl.cli.merge_lora sft_34b.yml \
  --lora_model_dir="/workspace/axolotl/output/Yi-34B/ljf-yi-34b-lora" \
  --output_dir=/data1/ljf2/data-check-test
```

(screenshot of the error omitted)

Steps to reproduce

Additionally, the `save_safetensors: true` parameter does not take effect: training actually generates `adapter_model.bin` rather than `adapter_model.safetensors`.
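For reference, a minimal sketch of the config fragment in question (only this key is from the report; the comments describe the expected behavior, not verified output):

```yaml
# With this set, the adapter weights should be written as
# adapter_model.safetensors, but the reporter observes
# adapter_model.bin instead.
save_safetensors: true
```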

Config yaml

No response

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

winglian commented 9 months ago

Is the issue that the merge doesn't work, or that specifying `save_safetensors` produces a `pytorch_model.bin`, or both?

vip-china commented 9 months ago

Specifying `save_safetensors` produces a `pytorch_model.bin`.

NanoCode012 commented 9 months ago

> Using the following command to merge models, there is an error message:

Hey, it seems like the PeftModel loading failed. Can you check that the files in `lora_model_dir` are valid (i.e., not just a few KB)?

Did you run out of space during training?

Since the training had a few checkpoints, could you try pointing the model_dir to one of those and see what happens?
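The checks above can be sketched as follows. `LORA_DIR` is a placeholder; point it at the actual `lora_model_dir` (or one of the `checkpoint-*` directories). The 1 MB threshold is an assumption for "a few KBs", not an axolotl convention:

```shell
# Placeholder path; substitute your actual adapter or checkpoint directory.
LORA_DIR="${LORA_DIR:-.}"

# List the adapter files with human-readable sizes; a healthy LoRA
# adapter for a 34B model is typically tens to hundreds of MB.
ls -lh "$LORA_DIR"

# Flag adapter weight files smaller than 1 MB -- a sign the save was
# truncated, e.g. by running out of disk space during training.
find "$LORA_DIR" -maxdepth 1 -name 'adapter_model.*' -size -1M -print

# Check free space on the volume holding the training output.
df -h "$LORA_DIR"
```

If `find` prints anything, the adapter file is almost certainly truncated and re-saving (or falling back to an earlier checkpoint) is needed before merging.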

> The meaning of this parameter is not effective save_safetensors: true

I'm a bit confused by this, as you said the merge failed.

vip-china commented 9 months ago

Now I am resuming SFT training from a checkpoint and getting this error again.

I have configured these parameters:

```yaml
use_reentrant: true
resume_from_checkpoint: /workspace/axolotl-main/checkpoint-5865
```

(screenshots of the error omitted)

winglian commented 9 months ago

Resuming from a "peft checkpoint" is not the same as resuming from a regular checkpoint. You'll want to set `lora_model_dir` to point to the checkpoint directory, iirc. @NanoCode012 does that sound right?
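A sketch of that suggestion as a config change, reusing the checkpoint path from the earlier comment (hedged: winglian says "iirc", and the exact interaction between these two keys may differ by axolotl version):

```yaml
# Point lora_model_dir at the PEFT checkpoint to continue from the
# adapter weights, instead of using resume_from_checkpoint.
lora_model_dir: /workspace/axolotl-main/checkpoint-5865
# resume_from_checkpoint: /workspace/axolotl-main/checkpoint-5865  # not used here
```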