axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Model saving issue after training #1842

Open gothaleshubham opened 2 months ago

gothaleshubham commented 2 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Expected behaviour is that the model weights are saved in the output directory.

Current Behaviour

Only the tokenizer files are saved in the output directory; the model weights are missing even after training has completed. I am using 2 GPUs to train the model. During saving I can see utilization on GPU 2, as shown below, but the weights are never written and the process is stuck. Screenshot from 2024-08-21 10-36-41 error.log

Screenshot from 2024-08-21 10-45-09

Steps to reproduce

Data: train_data_1K.json

```shell
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl

pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
cd ..
accelerate launch -m axolotl.cli.train model.yaml
```

Config yaml

```yaml
base_model: gpt2
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

# huggingface repo
datasets:
  - path: train_data_1K.jsonl
    ds_type: json
    type: chat_template
    chat_template: gemma
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    roles:
      user:
        - human
      assistant:
        - gpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.5
output_dir: output

sequence_len: 2048
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
debug:
weight_decay: 0.0

special_tokens:
```
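One quick way to confirm the symptom (tokenizer files present, weight files absent) is to inspect the output directory. Below is a minimal sketch; the weight file names are assumptions based on what `transformers`' `save_pretrained` typically writes, and may vary by version and serialization settings:

```python
import os

# Typical weight file names written by transformers' save_pretrained;
# sharded checkpoints use prefixed names like pytorch_model-00001-of-00002.bin.
WEIGHT_FILES = {"pytorch_model.bin", "model.safetensors"}


def has_model_weights(output_dir: str) -> bool:
    """Return True if the directory contains at least one known weight file."""
    try:
        files = set(os.listdir(output_dir))
    except FileNotFoundError:
        return False
    if files & WEIGHT_FILES:
        return True
    # Also match sharded checkpoint files.
    return any(
        f.startswith(("pytorch_model-", "model-"))
        and f.endswith((".bin", ".safetensors"))
        for f in files
    )
```

If this returns False after training finishes, only the tokenizer artifacts made it to disk, which matches the behaviour reported above.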

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

winglian commented 2 months ago

A couple of things: I would recommend disabling sample packing with the gpt2 model, as it doesn't really support it since it lacks flash attention support. Also, the max sequence length for gpt2 is 1024. When I tried it with the 2048 setting you are using, it ran into a CUDA/NCCL issue, which is likely what you were seeing.

I think once you make these changes, it should be fine, as it saved properly for me after that.
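The changes suggested above would look roughly like this against the posted model.yaml (a sketch; the values come directly from the comment):

```yaml
# gpt2 lacks flash attention support, so disable sample packing
sample_packing: false
eval_sample_packing: false

# gpt2's maximum sequence length is 1024
sequence_len: 1024
```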