axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Model saving issue after training #1842

Open gothaleshubham opened 2 months ago

gothaleshubham commented 2 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Expected behaviour is that the model weights are saved in the output directory.

Current Behaviour

Only the tokenizer files are saved in the output directory; the model weights are missing even after training has completed. I am using 2 GPUs to train the model. During saving I can see utilization on GPU 2, as shown below, but the weights are never written and the process is stuck. Screenshot from 2024-08-21 10-36-41 error.log

Screenshot from 2024-08-21 10-45-09

Steps to reproduce

Data: train_data_1K.json

```shell
git clone https://github.com/axolotl-ai-cloud/axolotl
cd axolotl

pip3 install packaging ninja
pip3 install -e '.[flash-attn,deepspeed]'
cd ..
accelerate launch -m axolotl.cli.train model.yaml
```

Config yaml

```yaml
base_model: gpt2
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false
strict: false

# huggingface repo
datasets:
  - path: train_data_1K.jsonl
    ds_type: json
    type: chat_template
    chat_template: gemma
    field_messages: conversations
    message_field_role: from
    message_field_content: value
    roles:
      user:
        - human
      assistant:
        - gpt
dataset_prepared_path: last_run_prepared
val_set_size: 0.5
output_dir: output

sequence_len: 2048
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

wandb_project:
wandb_entity:
wandb_watch:
wandb_name:
wandb_log_model:

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 2e-5

train_on_inputs: false
group_by_length: false
bf16: auto
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
flash_attention: false

warmup_steps: 100
evals_per_epoch: 1
eval_table_size:
debug:
weight_decay: 0.0

special_tokens:
```
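One quick way to confirm the symptom (tokenizer files present, weight files absent) is to inspect the output directory. Below is a minimal sketch; the weight file names are assumptions based on what `transformers`' `save_pretrained` typically writes, and may vary by version and serialization settings:

```python
import os

# Typical weight file names written by transformers' save_pretrained;
# sharded checkpoints use prefixed names like pytorch_model-00001-of-00002.bin.
WEIGHT_FILES = {"pytorch_model.bin", "model.safetensors"}


def has_model_weights(output_dir: str) -> bool:
    """Return True if the directory contains at least one known weight file."""
    try:
        files = set(os.listdir(output_dir))
    except FileNotFoundError:
        return False
    if files & WEIGHT_FILES:
        return True
    # Also match sharded checkpoint files.
    return any(
        f.startswith(("pytorch_model-", "model-"))
        and f.endswith((".bin", ".safetensors"))
        for f in files
    )
```

If this returns False after training finishes, only the tokenizer artifacts made it to disk, which matches the behaviour reported above.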

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

winglian commented 2 months ago

A couple of things: I would recommend disabling sample packing with the gpt2 model, as it doesn't really support it since it lacks flash attention support. Also, the max sequence length for gpt2 is 1024. When I tried it with the 2048 setting you are using, it ran into a CUDA/NCCL issue, which is likely what you were seeing.

I think once you make these changes, it should be fine, as it saved properly for me after that.
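The changes suggested above would look roughly like this against the posted model.yaml (a sketch; the values come directly from the comment):

```yaml
# gpt2 lacks flash attention support, so disable sample packing
sample_packing: false
eval_sample_packing: false

# gpt2's maximum sequence length is 1024
sequence_len: 1024
```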