axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Fail to save last checkpoint #1613

Closed: Nero10578 closed this issue 5 months ago

Nero10578 commented 6 months ago

Please check that this issue hasn't been reported before.

Expected Behavior

Expected behavior is to save the last checkpoint, just like the earlier intermediate checkpoints. It has now failed to save the final checkpoint multiple times. I am running this on Ubuntu under WSL2 on Windows 11.

Current behaviour

At the end of a training run, it will not save the last checkpoint.

{'loss': 0.3905, 'grad_norm': 0.240234375, 'learning_rate': 2.0071391760856373e-10, 'epoch': 2.0}
{'loss': 0.3887, 'grad_norm': 0.259765625, 'learning_rate': 0.0, 'epoch': 2.0}
{'train_runtime': 23410.1952, 'train_samples_per_second': 16.764, 'train_steps_per_second': 0.067, 'train_loss': 0.09891741436697231, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:30:10<00:00, 14.84s/it]
[2024-05-13 14:47:28,053] [INFO] [axolotl.train.log:61] [PID:123985] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
/home/owen/miniconda3/envs/axolotl/lib/python3.10/site-packages/peft/utils/save_and_load.py:154: UserWarning: Could not find a config file in /home/owen/models/Meta-Llama-3-8B-Instruct - will assume that the vocabulary was not modified.

Nothing appears to go wrong, as shown above.

Steps to reproduce

Just run any training job; both the SFT and DPO runs I've tried failed to save the last checkpoint. I'm not sure whether something is wrong in my training config yaml or whether it's a bug in Axolotl.

I've tried both enabling and disabling wandb, since that caused a similar issue a few months ago, but this time it made no difference.

Config yaml

base_model: /home/owen/models/Meta-Llama-3-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

train_on_inputs: false
group_by_length: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 2048
bf16: true
fp16: false
tf32: false
flash_attention: true

# Data
datasets:
  - path: /home/owen/datasets/no-robots-sharegpt.jsonl
    type: sharegpt
    conversation: llama-3
  - path: /home/owen/datasets/fixed-dolphin201-sharegpt2.jsonl
    type: sharegpt
    conversation: llama-3
  - path: /home/owen/datasets/cleaned-WizardLM_alpaca_evol_instruct_70k.jsonl
    type: sharegpt
    conversation: llama-3

warmup_steps: 10
dataset_prepared_path: ./last_run_prepared

# Iterations
num_epochs: 2
saves_per_epoch: 2

# Evaluation
val_set_size: 0.01
eval_table_size:
eval_table_max_new_tokens:
eval_sample_packing: false
evals_per_epoch: 4

# LoRA
output_dir: ./qlora-out
adapter: qlora
lora_model_dir:
lora_r: 64
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
save_safetensors: true

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 32
micro_batch_size: 2
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: true

# wandb
wandb_mode: disabled

# Optimizer
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

# Misc
early_stopping_patience:
auto_resume_from_checkpoints: true
logging_steps: 1
debug:
deepspeed:
weight_decay: 0.1
special_tokens:
  eos_token: "<|eot_id|>"
  pad_token: "<|end_of_text|>"
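
For reference, with num_epochs: 2, saves_per_epoch: 2, and evals_per_epoch: 4, axolotl appears to hand the HF Trainer fractional save_steps/eval_steps ratios rather than absolute step counts; the 0.25 and 0.125 values in the resume warnings further down this thread are consistent with that. A minimal sketch of the assumed mapping (not axolotl's actual code):

```python
# Assumed mapping from per-epoch counts to the fractional ratios the Trainer sees.
# This is an illustration consistent with the 0.25 / 0.125 warnings below,
# not the actual axolotl implementation.
num_epochs = 2
saves_per_epoch = 2
evals_per_epoch = 4

save_steps_ratio = 1.0 / (saves_per_epoch * num_epochs)  # 0.25  -> 4 saves over the run
eval_steps_ratio = 1.0 / (evals_per_epoch * num_epochs)  # 0.125 -> 8 evals over the run
print(save_steps_ratio, eval_steps_ratio)
```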

Possible solution

No response

Which Operating Systems are you using?

Linux (Ubuntu under WSL2 on Windows 11)

Python Version

3.11.9

axolotl branch-commit

2147cf6837e2b90a2ea7045262083cbb0da03858

Acknowledgements

winglian commented 6 months ago

hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!

Nero10578 commented 6 months ago

hi @Nero10578 , can you verify for me what checkpoint step numbers it did save on as well as the total number of steps in the training? thanks!

It saved on these checkpoints: [screenshot: 2024-05-13 231731]

There should be 1578 steps in total, as can be seen here:

{'train_runtime': 23470.402, 'train_samples_per_second': 16.721, 'train_steps_per_second': 0.067, 'train_loss': 0.09891688005930269, 'epoch': 2.0}
100%|█████████████████████████████████████████████████████████████████████████████| 1578/1578 [6:31:06<00:00, 14.87s/it]
[2024-05-13 22:54:29,715] [INFO] [axolotl.train.log:61] [PID:802] [RANK:0] Training Completed!!! Saving pre-trained model to ./qlora-out
wandb:
wandb: Run history:
wandb:               eval/loss ▁█
wandb:            eval/runtime ▁█
wandb: eval/samples_per_second █▁
wandb:   eval/steps_per_second █▁
wandb:             train/epoch ▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▆▇▇▇▇▇███
wandb:       train/global_step ▁▁▁▂▂▂▂▂▂▃▃▃▃▃▄▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
wandb:         train/grad_norm ▁▅▃█▃▃▄▂▂▃▃▂▃▆▄▃▂▅▃▃▃▂▄█▅▄▄▃▁▂▄▂▄▃▃▃▃▄▂▂
wandb:     train/learning_rate ██▇▇▇▆▆▆▆▅▅▅▅▄▄▄▄▃▃▃▃▃▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁
wandb:              train/loss ▇▃▆▅▄▄▅▄▅▄▂▅▄▆▃▅▄▅▅▅▅▃▄▄█▃▄▃▃▃▃▅▁▄▄▆▇▄▆▄
wandb:
wandb: Run summary:
wandb:                eval/loss 0.56355
wandb:             eval/runtime 563.5022
wandb:  eval/samples_per_second 3.519
wandb:    eval/steps_per_second 1.76
wandb:               total_flos 1.01735945815311e+19
wandb:              train/epoch 2.0
wandb:        train/global_step 1578
wandb:          train/grad_norm 0.26172
wandb:      train/learning_rate 0.0
wandb:               train/loss 0.3887
wandb:               train_loss 0.09892
wandb:            train_runtime 23470.402
wandb: train_samples_per_second 16.721
wandb:   train_steps_per_second 0.067

I've tried running it again to see if it's a fluke, and no, it's still failing to save at the end. I've also tried with a very short test dataset, and that saves fine. Is there something wrong with my set number of steps? It says this when resuming:

[2024-05-13 23:20:40,426] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
[2024-05-13 23:20:40,583] [INFO] [axolotl.utils.samplers.multipack._len_est:184] [PID:100043] [RANK:0] packing_efficiency_estimate: 0.93 total_num_tokens per device: 97218015
Warning: The training argument 'eval_steps' value (0.125) does not match the trainer state 'eval_steps' value (198). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
Warning: The training argument 'save_steps' value (0.25) does not match the trainer state 'save_steps' value (395). This argument will be overridden by the one found in trainer_state.json within the checkpoint directory.
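
These warnings suggest the fractional eval_steps/save_steps ratios were resolved into absolute step counts when the run first started and then frozen into trainer_state.json, so the resumed run keeps 198/395 instead of recomputing them. A minimal sketch of that conversion, assuming ceiling-style rounding (which matches the numbers above):

```python
import math

max_steps = 1578    # total optimizer steps shown by the progress bar

# Fractional ratios (from the config) resolved against the total step count.
# Illustrative arithmetic only; the rounding behavior is inferred from the
# 395 / 198 values reported in the warnings.
save_steps = math.ceil(0.25 * max_steps)   # ceil(394.5)  = 395
eval_steps = math.ceil(0.125 * max_steps)  # ceil(197.25) = 198
print(save_steps, eval_steps)
```
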
winglian commented 5 months ago

it looks like the reason it didn't save the last step is that it is saving every 395 steps, so the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.

winglian commented 5 months ago

@Nero10578 Fixed in #1615

Nero10578 commented 5 months ago

it looks like the reason it didn't save the last step is that it is saving every 395 steps, so the next step it would save at is 1580, but your last step is 1578. Let me see if there is a good way to work around that.

Awesome fix! Thank you for all your work on this! So essentially this was just a problem with odd save-step intervals? That explains why it only happens sometimes.

winglian commented 5 months ago

yeah, what's happening is you have 1578 steps, and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps and misses the last step. It might be worth raising an upstream issue with HF transformers to use math.floor instead.

[Screenshot: 2024-05-14 at 9:35:24 AM]
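
To make that arithmetic concrete, here is a small illustrative sketch (not the actual transformers code) of the resulting checkpoint schedule:

```python
import math

max_steps = 1578
save_steps = math.ceil(0.25 * max_steps)   # 395, from ceiling rounding

# Steps at which "save every N steps" fires during the run.
checkpoint_steps = list(range(save_steps, max_steps + 1, save_steps))
print(checkpoint_steps)                 # [395, 790, 1185]; the next save would be step 1580
print(max_steps in checkpoint_steps)    # False: step 1578 never triggers a checkpoint

# Note: floor rounding alone (interval 394 -> saves at 394, 788, 1182, 1576) would
# still miss step 1578 here, so some explicit save at the end of training is needed
# to guarantee the final step is kept.
```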
Nero10578 commented 5 months ago

yeah, what's happening is you have 1578 steps, and it saves 4 times, so 1578 * 0.25 = 394.5. Since it uses ceiling, it saves every 395 steps and misses the last step. It might be worth raising an upstream issue with HF transformers to use math.floor instead.

[Screenshot: 2024-05-14 at 9:35:24 AM]

Ah I see okay. Thanks for explaining that.