RuntimeError: !grad_accumulator_.expired() INTERNAL ASSERT FAILED

Please check that this issue hasn't been reported before.

[X] I searched previous bug reports and didn't find any similar reports.

Expected Behavior

SFT training should complete successfully.

Current behaviour

Unable to perform SFT training. The following error is reported:

I tried disabling flash_attention, but the same error was reported again. [Screenshot: 企业微信截图_17056403198640]

Steps to reproduce

config.yml:

base_model: /data1/ljf2/data/Nous-Hermes-2-Mixtral-8x7B-SFT
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
is_llama_derived_model: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 4096
bf16: true
fp16: false
tf32: false
flash_attention: true
trust_remote_code: true

model_config:
  output_router_logits: true

# special_tokens:
#   bos_token: "<|startoftext|>"
#   eos_token: "<|im_end|>"
#   unk_token: "<unk>"

#tokens:
#   - "<|im_start|>"

# Data
datasets:
  - path: /data1/ljf2/data/openhermes_1k.json
    type: alpaca
    prompt_style: chatml
dataset_prepared_path: last_run_prepared
warmup_steps: 100

# Iterations
num_epochs: 1

# Evaluation
val_set_size: 0.01
evals_per_epoch: 1
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
eval_batch_size: 1

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
#  - lm_head.*
#  - model.embed_tokens.*
#  - model.layers.2[0-9]+.block_sparse_moe.gate.*
#  - model.layers.2[0-9]+.block_sparse_moe.experts.*
#  - model.layers.3[0-9]+.block_sparse_moe.gate.*
#  - model.layers.3[0-9]+.block_sparse_moe.experts.*

# LoRA
output_dir: /workspace/axolotl/output/Nous-Hermes-2-Mixtral-8x7B-SFT-CyberGPT
adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

lora_modules_to_save:
  - embed_tokens
  - lm_head

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 1
micro_batch_size: 1
gradient_checkpointing: true

# wandb

# Optimizer
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0003

# Misc
train_on_inputs: false
group_by_length: false
save_steps: 0.01
save_total_limit: 2
#save_safetensors: true
early_stopping_patience:
resume_from_checkpoint:
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
local_rank:
logging_steps: 1
xformers_attention:
debug:
deepspeed: /data1/ljf2/data/zero3_bf16.json
weight_decay: 0.01
fsdp:
fsdp_config:
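
One detail worth checking in the config above, offered as a side note rather than a diagnosis of the assert: `lora_target_modules` lists `gate_proj`, `down_proj` and `up_proj`, while the log in the error output reports the linear modules actually found as `w1`, `w2`, `w3` and `gate`. The sketch below only diffs the two sets of names. The helper `compare_targets` is hypothetical and not part of axolotl.

```python
# Names listed under lora_target_modules in config.yml.
configured = {
    "gate_proj", "down_proj", "up_proj",
    "q_proj", "v_proj", "k_proj", "o_proj",
}

# Names copied from the "found linear modules" line in the log.
found_in_log = {"k_proj", "q_proj", "o_proj", "w3", "v_proj", "gate", "w2", "w1"}


def compare_targets(configured, found):
    """Hypothetical helper: return (configured-but-not-found, found-but-not-configured)."""
    return sorted(configured - found), sorted(found - configured)


missing, extra = compare_targets(configured, found_in_log)
print("configured but not found in model:", missing)
print("found in model but not configured:", extra)
```

Since `lora_target_linear: true` is set, the discovered names are presumably targeted anyway, so the three unmatched entries may be harmless. That is an assumption, not something the log confirms.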

Complete error output


root@f2e11ed3bbe4:/workspace/axolotl# sh qb.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `8`
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:106: UserWarning:
================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:186] [PID:6549] [RANK:1] BOS: 1 / <s>
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:187] [PID:6549] [RANK:1] PAD: 2 / </s>
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:188] [PID:6549] [RANK:1] UNK: 0 /
[2024-01-19 08:19:49,064] [INFO] [axolotl.load_tokenizer:193] [PID:6549] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:185] [PID:6555] [RANK:7] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:186] [PID:6555] [RANK:7] BOS: 1 / ~~[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:187] [PID:6555] [RANK:7] PAD: 2 /~~ [2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:188] [PID:6555] [RANK:7] UNK: 0 / [2024-01-19 08:19:49,070] [INFO] [axolotl.load_tokenizer:193] [PID:6555] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:185] [PID:6554] [RANK:6] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:186] [PID:6554] [RANK:6] BOS: 1 / ~~[2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:187] [PID:6554] [RANK:6] PAD: 2 /~~ [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:188] [PID:6554] [RANK:6] UNK: 0 / [2024-01-19 08:19:49,071] [INFO] [axolotl.load_tokenizer:193] [PID:6554] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:185] [PID:6548] [RANK:0] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:186] [PID:6548] [RANK:0] BOS: 1 / ~~[2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:187] [PID:6548] [RANK:0] PAD: 2 /~~ [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:188] [PID:6548] [RANK:0] UNK: 0 / [2024-01-19 08:19:49,077] [INFO] [axolotl.load_tokenizer:193] [PID:6548] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,078] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6548] [RANK:0] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:49,082] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6548] [RANK:0] Prepared dataset loaded from disk... 
[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:185] [PID:6551] [RANK:3] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:186] [PID:6551] [RANK:3] BOS: 1 / ~~[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:187] [PID:6551] [RANK:3] PAD: 2 /~~ [2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:188] [PID:6551] [RANK:3] UNK: 0 / [2024-01-19 08:19:49,084] [INFO] [axolotl.load_tokenizer:193] [PID:6551] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:185] [PID:6553] [RANK:5] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:186] [PID:6553] [RANK:5] BOS: 1 / ~~[2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:187] [PID:6553] [RANK:5] PAD: 2 /~~ [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:188] [PID:6553] [RANK:5] UNK: 0 / [2024-01-19 08:19:49,096] [INFO] [axolotl.load_tokenizer:193] [PID:6553] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:185] [PID:6550] [RANK:2] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:186] [PID:6550] [RANK:2] BOS: 1 / ~~[2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:187] [PID:6550] [RANK:2] PAD: 2 /~~ [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:188] [PID:6550] [RANK:2] UNK: 0 / [2024-01-19 08:19:49,122] [INFO] [axolotl.load_tokenizer:193] [PID:6550] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference. 
[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:185] [PID:6552] [RANK:4] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:186] [PID:6552] [RANK:4] BOS: 1 / ~~[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:187] [PID:6552] [RANK:4] PAD: 2 /~~ [2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:188] [PID:6552] [RANK:4] UNK: 0 / [2024-01-19 08:19:49,149] [INFO] [axolotl.load_tokenizer:193] [PID:6552] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6552] [RANK:4] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6553] [RANK:5] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6551] [RANK:3] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6550] [RANK:2] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6549] [RANK:1] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6554] [RANK:6] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6555] [RANK:7] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... 
[2024-01-19 08:19:50,535] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6551] [RANK:3] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6549] [RANK:1] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6552] [RANK:4] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6553] [RANK:5] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6550] [RANK:2] Prepared dataset loaded from disk... [2024-01-19 08:19:50,537] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6555] [RANK:7] Prepared dataset loaded from disk... [2024-01-19 08:19:50,537] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6554] [RANK:6] Prepared dataset loaded from disk... [2024-01-19 08:19:50,970] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] total_num_tokens: 515000 [2024-01-19 08:19:50,982] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] `total_supervised_tokens: 389502` [2024-01-19 08:19:54,923] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:54,923] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] data_loader_len: 119 [2024-01-19 08:19:55,127] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,134] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,158] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,181] [INFO] 
[axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,230] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,287] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,539] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,569] [INFO] [axolotl.log:60] [PID:6548] [RANK:0] sample_packing_eff_est across ranks: [0.904549777507782, 0.8917192816734314, 0.8980887532234192, 0.904549777507782, 0.8917192816734314, 0.8980887532234192, 0.904549777507782, 0.8980887532234192] [2024-01-19 08:19:55,570] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] sample_packing_eff_est: 0.91 [2024-01-19 08:19:55,570] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] total_num_steps: 14 [2024-01-19 08:19:55,577] [DEBUG] [axolotl.train.log:60] [PID:6548] [RANK:0] loading tokenizer... /data1/ljf2/data/Nous-Hermes-2-Mixtral-8x7B-SFT [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:185] [PID:6548] [RANK:0] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:186] [PID:6548] [RANK:0] BOS: 1 / ~~[2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:187] [PID:6548] [RANK:0] PAD: 2 /~~ [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:188] [PID:6548] [RANK:0] UNK: 0 / [2024-01-19 08:19:55,628] [INFO] [axolotl.load_tokenizer:193] [PID:6548] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,629] [DEBUG] [axolotl.train.log:60] [PID:6548] [RANK:0] loading model and peft_config... 
[2024-01-19 08:19:55,637] [INFO] [axolotl.load_model:264] [PID:6548] [RANK:0] patching with flash attention [2024-01-19 08:19:55,637] [INFO] [axolotl.load_model:276] [PID:6548] [RANK:0] patching with flash attention [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:185] [PID:6552] [RANK:4] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:186] [PID:6552] [RANK:4] BOS: 1 / ~~[2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:187] [PID:6552] [RANK:4] PAD: 2 /~~ [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:188] [PID:6552] [RANK:4] UNK: 0 / [2024-01-19 08:19:55,638] [INFO] [axolotl.load_tokenizer:193] [PID:6552] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:185] [PID:6549] [RANK:1] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6549] [RANK:1] BOS: 1 / ~~[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6549] [RANK:1] PAD: 2 /~~ [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:188] [PID:6549] [RANK:1] UNK: 0 / [2024-01-19 08:19:55,639] [INFO] [axolotl.load_tokenizer:193] [PID:6549] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference. 
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:185] [PID:6555] [RANK:7] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6555] [RANK:7] BOS: 1 / ~~[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:185] [PID:6553] [RANK:5] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6555] [RANK:7] PAD: 2 /~~ [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6553] [RANK:5] BOS: 1 / ~~[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:188] [PID:6555] [RANK:7] UNK: 0 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6553] [RANK:5] PAD: 2 /~~ [2024-01-19 08:19:55,639] [INFO] [axolotl.load_tokenizer:193] [PID:6555] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:188] [PID:6553] [RANK:5] UNK: 0 / [2024-01-19 08:19:55,640] [INFO] [axolotl.load_tokenizer:193] [PID:6553] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:185] [PID:6551] [RANK:3] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:186] [PID:6551] [RANK:3] BOS: 1 / ~~[2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:187] [PID:6551] [RANK:3] PAD: 2 /~~ [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:188] [PID:6551] [RANK:3] UNK: 0 / [2024-01-19 08:19:55,640] [INFO] [axolotl.load_tokenizer:193] [PID:6551] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference. 
[2024-01-19 08:19:55,641] [DEBUG] [axolotl.load_tokenizer:185] [PID:6550] [RANK:2] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:186] [PID:6550] [RANK:2] BOS: 1 / ~~[2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:187] [PID:6550] [RANK:2] PAD: 2 /~~ [2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:188] [PID:6550] [RANK:2] UNK: 0 / [2024-01-19 08:19:55,642] [INFO] [axolotl.load_tokenizer:193] [PID:6550] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,646] [INFO] [axolotl.load_model:264] [PID:6549] [RANK:1] patching with flash attention [2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:276] [PID:6549] [RANK:1] patching with flash attention [2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:264] [PID:6552] [RANK:4] patching with flash attention [2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:276] [PID:6552] [RANK:4] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6553] [RANK:5] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6555] [RANK:7] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6551] [RANK:3] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6553] [RANK:5] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6555] [RANK:7] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6551] [RANK:3] patching with flash attention [2024-01-19 08:19:55,650] [INFO] [axolotl.load_model:264] [PID:6550] [RANK:2] patching with flash attention [2024-01-19 08:19:55,650] [INFO] [axolotl.load_model:276] [PID:6550] [RANK:2] patching with flash attention [2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:185] [PID:6554] [RANK:6] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,666] [DEBUG] 
[axolotl.load_tokenizer:186] [PID:6554] [RANK:6] BOS: 1 / ~~[2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:187] [PID:6554] [RANK:6] PAD: 2 /~~ [2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:188] [PID:6554] [RANK:6] UNK: 0 / [2024-01-19 08:19:55,666] [INFO] [axolotl.load_tokenizer:193] [PID:6554] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,680] [INFO] [axolotl.load_model:264] [PID:6554] [RANK:6] patching with flash attention [2024-01-19 08:19:55,681] [INFO] [axolotl.load_model:276] [PID:6554] [RANK:6] patching with flash attention Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.86s/it] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.88s/it] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.88s/it] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.90s/it] [2024-01-19 08:21:53,674] [INFO] [axolotl.load_model:558] [PID:6554] [RANK:6] GPU memory usage after model load: 23.333GB (+0.636GB cache, +1.045GB misc) [2024-01-19 08:21:53,680] [INFO] [axolotl.load_model:581] [PID:6554] [RANK:6] converting PEFT model w/ 
prepare_model_for_kbit_training [2024-01-19 08:21:53,696] [INFO] [axolotl.load_model:593] [PID:6554] [RANK:6] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:53,702] [INFO] [axolotl.load_lora:698] [PID:6554] [RANK:6] found linear modules: ['k_proj', 'q_proj', 'o_proj', 'w3', 'v_proj', 'gate', 'w2', 'w1'] [2024-01-19 08:21:53,730] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6554] CUDA extension not installed. [2024-01-19 08:21:53,730] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6554] CUDA extension not installed. [2024-01-19 08:21:53,964] [INFO] [axolotl.load_model:558] [PID:6555] [RANK:7] GPU memory usage after model load: 23.333GB (+0.779GB cache, +1.006GB misc) Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.93s/it] [2024-01-19 08:21:53,970] [INFO] [axolotl.load_model:581] [PID:6555] [RANK:7] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:53,986] [INFO] [axolotl.load_model:593] [PID:6555] [RANK:7] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:53,992] [INFO] [axolotl.load_lora:698] [PID:6555] [RANK:7] found linear modules: ['w1', 'w2', 'v_proj', 'k_proj', 'w3', 'o_proj', 'gate', 'q_proj'] [2024-01-19 08:21:54,018] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6555] CUDA extension not installed. [2024-01-19 08:21:54,019] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6555] CUDA extension not installed. 
[2024-01-19 08:21:54,019] [INFO] [axolotl.load_model:558] [PID:6552] [RANK:4] GPU memory usage after model load: 23.333GB (+0.603GB cache, +1.045GB misc) [2024-01-19 08:21:54,026] [INFO] [axolotl.load_model:581] [PID:6552] [RANK:4] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:54,042] [INFO] [axolotl.load_model:593] [PID:6552] [RANK:4] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:54,047] [INFO] [axolotl.load_lora:698] [PID:6552] [RANK:4] found linear modules: ['v_proj', 'k_proj', 'gate', 'w1', 'q_proj', 'w3', 'w2', 'o_proj'] [2024-01-19 08:21:54,075] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6552] CUDA extension not installed. [2024-01-19 08:21:54,075] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6552] CUDA extension not installed. Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.94s/it] [2024-01-19 08:21:54,361] [INFO] [axolotl.load_model:558] [PID:6553] [RANK:5] GPU memory usage after model load: 23.333GB (+0.669GB cache, +1.045GB misc) [2024-01-19 08:21:54,368] [INFO] [axolotl.load_model:581] [PID:6553] [RANK:5] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:54,386] [INFO] [axolotl.load_model:593] [PID:6553] [RANK:5] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:54,393] [INFO] [axolotl.load_lora:698] [PID:6553] [RANK:5] found linear modules: ['k_proj', 'q_proj', 'w1', 'o_proj', 'gate', 'v_proj', 'w2', 'w3'] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.94s/it] [2024-01-19 
08:21:54,421] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6553] CUDA extension not installed. [2024-01-19 08:21:54,421] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6553] CUDA extension not installed. Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:53<00:00, 5.97s/it] [2024-01-19 08:21:54,772] [INFO] [axolotl.load_model:558] [PID:6548] [RANK:0] GPU memory usage after model load: 23.333GB (+0.817GB cache, +1.162GB misc) [2024-01-19 08:21:54,779] [INFO] [axolotl.load_model:581] [PID:6548] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:54,795] [INFO] [axolotl.load_model:593] [PID:6548] [RANK:0] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:54,801] [INFO] [axolotl.load_lora:698] [PID:6548] [RANK:0] found linear modules: ['v_proj', 'w3', 'q_proj', 'k_proj', 'o_proj', 'w1', 'w2', 'gate'] [2024-01-19 08:21:54,827] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6548] CUDA extension not installed. [2024-01-19 08:21:54,827] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6548] CUDA extension not installed. 
[2024-01-19 08:21:55,145] [INFO] [axolotl.load_model:558] [PID:6550] [RANK:2] GPU memory usage after model load: 23.333GB (+0.722GB cache, +1.045GB misc) [2024-01-19 08:21:55,151] [INFO] [axolotl.load_model:581] [PID:6550] [RANK:2] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:55,167] [INFO] [axolotl.load_model:593] [PID:6550] [RANK:2] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:55,173] [INFO] [axolotl.load_lora:698] [PID:6550] [RANK:2] found linear modules: ['v_proj', 'q_proj', 'gate', 'k_proj', 'w3', 'w1', 'w2', 'o_proj'] [2024-01-19 08:21:55,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6550] CUDA extension not installed. [2024-01-19 08:21:55,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6550] CUDA extension not installed. [2024-01-19 08:21:55,259] [INFO] [axolotl.load_model:558] [PID:6551] [RANK:3] GPU memory usage after model load: 23.333GB (+0.706GB cache, +1.045GB misc) [2024-01-19 08:21:55,267] [INFO] [axolotl.load_model:581] [PID:6551] [RANK:3] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:55,284] [INFO] [axolotl.load_model:593] [PID:6551] [RANK:3] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:55,291] [INFO] [axolotl.load_lora:698] [PID:6551] [RANK:3] found linear modules: ['v_proj', 'w3', 'o_proj', 'k_proj', 'w1', 'gate', 'q_proj', 'w2'] [2024-01-19 08:21:55,321] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6551] CUDA extension not installed. [2024-01-19 08:21:55,321] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6551] CUDA extension not installed. 
[2024-01-19 08:21:55,519] [INFO] [axolotl.load_model:558] [PID:6549] [RANK:1] GPU memory usage after model load: 23.333GB (+0.595GB cache, +1.006GB misc) [2024-01-19 08:21:55,526] [INFO] [axolotl.load_model:581] [PID:6549] [RANK:1] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:55,542] [INFO] [axolotl.load_model:593] [PID:6549] [RANK:1] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:55,548] [INFO] [axolotl.load_lora:698] [PID:6549] [RANK:1] found linear modules: ['gate', 'w2', 'q_proj', 'w3', 'w1', 'v_proj', 'o_proj', 'k_proj'] [2024-01-19 08:21:55,575] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6549] CUDA extension not installed. [2024-01-19 08:21:55,576] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6549] CUDA extension not installed. trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:21:58,941] [INFO] [axolotl.load_model:625] [PID:6554] [RANK:6] GPU memory usage after adapters: 25.703GB (+0.062GB cache, +1.045GB misc) trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:21:59,286] [INFO] [axolotl.load_model:625] [PID:6555] [RANK:7] GPU memory usage after adapters: 25.704GB (+0.067GB cache, +1.006GB misc) [2024-01-19 08:21:59,297] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,297] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,298] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,299] [INFO] 
[axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,322] [INFO] [axolotl.load_model:625] [PID:6552] [RANK:4] GPU memory usage after adapters: 25.694GB (+0.078GB cache, +1.045GB misc) [2024-01-19 08:21:59,644] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,645] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,646] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,646] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,662] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,663] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,664] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,664] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:21:59,758] [INFO] [axolotl.load_model:625] [PID:6553] [RANK:5] GPU memory usage after adapters: 25.701GB (+0.079GB cache, +1.045GB misc) trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 
[2024-01-19 08:22:00,088] [INFO] [axolotl.load_model:625] [PID:6548] [RANK:0] GPU memory usage after adapters: 25.701GB (+0.071GB cache, +1.162GB misc)
[2024-01-19 08:22:00,097] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,098] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,099] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,100] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,133] [INFO] [axolotl.train.log:60] [PID:6548] [RANK:0] Pre-saving adapter config to /workspace/axolotl/output/Nous-Hermes-2-Mixtral-8x7B-SFT-CyberGPT
[2024-01-19 08:22:00,136] [INFO] [axolotl.train.log:60] [PID:6548] [RANK:0] Starting trainer...
[2024-01-19 08:22:00,464] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,465] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,466] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,466] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:22:00,602] [INFO] [axolotl.load_model:625] [PID:6551] [RANK:3] GPU memory usage after adapters: 25.705GB (+0.072GB cache, +1.045GB misc) trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:22:00,712] [INFO] [axolotl.load_model:625] [PID:6550] [RANK:2] GPU memory usage after adapters: 25.699GB (+0.075GB cache, +1.045GB misc) [2024-01-19 08:22:00,938] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,939] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,940] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,940] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:22:01,054] [INFO] 
[axolotl.load_model:625] [PID:6549] [RANK:1] GPU memory usage after adapters: 25.695GB (+0.069GB cache, +1.006GB misc) [2024-01-19 08:22:01,075] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,076] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,076] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,077] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,762] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,763] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,763] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,764] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.08674335479736328 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.10159158706665039 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.10232019424438477 seconds Time to load fused_adam op: 0.10149574279785156 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.10169720649719238 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.10174274444580078 seconds Time to load fused_adam op: 0.10324215888977051 seconds Loading extension module fused_adam... 
Time to load fused_adam op: 0.10189294815063477 seconds Parameter Offload: Total persistent parameters: 2895872 in 193 params [2024-01-19 08:22:14,416] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,417] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,432] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,433] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,435] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,436] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,438] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,439] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,455] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,456] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,456] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 
[2024-01-19 08:22:14,457] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,467] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,469] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
0%| | 0/16 [00:00
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 142, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2746, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1983, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 279, in backward
    q, k, v, out, softmax_lse, cu_seqlens, rng_state = ctx.saved_tensors
RuntimeError: !grad_accumulator_.expired() INTERNAL ASSERT FAILED at "../torch/csrc/autograd/saved_variable.cpp":226, please report a bug to PyTorch. No grad accumulator for a saved leaf

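For anyone triaging from the raw log, here is a minimal sketch of pulling the per-rank adapter memory and the trainable-parameter ratio out of it. The helper names `memory_by_rank` and `trainable_percent` are hypothetical (not part of axolotl), the regular expressions only assume the line shapes seen in this log, and the sample lines are copied from the output above.

```python
import re

# Sample lines copied from the log in this report.
LOG = """\
[2024-01-19 08:22:00,088] [INFO] [axolotl.load_model:625] [PID:6548] [RANK:0] GPU memory usage after adapters: 25.701GB (+0.071GB cache, +1.162GB misc)
[2024-01-19 08:22:00,602] [INFO] [axolotl.load_model:625] [PID:6551] [RANK:3] GPU memory usage after adapters: 25.705GB (+0.072GB cache, +1.045GB misc)
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
"""

# Patterns assume the exact wording used in the log lines above.
MEM_RE = re.compile(r"\[RANK:(\d+)\] GPU memory usage after adapters: ([\d.]+)GB")
PARAMS_RE = re.compile(r"trainable params: ([\d,]+) \|\| all params: ([\d,]+)")


def memory_by_rank(log: str) -> dict[int, float]:
    """Map each rank to the reported GPU memory (GB) after adapters are loaded."""
    return {int(rank): float(gb) for rank, gb in MEM_RE.findall(log)}


def trainable_percent(log: str) -> float:
    """Recompute the trainable% figure from the two raw parameter counts."""
    match = PARAMS_RE.search(log)
    if match is None:
        raise ValueError("no 'trainable params' line found")
    trainable, total = (int(group.replace(",", "")) for group in match.groups())
    return 100 * trainable / total


if __name__ == "__main__":
    print(memory_by_rank(LOG))
    print(f"{trainable_percent(LOG):.4f}% of parameters are trainable")
```

The same functions can be pointed at the full log text; ranks that never print a memory line are simply absent from the returned dictionary.
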
DeepSpeed ZeRO-3 config.json:

{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 2e9,
    "stage3_max_reuse_distance": 2e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
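
To make the relevant settings easier to see at a glance, the following is a small sketch that loads a trimmed copy of the JSON above and reports the ZeRO stage, the offload targets, and which top-level keys are left as `"auto"` (placeholders that the launching framework is expected to fill in). The function `summarize_zero_config` is a hypothetical helper written for this report, not a DeepSpeed or axolotl API.

```python
import json

# Trimmed copy of the ZeRO-3 JSON from this report (optimizer/scheduler omitted).
CONFIG_TEXT = """
{
  "bf16": {"enabled": true},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "offload_param": {"device": "cpu", "pin_memory": true},
    "stage3_max_live_parameters": 2e9
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
"""


def summarize_zero_config(cfg: dict) -> dict:
    """Hypothetical helper: collect the settings most relevant for triage."""
    zero = cfg.get("zero_optimization", {})
    return {
        "stage": zero.get("stage"),
        "bf16": cfg.get("bf16", {}).get("enabled", False),
        "offload_optimizer": zero.get("offload_optimizer", {}).get("device", "none"),
        "offload_param": zero.get("offload_param", {}).get("device", "none"),
        # Top-level keys whose value is the literal string "auto".
        "auto_keys": sorted(k for k, v in cfg.items() if v == "auto"),
    }


summary = summarize_zero_config(json.loads(CONFIG_TEXT))
print(summary)
```

For this report the summary shows stage 3 with both optimizer state and parameters offloaded to CPU, combined with `load_in_4bit: true`, `adapter: qlora`, and `flash_attention: true` from config.yml.
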

### Config yaml

_No response_

### Possible solution

_No response_

### Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

### Python Version

3.10

### axolotl branch-commit

main

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
