axolotl-ai-cloud / axolotl

Go ahead and axolotl questions
https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0
7.58k stars, 822 forks

RuntimeError: !grad_accumulator_.expired() INTERNAL ASSERT FAILED #1153

Open vip-china opened 8 months ago

vip-china commented 8 months ago

Expected Behavior

SFT training should complete successfully.

Current behaviour

Unable to perform SFT training. The following error is reported: [image]

I tried disabling flash_attention, but the following error was reported again: [screenshot 企业微信截图_17056403198640]

Steps to reproduce

config.yml:

base_model: /data1/ljf2/data/Nous-Hermes-2-Mixtral-8x7B-SFT
model_type: AutoModelForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true
is_llama_derived_model: false
load_in_8bit: false
load_in_4bit: true
strict: false
sequence_len: 4096
bf16: true
fp16: false
tf32: false
flash_attention: true
trust_remote_code: true

model_config:
  output_router_logits: true

# special_tokens:
#   bos_token: "<|startoftext|>"
#   eos_token: "<|im_end|>"
#   unk_token: "<unk>"

#tokens:
#   - "<|im_start|>"

# Data
datasets:
  - path: /data1/ljf2/data/openhermes_1k.json
    type: alpaca
    prompt_style: chatml
dataset_prepared_path: last_run_prepared
warmup_steps: 100

# Iterations
num_epochs: 1

# Evaluation
val_set_size: 0.01
evals_per_epoch: 1
eval_table_size:
eval_table_max_new_tokens: 128
eval_sample_packing: false
eval_batch_size: 1

## You can optionally freeze the entire model and unfreeze a subset of parameters
unfrozen_parameters:
#  - lm_head.*
#  - model.embed_tokens.*
#  - model.layers.2[0-9]+.block_sparse_moe.gate.*
#  - model.layers.2[0-9]+.block_sparse_moe.experts.*
#  - model.layers.3[0-9]+.block_sparse_moe.gate.*
#  - model.layers.3[0-9]+.block_sparse_moe.experts.*

# LoRA
output_dir: /workspace/axolotl/output/Nous-Hermes-2-Mixtral-8x7B-SFT-CyberGPT
adapter: qlora
lora_model_dir:
lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
  - gate_proj
  - down_proj
  - up_proj
  - q_proj
  - v_proj
  - k_proj
  - o_proj

lora_modules_to_save:
  - embed_tokens
  - lm_head

# Sampling
sample_packing: true
pad_to_sequence_len: true

# Batching
gradient_accumulation_steps: 1
micro_batch_size: 1
gradient_checkpointing: true

# wandb

# Optimizer
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.0003

# Misc
train_on_inputs: false
group_by_length: false
save_steps: 0.01
save_total_limit: 2
#save_safetensors: true
early_stopping_patience:
resume_from_checkpoint:
loss_watchdog_threshold: 5.0
loss_watchdog_patience: 3
local_rank:
logging_steps: 1
xformers_attention:
debug:
deepspeed: /data1/ljf2/data/zero3_bf16.json
weight_decay: 0.01
fsdp:
fsdp_config:
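
One detail that can be cross-checked from the report itself: the names under lora_target_modules in the config versus the names in the "found linear modules" line of the log further down. The sketch below only compares the two sets as they appear in this issue. It is a hypothetical helper, not part of axolotl, and it does not establish that the mismatch is related to the reported error. How axolotl combines this list with lora_target_linear: true is not shown here.

```python
# Hypothetical helper for this report only; not an axolotl API.

# Names listed under lora_target_modules in config.yml above.
configured = {
    "gate_proj", "down_proj", "up_proj",
    "q_proj", "v_proj", "k_proj", "o_proj",
}

# Names copied from the "found linear modules" log line of this run.
found = {"k_proj", "q_proj", "o_proj", "w3", "v_proj", "gate", "w2", "w1"}


def compare_targets(configured, found):
    """Return which configured names exist in the model and which do not."""
    return {
        "matched": sorted(configured & found),
        "missing_from_model": sorted(configured - found),
        "not_listed_in_config": sorted(found - configured),
    }


report = compare_targets(configured, found)
for key, names in report.items():
    print(f"{key}: {names}")
```

With the values from this issue, gate_proj, down_proj and up_proj do not appear among the modules the loader reported, while w1, w2, w3 and gate are reported but not listed in the config.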

Complete error output

root@f2e11ed3bbe4:/workspace/axolotl# sh qb.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
    `--num_processes` was set to a value of `8`
        More than one GPU was found, enabling multi-GPU training.
        If this was unintended please pass in `--num_processes=1`.
    `--num_machines` was set to a value of `1`
    `--mixed_precision` was set to a value of `'no'`
    `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:106: UserWarning:
================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:186] [PID:6549] [RANK:1] BOS: 1 /
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:187] [PID:6549] [RANK:1] PAD: 2 /
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:188] [PID:6549] [RANK:1] UNK: 0 /
[2024-01-19 08:19:49,064] [INFO] [axolotl.load_tokenizer:193] [PID:6549] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:185] [PID:6555] [RANK:7] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:186] [PID:6555] [RANK:7] BOS: 1 / [2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:187] [PID:6555] [RANK:7] PAD: 2 / [2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:188] [PID:6555] [RANK:7] UNK: 0 / [2024-01-19 08:19:49,070] [INFO] [axolotl.load_tokenizer:193] [PID:6555] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:185] [PID:6554] [RANK:6] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:186] [PID:6554] [RANK:6] BOS: 1 / [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:187] [PID:6554] [RANK:6] PAD: 2 / [2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:188] [PID:6554] [RANK:6] UNK: 0 / [2024-01-19 08:19:49,071] [INFO] [axolotl.load_tokenizer:193] [PID:6554] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:185] [PID:6548] [RANK:0] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:186] [PID:6548] [RANK:0] BOS: 1 / [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:187] [PID:6548] [RANK:0] PAD: 2 / [2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:188] [PID:6548] [RANK:0] UNK: 0 / [2024-01-19 08:19:49,077] [INFO] [axolotl.load_tokenizer:193] [PID:6548] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,078] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6548] [RANK:0] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:49,082] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6548] [RANK:0] Prepared dataset loaded from disk... 
[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:185] [PID:6551] [RANK:3] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:186] [PID:6551] [RANK:3] BOS: 1 / [2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:187] [PID:6551] [RANK:3] PAD: 2 / [2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:188] [PID:6551] [RANK:3] UNK: 0 / [2024-01-19 08:19:49,084] [INFO] [axolotl.load_tokenizer:193] [PID:6551] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:185] [PID:6553] [RANK:5] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:186] [PID:6553] [RANK:5] BOS: 1 / [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:187] [PID:6553] [RANK:5] PAD: 2 / [2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:188] [PID:6553] [RANK:5] UNK: 0 / [2024-01-19 08:19:49,096] [INFO] [axolotl.load_tokenizer:193] [PID:6553] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:185] [PID:6550] [RANK:2] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:186] [PID:6550] [RANK:2] BOS: 1 / [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:187] [PID:6550] [RANK:2] PAD: 2 / [2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:188] [PID:6550] [RANK:2] UNK: 0 / [2024-01-19 08:19:49,122] [INFO] [axolotl.load_tokenizer:193] [PID:6550] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference. 
[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:185] [PID:6552] [RANK:4] EOS: 32000 / <|im_end|> [2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:186] [PID:6552] [RANK:4] BOS: 1 / [2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:187] [PID:6552] [RANK:4] PAD: 2 / [2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:188] [PID:6552] [RANK:4] UNK: 0 / [2024-01-19 08:19:49,149] [INFO] [axolotl.load_tokenizer:193] [PID:6552] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6552] [RANK:4] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6553] [RANK:5] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6551] [RANK:3] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6550] [RANK:2] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6549] [RANK:1] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6554] [RANK:6] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... [2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6555] [RANK:7] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54... 
[2024-01-19 08:19:50,535] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6551] [RANK:3] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6549] [RANK:1] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6552] [RANK:4] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6553] [RANK:5] Prepared dataset loaded from disk... [2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6550] [RANK:2] Prepared dataset loaded from disk... [2024-01-19 08:19:50,537] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6555] [RANK:7] Prepared dataset loaded from disk... [2024-01-19 08:19:50,537] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6554] [RANK:6] Prepared dataset loaded from disk... [2024-01-19 08:19:50,970] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] total_num_tokens: 515000 [2024-01-19 08:19:50,982] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] `total_supervised_tokens: 389502` [2024-01-19 08:19:54,923] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:54,923] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] data_loader_len: 119 [2024-01-19 08:19:55,127] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,134] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,158] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,181] [INFO] 
[axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,230] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,287] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,539] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375 [2024-01-19 08:19:55,569] [INFO] [axolotl.log:60] [PID:6548] [RANK:0] sample_packing_eff_est across ranks: [0.904549777507782, 0.8917192816734314, 0.8980887532234192, 0.904549777507782, 0.8917192816734314, 0.8980887532234192, 0.904549777507782, 0.8980887532234192] [2024-01-19 08:19:55,570] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] sample_packing_eff_est: 0.91 [2024-01-19 08:19:55,570] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] total_num_steps: 14 [2024-01-19 08:19:55,577] [DEBUG] [axolotl.train.log:60] [PID:6548] [RANK:0] loading tokenizer... /data1/ljf2/data/Nous-Hermes-2-Mixtral-8x7B-SFT [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:185] [PID:6548] [RANK:0] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:186] [PID:6548] [RANK:0] BOS: 1 / [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:187] [PID:6548] [RANK:0] PAD: 2 / [2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:188] [PID:6548] [RANK:0] UNK: 0 / [2024-01-19 08:19:55,628] [INFO] [axolotl.load_tokenizer:193] [PID:6548] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,629] [DEBUG] [axolotl.train.log:60] [PID:6548] [RANK:0] loading model and peft_config... 
[2024-01-19 08:19:55,637] [INFO] [axolotl.load_model:264] [PID:6548] [RANK:0] patching with flash attention [2024-01-19 08:19:55,637] [INFO] [axolotl.load_model:276] [PID:6548] [RANK:0] patching with flash attention [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:185] [PID:6552] [RANK:4] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:186] [PID:6552] [RANK:4] BOS: 1 / [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:187] [PID:6552] [RANK:4] PAD: 2 / [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:188] [PID:6552] [RANK:4] UNK: 0 / [2024-01-19 08:19:55,638] [INFO] [axolotl.load_tokenizer:193] [PID:6552] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:185] [PID:6549] [RANK:1] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6549] [RANK:1] BOS: 1 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6549] [RANK:1] PAD: 2 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:188] [PID:6549] [RANK:1] UNK: 0 / [2024-01-19 08:19:55,639] [INFO] [axolotl.load_tokenizer:193] [PID:6549] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference. 
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:185] [PID:6555] [RANK:7] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6555] [RANK:7] BOS: 1 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:185] [PID:6553] [RANK:5] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6555] [RANK:7] PAD: 2 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6553] [RANK:5] BOS: 1 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:188] [PID:6555] [RANK:7] UNK: 0 / [2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6553] [RANK:5] PAD: 2 / [2024-01-19 08:19:55,639] [INFO] [axolotl.load_tokenizer:193] [PID:6555] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:188] [PID:6553] [RANK:5] UNK: 0 / [2024-01-19 08:19:55,640] [INFO] [axolotl.load_tokenizer:193] [PID:6553] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:185] [PID:6551] [RANK:3] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:186] [PID:6551] [RANK:3] BOS: 1 / [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:187] [PID:6551] [RANK:3] PAD: 2 / [2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:188] [PID:6551] [RANK:3] UNK: 0 / [2024-01-19 08:19:55,640] [INFO] [axolotl.load_tokenizer:193] [PID:6551] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference. 
[2024-01-19 08:19:55,641] [DEBUG] [axolotl.load_tokenizer:185] [PID:6550] [RANK:2] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:186] [PID:6550] [RANK:2] BOS: 1 / [2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:187] [PID:6550] [RANK:2] PAD: 2 / [2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:188] [PID:6550] [RANK:2] UNK: 0 / [2024-01-19 08:19:55,642] [INFO] [axolotl.load_tokenizer:193] [PID:6550] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,646] [INFO] [axolotl.load_model:264] [PID:6549] [RANK:1] patching with flash attention [2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:276] [PID:6549] [RANK:1] patching with flash attention [2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:264] [PID:6552] [RANK:4] patching with flash attention [2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:276] [PID:6552] [RANK:4] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6553] [RANK:5] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6555] [RANK:7] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6551] [RANK:3] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6553] [RANK:5] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6555] [RANK:7] patching with flash attention [2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6551] [RANK:3] patching with flash attention [2024-01-19 08:19:55,650] [INFO] [axolotl.load_model:264] [PID:6550] [RANK:2] patching with flash attention [2024-01-19 08:19:55,650] [INFO] [axolotl.load_model:276] [PID:6550] [RANK:2] patching with flash attention [2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:185] [PID:6554] [RANK:6] EOS: 32000 / <|im_end|> [2024-01-19 08:19:55,666] [DEBUG] 
[axolotl.load_tokenizer:186] [PID:6554] [RANK:6] BOS: 1 / [2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:187] [PID:6554] [RANK:6] PAD: 2 / [2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:188] [PID:6554] [RANK:6] UNK: 0 / [2024-01-19 08:19:55,666] [INFO] [axolotl.load_tokenizer:193] [PID:6554] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference. [2024-01-19 08:19:55,680] [INFO] [axolotl.load_model:264] [PID:6554] [RANK:6] patching with flash attention [2024-01-19 08:19:55,681] [INFO] [axolotl.load_model:276] [PID:6554] [RANK:6] patching with flash attention Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.86s/it] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.88s/it] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.88s/it] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.90s/it] [2024-01-19 08:21:53,674] [INFO] [axolotl.load_model:558] [PID:6554] [RANK:6] GPU memory usage after model load: 23.333GB (+0.636GB cache, +1.045GB misc) [2024-01-19 08:21:53,680] [INFO] [axolotl.load_model:581] [PID:6554] [RANK:6] converting PEFT model w/ prepare_model_for_kbit_training 
[2024-01-19 08:21:53,696] [INFO] [axolotl.load_model:593] [PID:6554] [RANK:6] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:53,702] [INFO] [axolotl.load_lora:698] [PID:6554] [RANK:6] found linear modules: ['k_proj', 'q_proj', 'o_proj', 'w3', 'v_proj', 'gate', 'w2', 'w1'] [2024-01-19 08:21:53,730] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6554] CUDA extension not installed. [2024-01-19 08:21:53,730] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6554] CUDA extension not installed. [2024-01-19 08:21:53,964] [INFO] [axolotl.load_model:558] [PID:6555] [RANK:7] GPU memory usage after model load: 23.333GB (+0.779GB cache, +1.006GB misc) Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.93s/it] [2024-01-19 08:21:53,970] [INFO] [axolotl.load_model:581] [PID:6555] [RANK:7] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:53,986] [INFO] [axolotl.load_model:593] [PID:6555] [RANK:7] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:53,992] [INFO] [axolotl.load_lora:698] [PID:6555] [RANK:7] found linear modules: ['w1', 'w2', 'v_proj', 'k_proj', 'w3', 'o_proj', 'gate', 'q_proj'] [2024-01-19 08:21:54,018] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6555] CUDA extension not installed. [2024-01-19 08:21:54,019] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6555] CUDA extension not installed. 
[2024-01-19 08:21:54,019] [INFO] [axolotl.load_model:558] [PID:6552] [RANK:4] GPU memory usage after model load: 23.333GB (+0.603GB cache, +1.045GB misc) [2024-01-19 08:21:54,026] [INFO] [axolotl.load_model:581] [PID:6552] [RANK:4] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:54,042] [INFO] [axolotl.load_model:593] [PID:6552] [RANK:4] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:54,047] [INFO] [axolotl.load_lora:698] [PID:6552] [RANK:4] found linear modules: ['v_proj', 'k_proj', 'gate', 'w1', 'q_proj', 'w3', 'w2', 'o_proj'] [2024-01-19 08:21:54,075] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6552] CUDA extension not installed. [2024-01-19 08:21:54,075] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6552] CUDA extension not installed. Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.94s/it] [2024-01-19 08:21:54,361] [INFO] [axolotl.load_model:558] [PID:6553] [RANK:5] GPU memory usage after model load: 23.333GB (+0.669GB cache, +1.045GB misc) [2024-01-19 08:21:54,368] [INFO] [axolotl.load_model:581] [PID:6553] [RANK:5] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:54,386] [INFO] [axolotl.load_model:593] [PID:6553] [RANK:5] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:54,393] [INFO] [axolotl.load_lora:698] [PID:6553] [RANK:5] found linear modules: ['k_proj', 'q_proj', 'w1', 'o_proj', 'gate', 'v_proj', 'w2', 'w3'] Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.94s/it] [2024-01-19 
08:21:54,421] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6553] CUDA extension not installed. [2024-01-19 08:21:54,421] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6553] CUDA extension not installed. Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:53<00:00, 5.97s/it] [2024-01-19 08:21:54,772] [INFO] [axolotl.load_model:558] [PID:6548] [RANK:0] GPU memory usage after model load: 23.333GB (+0.817GB cache, +1.162GB misc) [2024-01-19 08:21:54,779] [INFO] [axolotl.load_model:581] [PID:6548] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:54,795] [INFO] [axolotl.load_model:593] [PID:6548] [RANK:0] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:54,801] [INFO] [axolotl.load_lora:698] [PID:6548] [RANK:0] found linear modules: ['v_proj', 'w3', 'q_proj', 'k_proj', 'o_proj', 'w1', 'w2', 'gate'] [2024-01-19 08:21:54,827] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6548] CUDA extension not installed. [2024-01-19 08:21:54,827] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6548] CUDA extension not installed. 
[2024-01-19 08:21:55,145] [INFO] [axolotl.load_model:558] [PID:6550] [RANK:2] GPU memory usage after model load: 23.333GB (+0.722GB cache, +1.045GB misc) [2024-01-19 08:21:55,151] [INFO] [axolotl.load_model:581] [PID:6550] [RANK:2] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:55,167] [INFO] [axolotl.load_model:593] [PID:6550] [RANK:2] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:55,173] [INFO] [axolotl.load_lora:698] [PID:6550] [RANK:2] found linear modules: ['v_proj', 'q_proj', 'gate', 'k_proj', 'w3', 'w1', 'w2', 'o_proj'] [2024-01-19 08:21:55,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6550] CUDA extension not installed. [2024-01-19 08:21:55,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6550] CUDA extension not installed. [2024-01-19 08:21:55,259] [INFO] [axolotl.load_model:558] [PID:6551] [RANK:3] GPU memory usage after model load: 23.333GB (+0.706GB cache, +1.045GB misc) [2024-01-19 08:21:55,267] [INFO] [axolotl.load_model:581] [PID:6551] [RANK:3] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:55,284] [INFO] [axolotl.load_model:593] [PID:6551] [RANK:3] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:55,291] [INFO] [axolotl.load_lora:698] [PID:6551] [RANK:3] found linear modules: ['v_proj', 'w3', 'o_proj', 'k_proj', 'w1', 'gate', 'q_proj', 'w2'] [2024-01-19 08:21:55,321] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6551] CUDA extension not installed. [2024-01-19 08:21:55,321] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6551] CUDA extension not installed. 
[2024-01-19 08:21:55,519] [INFO] [axolotl.load_model:558] [PID:6549] [RANK:1] GPU memory usage after model load: 23.333GB (+0.595GB cache, +1.006GB misc) [2024-01-19 08:21:55,526] [INFO] [axolotl.load_model:581] [PID:6549] [RANK:1] converting PEFT model w/ prepare_model_for_kbit_training [2024-01-19 08:21:55,542] [INFO] [axolotl.load_model:593] [PID:6549] [RANK:1] converting modules to torch.bfloat16 for flash attention [2024-01-19 08:21:55,548] [INFO] [axolotl.load_lora:698] [PID:6549] [RANK:1] found linear modules: ['gate', 'w2', 'q_proj', 'w3', 'w1', 'v_proj', 'o_proj', 'k_proj'] [2024-01-19 08:21:55,575] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6549] CUDA extension not installed. [2024-01-19 08:21:55,576] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6549] CUDA extension not installed. trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:21:58,941] [INFO] [axolotl.load_model:625] [PID:6554] [RANK:6] GPU memory usage after adapters: 25.703GB (+0.062GB cache, +1.045GB misc) trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:21:59,286] [INFO] [axolotl.load_model:625] [PID:6555] [RANK:7] GPU memory usage after adapters: 25.704GB (+0.067GB cache, +1.006GB misc) [2024-01-19 08:21:59,297] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,297] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,298] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,299] [INFO] 
[axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,322] [INFO] [axolotl.load_model:625] [PID:6552] [RANK:4] GPU memory usage after adapters: 25.694GB (+0.078GB cache, +1.045GB misc) [2024-01-19 08:21:59,644] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,645] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,646] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,646] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,662] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,663] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,664] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:21:59,664] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:21:59,758] [INFO] [axolotl.load_model:625] [PID:6553] [RANK:5] GPU memory usage after adapters: 25.701GB (+0.079GB cache, +1.045GB misc) trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 
[2024-01-19 08:22:00,088] [INFO] [axolotl.load_model:625] [PID:6548] [RANK:0] GPU memory usage after adapters: 25.701GB (+0.071GB cache, +1.162GB misc) [2024-01-19 08:22:00,097] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,098] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,099] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,100] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,133] [INFO] [axolotl.train.log:60] [PID:6548] [RANK:0] Pre-saving adapter config to /workspace/axolotl/output/Nous-Hermes-2-Mixtral-8x7B-SFT-CyberGPT [2024-01-19 08:22:00,136] [INFO] [axolotl.train.log:60] [PID:6548] [RANK:0] Starting trainer... 
[2024-01-19 08:22:00,464] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,465] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,466] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,466] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:22:00,602] [INFO] [axolotl.load_model:625] [PID:6551] [RANK:3] GPU memory usage after adapters: 25.705GB (+0.072GB cache, +1.045GB misc) trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:22:00,712] [INFO] [axolotl.load_model:625] [PID:6550] [RANK:2] GPU memory usage after adapters: 25.699GB (+0.075GB cache, +1.045GB misc) [2024-01-19 08:22:00,938] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,939] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,940] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:00,940] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438 [2024-01-19 08:22:01,054] [INFO] 
[axolotl.load_model:625] [PID:6549] [RANK:1] GPU memory usage after adapters: 25.695GB (+0.069GB cache, +1.006GB misc) [2024-01-19 08:22:01,075] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,076] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,076] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,077] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,762] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,763] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,763] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:01,764] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... 
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root... Detected CUDA files, patching ldflags Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja... Building extension module fused_adam... Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N) ninja: no work to do. Loading extension module fused_adam... Time to load fused_adam op: 0.08674335479736328 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.10159158706665039 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.10232019424438477 seconds Time to load fused_adam op: 0.10149574279785156 seconds Loading extension module fused_adam... Time to load fused_adam op: 0.10169720649719238 seconds Loading extension module fused_adam... Loading extension module fused_adam... Time to load fused_adam op: 0.10174274444580078 seconds Time to load fused_adam op: 0.10324215888977051 seconds Loading extension module fused_adam... 
Time to load fused_adam op: 0.10189294815063477 seconds Parameter Offload: Total persistent parameters: 2895872 in 193 params [2024-01-19 08:22:14,416] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,417] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,432] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,433] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,435] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,436] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,438] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,439] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,455] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,456] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 [2024-01-19 08:22:14,456] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375 
[2024-01-19 08:22:14,457] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,467] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,469] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
0%| | 0/16 [00:00
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 142, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2746, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1983, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 279, in backward
    q, k, v, out, softmax_lse, cu_seqlens, rng_state = ctx.saved_tensors
RuntimeError: !grad_accumulator_.expired() INTERNAL ASSERT FAILED at "../torch/csrc/autograd/saved_variable.cpp":226, please report a bug to PyTorch. No grad accumulator for a saved leaf

DeepSpeed ZeRO-3 config.json:


{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 2e9,
    "stage3_max_reuse_distance": 2e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}

### Config yaml

_No response_

### Possible solution

_No response_

### Which Operating Systems are you using?

- [X] Linux
- [ ] macOS
- [ ] Windows

### Python Version

3.10

### axolotl branch-commit

main

### Acknowledgements

- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
vip-china commented 8 months ago

To add: there were no errors during training when using ZeRO-2, but new errors appeared after training finished. Is Mixtral not supported? (screenshot: 企业微信截图_1705731459109)
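
For anyone reproducing the ZeRO-2 comparison, here is a minimal sketch of deriving a stage-2 config from the ZeRO-3 JSON posted above. The helper name `to_zero2` is hypothetical and not part of axolotl or DeepSpeed. It assumes the only changes needed are setting `stage` to 2, dropping `offload_param` (parameter offload is a stage-3 feature), and removing the `stage3_*` keys. The exact ZeRO-2 config used in the run above was not posted, so this may differ from it.

```python
import copy
import json


def to_zero2(ds_config: dict) -> dict:
    """Return a copy of a DeepSpeed config switched from ZeRO-3 to ZeRO-2.

    Hypothetical helper: it only edits the "zero_optimization" section and
    leaves every other key (optimizer, scheduler, bf16, ...) untouched.
    """
    cfg = copy.deepcopy(ds_config)
    zero = cfg.setdefault("zero_optimization", {})
    zero["stage"] = 2
    # Parameter offload applies to stage 3 only, so drop it for stage 2.
    zero.pop("offload_param", None)
    # Assumption: keys prefixed with "stage3_" are not needed in stage 2.
    for key in [k for k in zero if k.startswith("stage3_")]:
        del zero[key]
    return cfg


if __name__ == "__main__":
    # Trimmed-down version of the ZeRO-3 config from this issue.
    zero3 = {
        "bf16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            "offload_param": {"device": "cpu", "pin_memory": True},
            "overlap_comm": True,
            "contiguous_gradients": True,
            "stage3_max_live_parameters": 2e9,
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        "train_micro_batch_size_per_gpu": "auto",
    }
    print(json.dumps(to_zero2(zero3), indent=2))
```

The input dict is deep-copied, so the original ZeRO-3 config stays available for a side-by-side run.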