### Complete error output
root@f2e11ed3bbe4:/workspace/axolotl# sh qb.sh
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `8`
More than one GPU was found, enabling multi-GPU training.
If this was unintended please pass in `--num_processes=1`.
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
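(Side note: this warning is benign, but it can be silenced by passing the values explicitly instead of relying on defaults. A minimal sketch — the config filename `qb.yml` is a stand-in for whatever `qb.sh` actually invokes:)

```bash
# Pass the launch parameters explicitly; values mirror the defaults reported above.
accelerate launch \
  --num_processes 8 \
  --num_machines 1 \
  --mixed_precision no \
  --dynamo_backend no \
  -m axolotl.cli.train qb.yml
# Alternatively, run `accelerate config` once to persist these answers.
```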
/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:106: UserWarning:
================================================================================
WARNING: Manual override via BNB_CUDA_VERSION env variable detected!
BNB_CUDA_VERSION=XXX can be used to load a bitsandbytes version that is different from the PyTorch CUDA version.
If this was unintended set the BNB_CUDA_VERSION variable to an empty string: export BNB_CUDA_VERSION=
If you use the manual override make sure the right libcudart.so is in your LD_LIBRARY_PATH
For example by adding the following to your .bashrc: export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:
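(The warning text already names the remedy; spelled out as a sketch — the CUDA library path below is an assumed CUDA 11.8 location, adjust to the actual install:)

```bash
# Clear the manual override so bitsandbytes follows the PyTorch CUDA build:
export BNB_CUDA_VERSION=
# Or, if the override is intentional, expose the matching libcudart.so
# (assumed path for a CUDA 11.8 install; adjust as needed):
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.8/lib64
```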
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:186] [PID:6549] [RANK:1] BOS: 1 /
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:187] [PID:6549] [RANK:1] PAD: 2 /
[2024-01-19 08:19:49,064] [DEBUG] [axolotl.load_tokenizer:188] [PID:6549] [RANK:1] UNK: 0 /
[2024-01-19 08:19:49,064] [INFO] [axolotl.load_tokenizer:193] [PID:6549] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:185] [PID:6555] [RANK:7] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:186] [PID:6555] [RANK:7] BOS: 1 /
[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:187] [PID:6555] [RANK:7] PAD: 2 /
[2024-01-19 08:19:49,070] [DEBUG] [axolotl.load_tokenizer:188] [PID:6555] [RANK:7] UNK: 0 /
[2024-01-19 08:19:49,070] [INFO] [axolotl.load_tokenizer:193] [PID:6555] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:185] [PID:6554] [RANK:6] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:186] [PID:6554] [RANK:6] BOS: 1 /
[2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:187] [PID:6554] [RANK:6] PAD: 2 /
[2024-01-19 08:19:49,071] [DEBUG] [axolotl.load_tokenizer:188] [PID:6554] [RANK:6] UNK: 0 /
[2024-01-19 08:19:49,071] [INFO] [axolotl.load_tokenizer:193] [PID:6554] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:185] [PID:6548] [RANK:0] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:186] [PID:6548] [RANK:0] BOS: 1 /
[2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:187] [PID:6548] [RANK:0] PAD: 2 /
[2024-01-19 08:19:49,077] [DEBUG] [axolotl.load_tokenizer:188] [PID:6548] [RANK:0] UNK: 0 /
[2024-01-19 08:19:49,077] [INFO] [axolotl.load_tokenizer:193] [PID:6548] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,078] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6548] [RANK:0] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:49,082] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6548] [RANK:0] Prepared dataset loaded from disk...
[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:185] [PID:6551] [RANK:3] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:186] [PID:6551] [RANK:3] BOS: 1 /
[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:187] [PID:6551] [RANK:3] PAD: 2 /
[2024-01-19 08:19:49,084] [DEBUG] [axolotl.load_tokenizer:188] [PID:6551] [RANK:3] UNK: 0 /
[2024-01-19 08:19:49,084] [INFO] [axolotl.load_tokenizer:193] [PID:6551] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:185] [PID:6553] [RANK:5] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:186] [PID:6553] [RANK:5] BOS: 1 /
[2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:187] [PID:6553] [RANK:5] PAD: 2 /
[2024-01-19 08:19:49,096] [DEBUG] [axolotl.load_tokenizer:188] [PID:6553] [RANK:5] UNK: 0 /
[2024-01-19 08:19:49,096] [INFO] [axolotl.load_tokenizer:193] [PID:6553] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:185] [PID:6550] [RANK:2] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:186] [PID:6550] [RANK:2] BOS: 1 /
[2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:187] [PID:6550] [RANK:2] PAD: 2 /
[2024-01-19 08:19:49,122] [DEBUG] [axolotl.load_tokenizer:188] [PID:6550] [RANK:2] UNK: 0 /
[2024-01-19 08:19:49,122] [INFO] [axolotl.load_tokenizer:193] [PID:6550] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:185] [PID:6552] [RANK:4] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:186] [PID:6552] [RANK:4] BOS: 1 /
[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:187] [PID:6552] [RANK:4] PAD: 2 /
[2024-01-19 08:19:49,149] [DEBUG] [axolotl.load_tokenizer:188] [PID:6552] [RANK:4] UNK: 0 /
[2024-01-19 08:19:49,149] [INFO] [axolotl.load_tokenizer:193] [PID:6552] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference.
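(The repeated "No Chat template selected" INFO is harmless for training, but since this tokenizer already uses ChatML tokens (EOS `<|im_end|>`), the hint can be followed in the training YAML. A hedged sketch, assuming `chat_template: chatml` is a supported config key in this axolotl version and `qb.yml` is the config name:)

```bash
# Assumption: `chat_template: chatml` is valid for this axolotl version.
# It embeds the template in the saved tokenizer for easier inference later.
printf 'chat_template: chatml\n' >> qb.yml
```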
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6552] [RANK:4] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6553] [RANK:5] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6551] [RANK:3] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6550] [RANK:2] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6549] [RANK:1] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6554] [RANK:6] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,532] [INFO] [axolotl.load_tokenized_prepared_datasets:143] [PID:6555] [RANK:7] Loading prepared dataset from disk at last_run_prepared/f88c6beff226b66e15a58656acc22e54...
[2024-01-19 08:19:50,535] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6551] [RANK:3] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6549] [RANK:1] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6552] [RANK:4] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6553] [RANK:5] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,536] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6550] [RANK:2] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,537] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6555] [RANK:7] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,537] [INFO] [axolotl.load_tokenized_prepared_datasets:145] [PID:6554] [RANK:6] Prepared dataset loaded from disk...
[2024-01-19 08:19:50,970] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] total_num_tokens: 515000
[2024-01-19 08:19:50,982] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] `total_supervised_tokens: 389502`
[2024-01-19 08:19:54,923] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:54,923] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] data_loader_len: 119
[2024-01-19 08:19:55,127] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,134] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,158] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,181] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,230] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,287] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,539] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 1.0 total_num_tokens per device: 64375
[2024-01-19 08:19:55,569] [INFO] [axolotl.log:60] [PID:6548] [RANK:0] sample_packing_eff_est across ranks: [0.904549777507782, 0.8917192816734314, 0.8980887532234192, 0.904549777507782, 0.8917192816734314, 0.8980887532234192, 0.904549777507782, 0.8980887532234192]
[2024-01-19 08:19:55,570] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] sample_packing_eff_est: 0.91
[2024-01-19 08:19:55,570] [DEBUG] [axolotl.log:60] [PID:6548] [RANK:0] total_num_steps: 14
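(Sanity check on the numbers above — arithmetic only, not taken from the log: 515,000 total tokens split across 8 ranks gives the reported per-device count, and the 0.91 estimate appears to be the per-rank maximum rounded up to two decimals:)

$$
\frac{515000}{8} = 64375 \ \text{tokens/device}, \qquad
\lceil 0.904549\ldots \times 100 \rceil / 100 = 0.91
$$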
[2024-01-19 08:19:55,577] [DEBUG] [axolotl.train.log:60] [PID:6548] [RANK:0] loading tokenizer... /data1/ljf2/data/Nous-Hermes-2-Mixtral-8x7B-SFT
[2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:185] [PID:6548] [RANK:0] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:186] [PID:6548] [RANK:0] BOS: 1 /
[2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:187] [PID:6548] [RANK:0] PAD: 2 /
[2024-01-19 08:19:55,628] [DEBUG] [axolotl.load_tokenizer:188] [PID:6548] [RANK:0] UNK: 0 /
[2024-01-19 08:19:55,628] [INFO] [axolotl.load_tokenizer:193] [PID:6548] [RANK:0] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,629] [DEBUG] [axolotl.train.log:60] [PID:6548] [RANK:0] loading model and peft_config...
[2024-01-19 08:19:55,637] [INFO] [axolotl.load_model:264] [PID:6548] [RANK:0] patching with flash attention
[2024-01-19 08:19:55,637] [INFO] [axolotl.load_model:276] [PID:6548] [RANK:0] patching with flash attention
[2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:185] [PID:6552] [RANK:4] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:186] [PID:6552] [RANK:4] BOS: 1 /
[2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:187] [PID:6552] [RANK:4] PAD: 2 /
[2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:188] [PID:6552] [RANK:4] UNK: 0 /
[2024-01-19 08:19:55,638] [INFO] [axolotl.load_tokenizer:193] [PID:6552] [RANK:4] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,638] [DEBUG] [axolotl.load_tokenizer:185] [PID:6549] [RANK:1] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6549] [RANK:1] BOS: 1 /
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6549] [RANK:1] PAD: 2 /
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:188] [PID:6549] [RANK:1] UNK: 0 /
[2024-01-19 08:19:55,639] [INFO] [axolotl.load_tokenizer:193] [PID:6549] [RANK:1] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:185] [PID:6555] [RANK:7] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6555] [RANK:7] BOS: 1 /
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:185] [PID:6553] [RANK:5] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6555] [RANK:7] PAD: 2 /
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:186] [PID:6553] [RANK:5] BOS: 1 /
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:188] [PID:6555] [RANK:7] UNK: 0 /
[2024-01-19 08:19:55,639] [DEBUG] [axolotl.load_tokenizer:187] [PID:6553] [RANK:5] PAD: 2 /
[2024-01-19 08:19:55,639] [INFO] [axolotl.load_tokenizer:193] [PID:6555] [RANK:7] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:188] [PID:6553] [RANK:5] UNK: 0 /
[2024-01-19 08:19:55,640] [INFO] [axolotl.load_tokenizer:193] [PID:6553] [RANK:5] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:185] [PID:6551] [RANK:3] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:186] [PID:6551] [RANK:3] BOS: 1 /
[2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:187] [PID:6551] [RANK:3] PAD: 2 /
[2024-01-19 08:19:55,640] [DEBUG] [axolotl.load_tokenizer:188] [PID:6551] [RANK:3] UNK: 0 /
[2024-01-19 08:19:55,640] [INFO] [axolotl.load_tokenizer:193] [PID:6551] [RANK:3] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,641] [DEBUG] [axolotl.load_tokenizer:185] [PID:6550] [RANK:2] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:186] [PID:6550] [RANK:2] BOS: 1 /
[2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:187] [PID:6550] [RANK:2] PAD: 2 /
[2024-01-19 08:19:55,642] [DEBUG] [axolotl.load_tokenizer:188] [PID:6550] [RANK:2] UNK: 0 /
[2024-01-19 08:19:55,642] [INFO] [axolotl.load_tokenizer:193] [PID:6550] [RANK:2] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,646] [INFO] [axolotl.load_model:264] [PID:6549] [RANK:1] patching with flash attention
[2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:276] [PID:6549] [RANK:1] patching with flash attention
[2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:264] [PID:6552] [RANK:4] patching with flash attention
[2024-01-19 08:19:55,647] [INFO] [axolotl.load_model:276] [PID:6552] [RANK:4] patching with flash attention
[2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6553] [RANK:5] patching with flash attention
[2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6555] [RANK:7] patching with flash attention
[2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:264] [PID:6551] [RANK:3] patching with flash attention
[2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6553] [RANK:5] patching with flash attention
[2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6555] [RANK:7] patching with flash attention
[2024-01-19 08:19:55,648] [INFO] [axolotl.load_model:276] [PID:6551] [RANK:3] patching with flash attention
[2024-01-19 08:19:55,650] [INFO] [axolotl.load_model:264] [PID:6550] [RANK:2] patching with flash attention
[2024-01-19 08:19:55,650] [INFO] [axolotl.load_model:276] [PID:6550] [RANK:2] patching with flash attention
[2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:185] [PID:6554] [RANK:6] EOS: 32000 / <|im_end|>
[2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:186] [PID:6554] [RANK:6] BOS: 1 /
[2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:187] [PID:6554] [RANK:6] PAD: 2 /
[2024-01-19 08:19:55,666] [DEBUG] [axolotl.load_tokenizer:188] [PID:6554] [RANK:6] UNK: 0 /
[2024-01-19 08:19:55,666] [INFO] [axolotl.load_tokenizer:193] [PID:6554] [RANK:6] No Chat template selected. Consider adding a chat template for easier inference.
[2024-01-19 08:19:55,680] [INFO] [axolotl.load_model:264] [PID:6554] [RANK:6] patching with flash attention
[2024-01-19 08:19:55,681] [INFO] [axolotl.load_model:276] [PID:6554] [RANK:6] patching with flash attention
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.86s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.88s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:51<00:00, 5.88s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.90s/it]
[2024-01-19 08:21:53,674] [INFO] [axolotl.load_model:558] [PID:6554] [RANK:6] GPU memory usage after model load: 23.333GB (+0.636GB cache, +1.045GB misc)
[2024-01-19 08:21:53,680] [INFO] [axolotl.load_model:581] [PID:6554] [RANK:6] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:53,696] [INFO] [axolotl.load_model:593] [PID:6554] [RANK:6] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:53,702] [INFO] [axolotl.load_lora:698] [PID:6554] [RANK:6] found linear modules: ['k_proj', 'q_proj', 'o_proj', 'w3', 'v_proj', 'gate', 'w2', 'w1']
[2024-01-19 08:21:53,730] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6554] CUDA extension not installed.
[2024-01-19 08:21:53,730] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6554] CUDA extension not installed.
[2024-01-19 08:21:53,964] [INFO] [axolotl.load_model:558] [PID:6555] [RANK:7] GPU memory usage after model load: 23.333GB (+0.779GB cache, +1.006GB misc)
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.93s/it]
[2024-01-19 08:21:53,970] [INFO] [axolotl.load_model:581] [PID:6555] [RANK:7] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:53,986] [INFO] [axolotl.load_model:593] [PID:6555] [RANK:7] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:53,992] [INFO] [axolotl.load_lora:698] [PID:6555] [RANK:7] found linear modules: ['w1', 'w2', 'v_proj', 'k_proj', 'w3', 'o_proj', 'gate', 'q_proj']
[2024-01-19 08:21:54,018] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6555] CUDA extension not installed.
[2024-01-19 08:21:54,019] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6555] CUDA extension not installed.
[2024-01-19 08:21:54,019] [INFO] [axolotl.load_model:558] [PID:6552] [RANK:4] GPU memory usage after model load: 23.333GB (+0.603GB cache, +1.045GB misc)
[2024-01-19 08:21:54,026] [INFO] [axolotl.load_model:581] [PID:6552] [RANK:4] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:54,042] [INFO] [axolotl.load_model:593] [PID:6552] [RANK:4] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:54,047] [INFO] [axolotl.load_lora:698] [PID:6552] [RANK:4] found linear modules: ['v_proj', 'k_proj', 'gate', 'w1', 'q_proj', 'w3', 'w2', 'o_proj']
[2024-01-19 08:21:54,075] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6552] CUDA extension not installed.
[2024-01-19 08:21:54,075] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6552] CUDA extension not installed.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.94s/it]
[2024-01-19 08:21:54,361] [INFO] [axolotl.load_model:558] [PID:6553] [RANK:5] GPU memory usage after model load: 23.333GB (+0.669GB cache, +1.045GB misc)
[2024-01-19 08:21:54,368] [INFO] [axolotl.load_model:581] [PID:6553] [RANK:5] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:54,386] [INFO] [axolotl.load_model:593] [PID:6553] [RANK:5] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:54,393] [INFO] [axolotl.load_lora:698] [PID:6553] [RANK:5] found linear modules: ['k_proj', 'q_proj', 'w1', 'o_proj', 'gate', 'v_proj', 'w2', 'w3']
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:52<00:00, 5.94s/it]
[2024-01-19 08:21:54,421] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6553] CUDA extension not installed.
[2024-01-19 08:21:54,421] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6553] CUDA extension not installed.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [01:53<00:00, 5.97s/it]
[2024-01-19 08:21:54,772] [INFO] [axolotl.load_model:558] [PID:6548] [RANK:0] GPU memory usage after model load: 23.333GB (+0.817GB cache, +1.162GB misc)
[2024-01-19 08:21:54,779] [INFO] [axolotl.load_model:581] [PID:6548] [RANK:0] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:54,795] [INFO] [axolotl.load_model:593] [PID:6548] [RANK:0] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:54,801] [INFO] [axolotl.load_lora:698] [PID:6548] [RANK:0] found linear modules: ['v_proj', 'w3', 'q_proj', 'k_proj', 'o_proj', 'w1', 'w2', 'gate']
[2024-01-19 08:21:54,827] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6548] CUDA extension not installed.
[2024-01-19 08:21:54,827] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6548] CUDA extension not installed.
[2024-01-19 08:21:55,145] [INFO] [axolotl.load_model:558] [PID:6550] [RANK:2] GPU memory usage after model load: 23.333GB (+0.722GB cache, +1.045GB misc)
[2024-01-19 08:21:55,151] [INFO] [axolotl.load_model:581] [PID:6550] [RANK:2] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:55,167] [INFO] [axolotl.load_model:593] [PID:6550] [RANK:2] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:55,173] [INFO] [axolotl.load_lora:698] [PID:6550] [RANK:2] found linear modules: ['v_proj', 'q_proj', 'gate', 'k_proj', 'w3', 'w1', 'w2', 'o_proj']
[2024-01-19 08:21:55,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6550] CUDA extension not installed.
[2024-01-19 08:21:55,201] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6550] CUDA extension not installed.
[2024-01-19 08:21:55,259] [INFO] [axolotl.load_model:558] [PID:6551] [RANK:3] GPU memory usage after model load: 23.333GB (+0.706GB cache, +1.045GB misc)
[2024-01-19 08:21:55,267] [INFO] [axolotl.load_model:581] [PID:6551] [RANK:3] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:55,284] [INFO] [axolotl.load_model:593] [PID:6551] [RANK:3] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:55,291] [INFO] [axolotl.load_lora:698] [PID:6551] [RANK:3] found linear modules: ['v_proj', 'w3', 'o_proj', 'k_proj', 'w1', 'gate', 'q_proj', 'w2']
[2024-01-19 08:21:55,321] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6551] CUDA extension not installed.
[2024-01-19 08:21:55,321] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6551] CUDA extension not installed.
[2024-01-19 08:21:55,519] [INFO] [axolotl.load_model:558] [PID:6549] [RANK:1] GPU memory usage after model load: 23.333GB (+0.595GB cache, +1.006GB misc)
[2024-01-19 08:21:55,526] [INFO] [axolotl.load_model:581] [PID:6549] [RANK:1] converting PEFT model w/ prepare_model_for_kbit_training
[2024-01-19 08:21:55,542] [INFO] [axolotl.load_model:593] [PID:6549] [RANK:1] converting modules to torch.bfloat16 for flash attention
[2024-01-19 08:21:55,548] [INFO] [axolotl.load_lora:698] [PID:6549] [RANK:1] found linear modules: ['gate', 'w2', 'q_proj', 'w3', 'w1', 'v_proj', 'o_proj', 'k_proj']
[2024-01-19 08:21:55,575] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda.:16] [PID:6549] CUDA extension not installed.
[2024-01-19 08:21:55,576] [WARNING] [auto_gptq.nn_modules.qlinear.qlinear_cuda_old.:15] [PID:6549] CUDA extension not installed.
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:21:58,941] [INFO] [axolotl.load_model:625] [PID:6554] [RANK:6] GPU memory usage after adapters: 25.703GB (+0.062GB cache, +1.045GB misc)
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:21:59,286] [INFO] [axolotl.load_model:625] [PID:6555] [RANK:7] GPU memory usage after adapters: 25.704GB (+0.067GB cache, +1.006GB misc)
[2024-01-19 08:21:59,297] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,297] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,298] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,299] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,322] [INFO] [axolotl.load_model:625] [PID:6552] [RANK:4] GPU memory usage after adapters: 25.694GB (+0.078GB cache, +1.045GB misc)
[2024-01-19 08:21:59,644] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,645] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,646] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,646] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,662] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,663] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,664] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:21:59,664] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:21:59,758] [INFO] [axolotl.load_model:625] [PID:6553] [RANK:5] GPU memory usage after adapters: 25.701GB (+0.079GB cache, +1.045GB misc)
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:22:00,088] [INFO] [axolotl.load_model:625] [PID:6548] [RANK:0] GPU memory usage after adapters: 25.701GB (+0.071GB cache, +1.162GB misc)
[2024-01-19 08:22:00,097] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,098] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,099] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,100] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,133] [INFO] [axolotl.train.log:60] [PID:6548] [RANK:0] Pre-saving adapter config to /workspace/axolotl/output/Nous-Hermes-2-Mixtral-8x7B-SFT-CyberGPT
[2024-01-19 08:22:00,136] [INFO] [axolotl.train.log:60] [PID:6548] [RANK:0] Starting trainer...
[2024-01-19 08:22:00,464] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,465] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,466] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,466] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:22:00,602] [INFO] [axolotl.load_model:625] [PID:6551] [RANK:3] GPU memory usage after adapters: 25.705GB (+0.072GB cache, +1.045GB misc)
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:22:00,712] [INFO] [axolotl.load_model:625] [PID:6550] [RANK:2] GPU memory usage after adapters: 25.699GB (+0.075GB cache, +1.045GB misc)
[2024-01-19 08:22:00,938] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,939] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,940] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:00,940] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
trainable params: 746,610,688 || all params: 47,449,419,776 || trainable%: 1.5734874978969438
[2024-01-19 08:22:01,054] [INFO] [axolotl.load_model:625] [PID:6549] [RANK:1] GPU memory usage after adapters: 25.695GB (+0.069GB cache, +1.006GB misc)
[2024-01-19 08:22:01,075] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,076] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,076] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,077] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,762] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,763] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,763] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:01,764] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Using /root/.cache/torch_extensions/py310_cu118 as PyTorch extensions root...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu118/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module fused_adam...
Time to load fused_adam op: 0.08674335479736328 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10159158706665039 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.10232019424438477 seconds
Time to load fused_adam op: 0.10149574279785156 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10169720649719238 seconds
Loading extension module fused_adam...
Loading extension module fused_adam...
Time to load fused_adam op: 0.10174274444580078 seconds
Time to load fused_adam op: 0.10324215888977051 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 0.10189294815063477 seconds
Parameter Offload: Total persistent parameters: 2895872 in 193 params
[2024-01-19 08:22:14,416] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,417] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6555] [RANK:7] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,432] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,433] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6553] [RANK:5] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,435] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,436] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6552] [RANK:4] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,438] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,439] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6551] [RANK:3] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,455] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,456] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6550] [RANK:2] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,456] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,457] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6554] [RANK:6] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,467] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,469] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6549] [RANK:1] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
0%| | 0/16 [00:00&lt;?, ?it/s][2024-01-19 08:22:14,876] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
[2024-01-19 08:22:14,876] [INFO] [axolotl.utils.samplers.multipack._len_est:178] [PID:6548] [RANK:0] packing_efficiency_estimate: 0.91 total_num_tokens per device: 64375
Traceback (most recent call last):
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/miniconda3/envs/py3.10/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 42, in <module>
    fire.Fire(do_cli)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/workspace/axolotl/src/axolotl/cli/train.py", line 38, in do_cli
    train(cfg=parsed_cfg, cli_args=parsed_cli_args, dataset_meta=dataset_meta)
  File "/workspace/axolotl/src/axolotl/train.py", line 142, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1543, in train
    return inner_training_loop(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/transformers/trainer.py", line 2746, in training_step
    self.accelerator.backward(loss)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/accelerator.py", line 1983, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1955, in backward
    self.optimizer.backward(loss, retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 2135, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/torch/autograd/function.py", line 274, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/py3.10/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 279, in backward
    q, k, v, out, softmax_lse, cu_seqlens, rng_state = ctx.saved_tensors
RuntimeError: !grad_accumulator_.expired() INTERNAL ASSERT FAILED at "../torch/csrc/autograd/saved_variable.cpp":226, please report a bug to PyTorch. No grad accumulator for a saved leaf
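(The assert fires in flash-attn's backward when a tensor it saved for the backward pass has lost its grad accumulator, which points at an interaction with ZeRO-3 parameter partitioning rather than the model itself. One isolation step — hedged: it assumes the axolotl repo ships a `deepspeed/zero2.json` and that its fire-based CLI accepts a `--deepspeed` override, as its README examples do:)

```bash
# Swap ZeRO-3 for ZeRO-2: if training proceeds, the failure is specific to
# stage-3 partitioning invalidating flash-attn's saved tensors.
accelerate launch -m axolotl.cli.train qb.yml --deepspeed deepspeed/zero2.json
```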
### deepspeed zero3 config.json
{
  "bf16": {
    "enabled": true
  },
  "optimizer": {
    "type": "AdamW",
    "params": {
      "lr": "auto",
      "betas": "auto",
      "eps": "auto",
      "weight_decay": "auto"
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "last_batch_iteration": -1,
      "total_num_steps": "auto",
      "warmup_min_lr": "auto",
      "warmup_max_lr": "auto",
      "warmup_num_steps": "auto"
    }
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu",
      "pin_memory": true
    },
    "offload_param": {
      "device": "cpu",
      "pin_memory": true
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "sub_group_size": 1e9,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_max_live_parameters": 2e9,
    "stage3_max_reuse_distance": 2e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "steps_per_print": 2000,
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
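(For completeness, this JSON only takes effect if it is referenced from the run. A sketch of the wiring — assumption: `deepspeed:` is the axolotl YAML key for this, and the file lives at `deepspeed/zero3.json` relative to the repo root:)

```bash
# Point the training YAML at the ZeRO-3 JSON shown above:
cat >> qb.yml <<'EOF'
deepspeed: deepspeed/zero3.json
EOF
```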
### Config yaml
_No response_
### Possible solution
_No response_
### Which Operating Systems are you using?
- [X] Linux
- [ ] macOS
- [ ] Windows
### Python Version
3.10
### axolotl branch-commit
main
### Acknowledgements
- [X] My issue title is concise, descriptive, and in title casing.
- [X] I have searched the existing issues to make sure this bug has not been reported yet.
- [X] I am using the latest version of axolotl.
- [X] I have provided enough information for the maintainers to reproduce and diagnose the issue.
Please check that this issue hasn't been reported before.

### Expected Behavior

SFT training runs to completion successfully.

### Current behaviour

SFT training cannot be performed; it fails with the RuntimeError shown in the log above. I then attempted to disable flash_attn, but the same error was reported again.
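(One way to disable flash attention for such a retry — a sketch; it assumes axolotl's fire-based CLI merges `--key=value` overrides into the YAML config, and `qb.yml` stands in for the real config name:)

```bash
# Retry with flash attention disabled via a config override
# (equivalently, set `flash_attention: false` in the YAML itself):
accelerate launch -m axolotl.cli.train qb.yml --flash_attention=False
```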
### Steps to reproduce

config.yml:

### Complete error output

See the full log at the top of this issue.