Training Freeze after "Shuffle merged datasets" (and adding position ids)

Please check that this issue hasn't been reported before.

[X] I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

Training should proceed without issue when the following command is run:

accelerate launch --use_deepspeed -m axolotl.cli.train axolotl_bittensor_llama3_finetuning.yaml

Current behaviour

The datasets tokenize, but training invariably fails to begin: axolotl freezes right after the "Shuffle merged datasets" step.

This is on the second or third run. On the first run, axolotl froze after the "adding position ids" step instead.

Config:

base_model: meta-llama/Meta-Llama-3-8B
# Heralax/bittensor-mistral-pretrained-base-1
#mistralai/Mistral-7B-v0.1
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer
is_mistral_derived_model: false

load_in_8bit: false
load_in_4bit: false
strict: false

datasets:
  - path: json
    data_files: ./essays_annotation_syspromptvaried.jsonl
    ds_type: json
    type: sharegpt
    conversation: chatml
  - path: json
    data_files: ./tweets_annotation_syspromptvaried.jsonl
    ds_type: json
    type: sharegpt
    conversation: chatml
  - path: json
    data_files: ./autometa_4_percent.json
    ds_type: json
    type: sharegpt
    conversation: chatml
  # - path: json
  #   data_files: paul_graham_essays_completion.json
  #   ds_type: json
  #   type: completion

dataset_prepared_path: last_run_prepared
output_dir: ./paulgraham-finetune-out

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true
shuffle_merged_datasets: true

wandb_project: pg-test
wandb_entity:
wandb_watch:
wandb_run_id:
wandb_log_model:

gradient_accumulation_steps: 6
micro_batch_size: 2
eval_batch_size: 1
num_epochs: 7
optimizer: paged_adamw_8bit
lr_scheduler: cosine
learning_rate: 0.000029
weight_decay: 0
# Gradient clipping max norm
max_grad_norm: 1.0
noisy_embedding_alpha: 0

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: unsloth
early_stopping_patience:
resume_from_checkpoint: 
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

# fsdp:
  # - full_shard
  # - auto_wrap
# fsdp_config:
  # fsdp_offload_params: false
  # fsdp_state_dict_type: FULL_STATE_DICT
  # fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
# warmup_steps: 10
warmup_ratio: 0.5
auto_resume_from_checkpoints: false
#warmup_ratio: 0.5
eval_steps: 10
saves_per_epoch: 1
eval_sample_packing: false
save_total_limit: 2
debug:
deepspeed: deepspeed_configs/zero2.json
chat_template: chatml
special_tokens:
  bos_token: "<s>"
  eos_token: "</s>"
  unk_token: "<unk>"
tokens:
  - "<|im_start|>"
  - "<|im_end|>"

Oddly enough, I do not see a last_run_prepared folder, even though dataset_prepared_path points to it.
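
To double-check the missing folder, something like the following can be run from the directory where the training command was launched. This is only a minimal sketch: the helper name summarize_prepared_path is hypothetical, and it assumes the relative dataset_prepared_path is resolved against the launch directory.

```python
from pathlib import Path


def summarize_prepared_path(prepared_path: str) -> dict:
    """Report whether a dataset cache directory exists and list its entries."""
    path = Path(prepared_path)
    if not path.is_dir():
        # Missing (or not a directory): nothing was cached at this location.
        return {"exists": False, "entries": []}
    # Sort the names so the output is stable between runs.
    entries = sorted(child.name for child in path.iterdir())
    return {"exists": True, "entries": entries}


if __name__ == "__main__":
    # "last_run_prepared" is the dataset_prepared_path value from the config.
    # Assumption: it is resolved relative to the current working directory.
    print(summarize_prepared_path("last_run_prepared"))
```

If this reports that the folder does not exist, the prepared dataset was either never written or was written somewhere else.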

This is on 6x A40 GPUs rented from RunPod, using the official axolotl Docker image.
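
For context on the scale of the run, the batch settings in the config combine with the GPU count as sketched below. This assumes plain data parallelism across all 6 GPUs (ZeRO stage 2 does not change the per-step batch) and that every packed row is padded to sequence_len; the function name is hypothetical.

```python
def effective_batch_size(micro_batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Sequences consumed per optimizer step under data parallelism."""
    return micro_batch_size * grad_accum_steps * num_gpus


# Values taken from the config and the hardware description in this report.
micro_batch_size = 2
gradient_accumulation_steps = 6
num_gpus = 6          # assumption: all six A40s participate in training
sequence_len = 4096   # with pad_to_sequence_len, each row has this many tokens

sequences_per_step = effective_batch_size(
    micro_batch_size, gradient_accumulation_steps, num_gpus
)
max_tokens_per_step = sequences_per_step * sequence_len

print(sequences_per_step)   # 72
print(max_tokens_per_step)  # 294912
```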

Steps to reproduce

1. Run the config (presumably any sharegpt datasets plus any completion dataset will cause the problem).
2. Wait for the datasets to tokenize.
3. Observe the freeze.
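
To make the last step less subjective, the launch can be wrapped in a timeout so that a hang is reported instead of waited on indefinitely. This is a rough sketch, not part of axolotl: run_with_timeout is a hypothetical helper, and the timeout value in the usage comment is an arbitrary placeholder that would need to exceed the normal tokenization time.

```python
import subprocess
import sys


def run_with_timeout(cmd: list[str], timeout_s: float) -> str:
    """Return "finished" if cmd exits in time, "timed out" if it had to be killed."""
    try:
        # subprocess.run kills the direct child when the timeout expires.
        # Caveat: worker processes started by a launcher may outlive it.
        subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return "timed out"
    return "finished"


# Hypothetical usage with the command from this report (timeout is a placeholder):
# run_with_timeout(
#     ["accelerate", "launch", "--use_deepspeed", "-m", "axolotl.cli.train",
#      "axolotl_bittensor_llama3_finetuning.yaml"],
#     timeout_s=3600,
# )
```

A "timed out" result only shows that the run did not finish in the allotted time; it does not say where the process was stuck.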

Rolling back to commit 5f58555bd0dbf15cae25fc021eb00421e53e47b2 does not seem to have helped.

Possible solution

No response

Which Operating Systems are you using?

[X] Linux
[ ] macOS
[ ] Windows

Python Version

3.11

axolotl branch-commit

main/c86c32a

Acknowledgements

[X] My issue title is concise, descriptive, and in title casing.
[X] I have searched the existing issues to make sure this bug has not been reported yet.
[X] I am using the latest version of axolotl.
[X] I have provided enough information for the maintainers to reproduce and diagnose the issue.

axolotl-ai-cloud / axolotl