18907305772 / FuseAI

FuseAI Project
https://huggingface.co/FuseAI

Encountering NaN grad_norm and loss values when training with DeepSpeed and OrionForCausalLM model #9

Closed (sigridjineth closed this issue 5 months ago)

sigridjineth commented 5 months ago

Dear FuseLLM author,

I am currently attempting to use FuseLLM to fine-tune Korean models with DeepSpeed, configuring OrionStarAI/Orion-14B-Base as the base model and beomi/OPEN-SOLAR-KO-10.7B and beomi/Yi-Ko-6B as the blending models.

However, I am encountering NaN (Not a Number) values for grad_norm and loss during training. I suspect the issue might be related to changing the base model to OrionForCausalLM. I would greatly appreciate your help in resolving this problem.

Problem Description:

When I initiate training using DeepSpeed with the OrionForCausalLM model and flash attention turned on, I observe the following behavior (grad_norm is NaN from the very first step):

 {'loss': 0.0, 'grad_norm': tensor(nan, device='cuda:0'), 'learning_rate': 4.351851851851852e-06, 'epoch': 0.58}

Even with flash attention turned off, the first batch completes without NaN values, but from the second batch onward I encounter the same NaN grad_norm issue, as shown below.

{'loss': 2.2466, 'grad_norm': tensor(nan, device='cuda:0'), 'learning_rate': 0.0, 'epoch': 0.01}
  2%|▏         | 2/110 [01:28<1:15:24, 41.90s/it]
{'loss': 0.0, 'grad_norm': tensor(nan, device='cuda:0'), 'learning_rate': 1e-05, 'epoch': 0.02}

As you can see, the grad_norm and loss values become NaN early in the training process. I have tried reducing the learning rate, but the results remain similar. This leads me to suspect that there might be an issue with the dataset or the compatibility between FuseLLM and the OrionForCausalLM model.
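
For context, here is a minimal sketch of the kind of gradient check that could help locate where the NaNs first appear (an illustrative example, not FuseLLM code: model is a placeholder for the loaded model, and parameter hooks may need adjustment under DeepSpeed ZeRO partitioning):

import torch

def attach_nan_grad_hooks(model):
    # Print the name of any parameter whose gradient becomes NaN/Inf during backward,
    # so the offending layer can be identified before the optimizer step.
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                print(f"non-finite gradient detected in: {name}")
            return grad
        return hook
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name))

# usage: call attach_nan_grad_hooks(model) after the model is built, before trainer.train()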

Attempted Solutions:

I have attempted the following steps to address the issue:

Request for Assistance:

I would greatly appreciate your guidance on the following aspects:

  1. Are there any known compatibility issues between FuseLLM and the OrionForCausalLM model that could lead to NaN values during training?
  2. Are there any specific considerations or modifications required when using FuseLLM with the OrionForCausalLM model?
  3. Could you provide suggestions on how to debug and identify the root cause of the NaN values in grad_norm and loss?
  4. Are there any recommended steps or techniques to stabilize the training process and prevent NaN values from occurring?

I would be grateful for any insights or advice you can offer to help me resolve this issue. I am keen on successfully fine-tuning the OrionForCausalLM model using FuseLLM and would appreciate your expertise in overcoming this obstacle.

Here is the link for the jupyter notebook that I have used on A100 x8: https://drive.google.com/file/d/1woAJvmJNhjF_abtZDOvo8MXXP54KVScr/view?usp=sharing

Thank you in advance for your time and assistance.

18907305772 commented 5 months ago

Hello @sigridjineth, could you please train OrionStarAI/Orion-14B-Base using the raw dataset and monitor the training loss?

sigridjineth commented 5 months ago

@18907305772 What do you mean by training with the raw dataset for the Orion base? I am not the owner of the Orion base, so I do not know what dataset was used during its pre-training :(

18907305772 commented 5 months ago

I apologize for not describing this clearly. We first need to check the loss when continuing pre-training of OrionStarAI/Orion-14B-Base directly on Raw_koen_v2 (setting '--do_distill False'). This will allow us to determine whether the NaN loss is due to a problem with FuseLLM. Here is an example script.

export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

deepspeed --master_port=20001 ./src/train.py \
  --training_mode full \
  --deepspeed ./config/zero_stage2_config.json \
  --model_name_or_path "<path_to_llama_2_7b>" \
  --output_dir "<path_to_save_fusellm_7b>" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "<path_to_tknzed_minipile>" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-5 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn True \
  --report_to tensorboard 2>&1 | tee "<path_to_log_file>"

To obtain <path_to_tknzed_minipile>, you need to execute the following script.

python ./src/utils/tokenize_and_patch_dataset.py \
  --model_name_or_path "<path_to_llama_2_7b>" \
  --dataset "<path_to_minipile>" \
  --dataset_save_dir "<path_to_tknzed_minipile>" \
  --cache_dir "<path_to_cache_dir>" \
  --block_size 2048 \
  --preprocessing_num_workers 80 \
  --content_key "text"

It is recommended that you customize these scripts according to your specific settings.

sigridjineth commented 5 months ago

@18907305772 Hey, I have followed your instructions and found that continued pre-training of Orion-14B-Base shows no issues so far:

{'loss': 2.3307, 'grad_norm': 1.305690641571091, 'learning_rate': 5.636491524084063e-06, 'epoch': 0.0}

  0%|          | 24/35006 [01:38<39:14:05,  4.04s/it]wandb: WARNING (User provided step: 200 is less than current step: 201. Dropping entry: {'Train/Samples/train_loss': 2.2003941535949707, '_timestamp': 1710246690.0791464}).
wandb: WARNING (User provided step: 210 is less than current step: 211. Dropping entry: {'Train/Samples/train_loss': 2.307175636291504, '_timestamp': 1710246694.1013865}).
wandb: WARNING (User provided step: 220 is less than current step: 221. Dropping entry: {'Train/Samples/train_loss': 2.2487869262695312, '_timestamp': 1710246698.1591682}).
wandb: WARNING (User provided step: 230 is less than current step: 231. Dropping entry: {'Train/Samples/train_loss': 2.2880337238311768, '_timestamp': 1710246702.1747313}).

  0%|          | 25/35006 [01:42<39:17:04,  4.04s/it]

{'loss': 2.2251, 'grad_norm': 1.2707274783657372, 'learning_rate': 5.7088920680623985e-06, 'epoch': 0.0}

  0%|          | 25/35006 [01:42<39:17:04,  4.04s/it]
  0%|          | 26/35006 [01:46<39:15:51,  4.04s/it]

{'loss': 2.2888, 'grad_norm': 1.2836060413004062, 'learning_rate': 5.778452632186889e-06, 'epoch': 0.0}

  0%|          | 26/35006 [01:46<39:15:51,  4.04s/it]
  0%|          | 27/35006 [01:50<39:17:07,  4.04s/it]

{'loss': 2.2439, 'grad_norm': 1.2832659312417514, 'learning_rate': 5.845387633966951e-06, 'epoch': 0.0}

  0%|          | 27/35006 [01:50<39:17:07,  4.04s/it]

The following is the code that I ran to debug, following your suggestions:

!export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

!python ./FuseLLM/FuseLLM/src/utils/tokenize_and_patch_dataset.py \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --dataset "/home/sionic/sigrid/fusellm-test/datasets/Raw_koen_v2" \
  --dataset_save_dir "/home/sionic/sigrid/fusellm-test/datasets/patch/Raw_koen_v2" \
  --cache_dir "/home/sionic/sigrid/fusellm-test/cache_dir/patch/Raw_koen_v2" \
  --block_size 2048 \
  --preprocessing_num_workers 80 \
  --content_key "text"

from datasets import load_from_disk, DatasetDict

dataset = load_from_disk("/home/sionic/sigrid/fusellm-test/datasets/patch/Raw_koen_v2")

train_valid_split = dataset['train'].train_test_split(test_size=0.1)  # use 10% as the validation split
train_dataset = train_valid_split['train']
valid_dataset = train_valid_split['test']  # the 'test' split from train_test_split serves as the validation set

new_dataset = DatasetDict({
    'train': train_dataset,
    'valid': valid_dataset
})

dataset_save_dir = "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch"

new_dataset.save_to_disk(dataset_save_dir)
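
As a quick sanity check (an added snippet, assuming the paths above), the saved dataset can be reloaded to confirm the split names and sizes before pointing --dataset_name at it:

from datasets import load_from_disk

check = load_from_disk("/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch")
print(check)  # expect a DatasetDict with 'train' and 'valid' splits
print({split: len(ds) for split, ds in check.items()})
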
!export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# https://blog.csdn.net/weixin_43013480/article/details/135674034

import os
get_ipython().system = os.system
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output_small" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-5 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn True \
  --report_to wandb 2>&1 > ./240312-patch-small.txt 2>&1 &

sigridjineth commented 5 months ago

See the wandb log in the attached screenshot (Screenshot 2024-03-12 at 9.46.41 PM).

18907305772 commented 5 months ago

Well, it seems there is nothing wrong with plain causal language model training. For FuseLLM training, I noticed that you did not include the --do_distill True parameter in your previous training script. Could you please execute the script again with it enabled?

!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "beomi/llama-2-ko-7b" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1" \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-7 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn False \
  --do_distill \
  --distill_with_ref_model True \
  --distill_with_aligned_model_0 True \
  --distill_with_aligned_model_1 True \
  --distill_loss_type "ce" \
  --distill_teacher_temperature 1.0 \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1

sigridjineth commented 5 months ago

@18907305772 when running your command, I got this error.

!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "beomi/llama-2-ko-7b" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-7 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn False \
  --do_distill \
  --distill_with_ref_model True \
  --distill_with_aligned_model_0 True \
  --distill_with_aligned_model_1 True \
  --distill_loss_type "ce" \
  --distill_teacher_temperature 1.0 \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1

Error Stack:

03/12/2024 21:08:54 - INFO - utils.common - Training/Evaluation Args: Namespace(model_name_or_path='beomi/llama-2-ko-7b', dataset_name='/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', max_train_samples=None, max_eval_samples=None, max_predict_samples=None, overwrite_cache=False, preprocessing_num_workers=64, output_dir='/home/sionic/sigrid/fusellm-test/models/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=32, eval_accumulation_steps=None, eval_delay=0, learning_rate=1e-07, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.008, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/sionic/sigrid/fusellm-test/models/output/runs/Mar12_21-08-52_iZmj7ir0ircgij46j89st9Z', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=1, save_safetensors=True, save_on_each_node=False, save_only_model=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=1, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=10, dataloader_prefetch_factor=None, past_index=-1, run_name='/home/sionic/sigrid/fusellm-test/models/output', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True), deepspeed='/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, 
include_num_input_tokens_seen=False, neftune_noise_alpha=None, sortish_sampler=False, predict_with_generate=False, generation_max_length=None, generation_num_beams=None, generation_config=GenerationConfig {
  "do_sample": true,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
, training_mode='full', use_flash_attn=False, cache_dir=None, model_max_length=2048, adam8bit=False, double_quant=True, quant_type='nf4', bits=4, lora_r=64, lora_alpha=16, lora_dropout=0.0, max_memory_MB=40000, do_distill=True, distill_with_ref_model=True, distill_with_aligned_model_0=True, distill_with_aligned_model_1=True, distill_loss_type='ce', distill_teacher_temperature=1.0, lm_loss_weight=0.9, distill_greater_as_gt=True, distill_greater_as_gt_type='hard', distill_weighted_as_gt=False, distill_weighted_as_gt_type='hard', distributed_state=Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 5
Process index: 1
Local process index: 1
Device: cuda:1
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=1), deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f6ac88523e0>, gradient_accumulation_steps=32, gradient_clipping=1.0, zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=False), hf_deepspeed_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f6ac88523e0>)
03/12/2024 21:08:54 - INFO - utils.others - Loading tokenizer.
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 40, in train
    tokenizer, model = load_tokenizer_and_model(args)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/common.py", line 43, in load_tokenizer_and_model
    tokenizer, kwargs = get_tokenizer(args.model_name_or_path, args.cache_dir, args.model_max_length)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/others.py", line 69, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 212, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
03/12/2024 21:08:54 - INFO - utils.common - Training/Evaluation Args: Namespace(model_name_or_path='beomi/llama-2-ko-7b', dataset_name='/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', max_train_samples=None, max_eval_samples=None, max_predict_samples=None, overwrite_cache=False, preprocessing_num_workers=64, output_dir='/home/sionic/sigrid/fusellm-test/models/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=32, eval_accumulation_steps=None, eval_delay=0, learning_rate=1e-07, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.008, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/sionic/sigrid/fusellm-test/models/output/runs/Mar12_21-08-52_iZmj7ir0ircgij46j89st9Z', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=1, save_safetensors=True, save_on_each_node=False, save_only_model=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=4, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=10, dataloader_prefetch_factor=None, past_index=-1, run_name='/home/sionic/sigrid/fusellm-test/models/output', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True), deepspeed='/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, 
include_num_input_tokens_seen=False, neftune_noise_alpha=None, sortish_sampler=False, predict_with_generate=False, generation_max_length=None, generation_num_beams=None, generation_config=GenerationConfig {
  "do_sample": true,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
, training_mode='full', use_flash_attn=False, cache_dir=None, model_max_length=2048, adam8bit=False, double_quant=True, quant_type='nf4', bits=4, lora_r=64, lora_alpha=16, lora_dropout=0.0, max_memory_MB=40000, do_distill=True, distill_with_ref_model=True, distill_with_aligned_model_0=True, distill_with_aligned_model_1=True, distill_loss_type='ce', distill_teacher_temperature=1.0, lm_loss_weight=0.9, distill_greater_as_gt=True, distill_greater_as_gt_type='hard', distill_weighted_as_gt=False, distill_weighted_as_gt_type='hard', distributed_state=Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 5
Process index: 4
Local process index: 4
Device: cuda:4
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=4), deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f924ec82350>, gradient_accumulation_steps=32, gradient_clipping=1.0, zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=False), hf_deepspeed_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f924ec82350>)
03/12/2024 21:08:54 - INFO - utils.others - Loading tokenizer.
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 40, in train
    tokenizer, model = load_tokenizer_and_model(args)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/common.py", line 43, in load_tokenizer_and_model
    tokenizer, kwargs = get_tokenizer(args.model_name_or_path, args.cache_dir, args.model_max_length)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/others.py", line 69, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 212, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
03/12/2024 21:08:54 - INFO - utils.common - Training/Evaluation Args: Namespace(model_name_or_path='beomi/llama-2-ko-7b', dataset_name='/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', max_train_samples=None, max_eval_samples=None, max_predict_samples=None, overwrite_cache=False, preprocessing_num_workers=64, output_dir='/home/sionic/sigrid/fusellm-test/models/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=32, eval_accumulation_steps=None, eval_delay=0, learning_rate=1e-07, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.008, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/sionic/sigrid/fusellm-test/models/output/runs/Mar12_21-08-52_iZmj7ir0ircgij46j89st9Z', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=1, save_safetensors=True, save_on_each_node=False, save_only_model=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=0, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=10, dataloader_prefetch_factor=None, past_index=-1, run_name='/home/sionic/sigrid/fusellm-test/models/output', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True), deepspeed='/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, 
include_num_input_tokens_seen=False, neftune_noise_alpha=None, sortish_sampler=False, predict_with_generate=False, generation_max_length=None, generation_num_beams=None, generation_config=GenerationConfig {
  "do_sample": true,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
, training_mode='full', use_flash_attn=False, cache_dir=None, model_max_length=2048, adam8bit=False, double_quant=True, quant_type='nf4', bits=4, lora_r=64, lora_alpha=16, lora_dropout=0.0, max_memory_MB=40000, do_distill=True, distill_with_ref_model=True, distill_with_aligned_model_0=True, distill_with_aligned_model_1=True, distill_loss_type='ce', distill_teacher_temperature=1.0, lm_loss_weight=0.9, distill_greater_as_gt=True, distill_greater_as_gt_type='hard', distill_weighted_as_gt=False, distill_weighted_as_gt_type='hard', distributed_state=Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 5
Process index: 0
Local process index: 0
Device: cuda:0
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=0), deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f8a8d57e1a0>, gradient_accumulation_steps=32, gradient_clipping=1.0, zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=False), hf_deepspeed_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f8a8d57e1a0>)
03/12/2024 21:08:54 - INFO - utils.others - Loading tokenizer.
03/12/2024 21:08:55 - INFO - utils.common - Training/Evaluation Args: Namespace(model_name_or_path='beomi/llama-2-ko-7b', dataset_name='/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', max_train_samples=None, max_eval_samples=None, max_predict_samples=None, overwrite_cache=False, preprocessing_num_workers=64, output_dir='/home/sionic/sigrid/fusellm-test/models/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=32, eval_accumulation_steps=None, eval_delay=0, learning_rate=1e-07, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.008, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/sionic/sigrid/fusellm-test/models/output/runs/Mar12_21-08-52_iZmj7ir0ircgij46j89st9Z', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=1, save_safetensors=True, save_on_each_node=False, save_only_model=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=2, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=10, dataloader_prefetch_factor=None, past_index=-1, run_name='/home/sionic/sigrid/fusellm-test/models/output', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True), deepspeed='/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, 
include_num_input_tokens_seen=False, neftune_noise_alpha=None, sortish_sampler=False, predict_with_generate=False, generation_max_length=None, generation_num_beams=None, generation_config=GenerationConfig {
  "do_sample": true,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
, training_mode='full', use_flash_attn=False, cache_dir=None, model_max_length=2048, adam8bit=False, double_quant=True, quant_type='nf4', bits=4, lora_r=64, lora_alpha=16, lora_dropout=0.0, max_memory_MB=40000, do_distill=True, distill_with_ref_model=True, distill_with_aligned_model_0=True, distill_with_aligned_model_1=True, distill_loss_type='ce', distill_teacher_temperature=1.0, lm_loss_weight=0.9, distill_greater_as_gt=True, distill_greater_as_gt_type='hard', distill_weighted_as_gt=False, distill_weighted_as_gt_type='hard', distributed_state=Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 5
Process index: 2
Local process index: 2
Device: cuda:2
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=2), deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f14718ae5c0>, gradient_accumulation_steps=32, gradient_clipping=1.0, zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=False), hf_deepspeed_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f14718ae5c0>)
03/12/2024 21:08:55 - INFO - utils.common - Training/Evaluation Args: Namespace(model_name_or_path='beomi/llama-2-ko-7b', dataset_name='/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', max_train_samples=None, max_eval_samples=None, max_predict_samples=None, overwrite_cache=False, preprocessing_num_workers=64, output_dir='/home/sionic/sigrid/fusellm-test/models/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=32, eval_accumulation_steps=None, eval_delay=0, learning_rate=1e-07, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.008, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/sionic/sigrid/fusellm-test/models/output/runs/Mar12_21-08-52_iZmj7ir0ircgij46j89st9Z', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=1, save_safetensors=True, save_on_each_node=False, save_only_model=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=3, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=10, dataloader_prefetch_factor=None, past_index=-1, run_name='/home/sionic/sigrid/fusellm-test/models/output', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True), deepspeed='/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, 
include_num_input_tokens_seen=False, neftune_noise_alpha=None, sortish_sampler=False, predict_with_generate=False, generation_max_length=None, generation_num_beams=None, generation_config=GenerationConfig {
  "do_sample": true,
  "max_length": 4096,
  "temperature": 0.6,
  "top_p": 0.9
}
, training_mode='full', use_flash_attn=False, cache_dir=None, model_max_length=2048, adam8bit=False, double_quant=True, quant_type='nf4', bits=4, lora_r=64, lora_alpha=16, lora_dropout=0.0, max_memory_MB=40000, do_distill=True, distill_with_ref_model=True, distill_with_aligned_model_0=True, distill_with_aligned_model_1=True, distill_loss_type='ce', distill_teacher_temperature=1.0, lm_loss_weight=0.9, distill_greater_as_gt=True, distill_greater_as_gt_type='hard', distill_weighted_as_gt=False, distill_weighted_as_gt_type='hard', distributed_state=Distributed environment: DEEPSPEED  Backend: nccl
Num processes: 5
Process index: 3
Local process index: 3
Device: cuda:3
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=3), deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7fa70fa723b0>, gradient_accumulation_steps=32, gradient_clipping=1.0, zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=False), hf_deepspeed_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7fa70fa723b0>)
03/12/2024 21:08:55 - INFO - utils.others - Loading tokenizer.
03/12/2024 21:08:55 - INFO - utils.others - Loading tokenizer.
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 40, in train
    tokenizer, model = load_tokenizer_and_model(args)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/common.py", line 43, in load_tokenizer_and_model
    tokenizer, kwargs = get_tokenizer(args.model_name_or_path, args.cache_dir, args.model_max_length)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/others.py", line 69, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 212, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 40, in train
    tokenizer, model = load_tokenizer_and_model(args)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/common.py", line 43, in load_tokenizer_and_model
    tokenizer, kwargs = get_tokenizer(args.model_name_or_path, args.cache_dir, args.model_max_length)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/others.py", line 69, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 212, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
[2024-03-12 21:08:55,545] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3243285
[2024-03-12 21:08:55,614] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3243286
[2024-03-12 21:08:55,629] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3243287
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 40, in train
    tokenizer, model = load_tokenizer_and_model(args)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/common.py", line 43, in load_tokenizer_and_model
    tokenizer, kwargs = get_tokenizer(args.model_name_or_path, args.cache_dir, args.model_max_length)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/others.py", line 69, in get_tokenizer
    tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
    return cls._from_pretrained(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
    self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 212, in get_spm_processor
    with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
[2024-03-12 21:08:55,774] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3243288
[2024-03-12 21:08:55,923] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3243289
[2024-03-12 21:08:55,923] [ERROR] [launch.py:322:sigkill_handler] ['/home/sionic/.venv/bin/python', '-u', './FuseLLM/FuseLLM/src/train.py', '--local_rank=4', '--training_mode', 'full', '--deepspeed', '/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', '--model_name_or_path', 'beomi/llama-2-ko-7b', '--output_dir', '/home/sionic/sigrid/fusellm-test/models/output', '--model_max_length', '2048', '--logging_steps', '1', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--evaluation_strategy', 'steps', '--per_device_eval_batch_size', '1', '--logging_strategy', 'steps', '--do_train', '--do_eval', '--bf16', 'True', '--tf32', 'True', '--warmup_ratio', '0.008', '--lr_scheduler_type', 'cosine', '--dataset_name', '/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '32', '--num_train_epochs', '1', '--eval_steps', '500', '--optim', 'adamw_torch', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--learning_rate', '1e-7', '--weight_decay', '0.1', '--max_grad_norm', '1.0', '--seed', '42', '--gradient_checkpointing', 'True', '--use_flash_attn', 'False', '--do_distill', '--distill_with_ref_model', 'True', '--distill_with_aligned_model_0', 'True', '--distill_with_aligned_model_1', 'True', '--distill_loss_type', 'ce', '--distill_teacher_temperature', '1.0', '--lm_loss_weight', '0.9', '--distill_greater_as_gt', 'True', '--distill_greater_as_gt_type', 'hard', '--dataloader_num_workers', '10', '--report_to', 'wandb', '--remove_unused_columns', 'False'] exits with return code = 1

sigridjineth commented 5 months ago

@18907305772 I changed the command a little, but enabling the distill option still seems to cause a problem.

!export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

# https://blog.csdn.net/weixin_43013480/article/details/135674034

import os
get_ipython().system = os.system
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"

!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output_small" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16 \
  --num_train_epochs 1 \
  --do_distill True \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-5 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn True \
  --report_to wandb 2>&1 > ./240312-patch-small.txt 2>&1 &

Error Stack


  0%|          | 0/4375 [00:00<?, ?it/s]Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
    train_result = trainer.train()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
    base_seq_len = len(features["per_step_logits"][i])
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'per_step_logits'
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
    train_result = trainer.train()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
    base_seq_len = len(features["per_step_logits"][i])
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'per_step_logits'
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
    train_result = trainer.train()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
    base_seq_len = len(features["per_step_logits"][i])
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'per_step_logits'
(the same traceback is printed by each DeepSpeed rank)
[2024-03-12 21:15:22,632] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244460
[2024-03-12 21:15:22,862] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244461
[2024-03-12 21:15:22,878] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244462
[2024-03-12 21:15:22,878] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244463
[2024-03-12 21:15:22,892] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244464
[2024-03-12 21:15:22,905] [ERROR] [launch.py:322:sigkill_handler] ['/home/sionic/.venv/bin/python', '-u', './FuseLLM/FuseLLM/src/train.py', '--local_rank=4', '--training_mode', 'full', '--deepspeed', '/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', '--model_name_or_path', 'OrionStarAI/Orion-14B-Base', '--output_dir', '/home/sionic/sigrid/fusellm-test/models/output_small', '--model_max_length', '2048', '--logging_steps', '1', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--evaluation_strategy', 'steps', '--per_device_eval_batch_size', '1', '--logging_strategy', 'steps', '--do_train', '--do_eval', '--bf16', 'True', '--tf32', 'True', '--warmup_ratio', '0.008', '--lr_scheduler_type', 'cosine', '--dataset_name', '/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '16', '--num_train_epochs', '1', '--do_distill', 'True', '--eval_steps', '500', '--optim', 'adamw_torch', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--learning_rate', '1e-5', '--weight_decay', '0.1', '--max_grad_norm', '1.0', '--seed', '42', '--gradient_checkpointing', 'True', '--use_flash_attn', 'True', '--report_to', 'wandb'] exits with return code = 1
18907305772 commented 5 months ago

For this error (https://github.com/18907305772/FuseLLM/issues/9#issuecomment-1991623712), you should change the code to load the tokenizer with use_fast=True.
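For reference, a minimal sketch of what that change amounts to when loading the tokenizer with transformers (the model id and max length below are placeholders for illustration, not FuseLLM's actual defaults):

from transformers import AutoTokenizer

# Hypothetical example: load the fast (Rust-based) tokenizer implementation.
# The model id and model_max_length are placeholders for illustration only.
tokenizer = AutoTokenizer.from_pretrained(
    "beomi/llama-2-ko-7b",
    use_fast=True,            # the change suggested above
    trust_remote_code=True,   # some community tokenizers require this
    model_max_length=2048,
)
print(type(tokenizer))        # a *Fast tokenizer class confirms use_fast took effect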


sigridjineth commented 5 months ago

@18907305772

Base Model: beomi/llama-2-ko-7b (Dataset: the output of tokenize_and_patch_dataset.py)

Deepspeed Command

!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "beomi/llama-2-ko-7b" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-7 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn False \
  --use_fast=True \
  --do_distill \
  --distill_with_ref_model True \
  --distill_with_aligned_model_0 True \
  --distill_with_aligned_model_1 True \
  --distill_loss_type "ce" \
  --distill_teacher_temperature 1.0 \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1

Thanks for letting me know. I have manually turned use_fast on by modifying the get_tokenizer method in others.py:

# get tokenizer
def get_tokenizer(model_name_or_path, cache_dir, model_max_length, use_fast):
    kwargs = {"use_fast": False, "tokenizer_trust_remote_code": False, "model_trust_remote_code": False}
    if "beomi" in model_name_or_path.lower():
        kwargs["use_fast"] = True
        kwargs["tokenizer_trust_remote_code"] = True
        kwargs["model_trust_remote_code"] = True
    elif "llama" in model_name_or_path.lower():
        kwargs["use_fast"] = False
        kwargs["tokenizer_trust_remote_code"] = False
        kwargs["model_trust_remote_code"] = False

but I am consistently getting a KeyError for per_step_logits, as shown here: https://github.com/18907305772/FuseLLM/issues/9#issuecomment-1991632594

Error Stack

Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
    train_result = trainer.train()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
    base_seq_len = len(features["per_step_logits"][i])
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'per_step_logits'

Base Model: OrionStarAI/Orion-14B-Base (Dataset: the output of tokenize_and_patch_dataset.py)

I get the same KeyError for per_step_logits when running like this with either base model, llama2-ko or Orion.

Deepspeed Command

!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-7 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn False \
  --use_fast=True \
  --do_distill \
  --distill_with_ref_model True \
  --distill_with_aligned_model_0 True \
  --distill_with_aligned_model_1 True \
  --distill_loss_type "ce" \
  --distill_teacher_temperature 1.0 \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output_orion.log 2>&1

Error Stack

KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
    base_seq_len = len(features["per_step_logits"][i])
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'per_step_logits'

Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
    train_result = trainer.train()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
    return inner_training_loop(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
    current_batch = next(dataloader_iter)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
    base_seq_len = len(features["per_step_logits"][i])
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
    return self.data[item]
KeyError: 'per_step_logits'
18907305772 commented 5 months ago

Please update the value of the --dataset_name parameter to 240311_dataset_1. This dataset contains per_step_logits, per_step_aligned_logits_0, and per_step_aligned_logits_1.
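One way to confirm that a saved dataset actually carries these columns before launching training (a small sketch; the path is a placeholder for whatever you pass to --dataset_name):

from datasets import DatasetDict, load_from_disk

# Placeholder path; point this at the directory passed to --dataset_name.
ds = load_from_disk("/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1")
train_split = ds["train"] if isinstance(ds, DatasetDict) else ds
print(train_split.column_names)
# With --do_distill True, the collator expects columns such as
# per_step_logits, per_step_aligned_logits_0 and per_step_aligned_logits_1.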

18907305772 commented 5 months ago

Here is the updated script.

!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "beomi/llama-2-ko-7b" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --evaluation_strategy steps \
  --per_device_eval_batch_size 1 \
  --logging_strategy steps \
  --do_train \
  --do_eval \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1" \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --num_train_epochs 1 \
  --eval_steps 500 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-7 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --gradient_checkpointing True \
  --use_flash_attn False \
  --do_distill \
  --distill_with_ref_model True \
  --distill_with_aligned_model_0 True \
  --distill_with_aligned_model_1 True \
  --distill_loss_type "ce" \
  --distill_teacher_temperature 1.0 \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1
sigridjineth commented 5 months ago

@18907305772 The base model I have chosen is not llama2-7b but Orion 14B, which is based on the Llama architecture.

I ran your script above with only a small change, setting --model_name_or_path "OrionStarAI/Orion-14B-Base", and got this:

Removed shared tensor {'model.layers.11.self_attn.k_proj.weight', 'model.layers.18.self_attn.q_proj.weight', ... (the self_attn q/k/v/o_proj and mlp gate/up/down_proj weights of layers 0-39; full list truncated) ...} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
{'loss': 9.7239, 'grad_norm': 92.50677642180415, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 15.4474, 'train_samples_per_second': 1.877, 'train_steps_per_second': 0.065, 'train_loss': 9.723894119262695, 'epoch': 1.0}
***** train metrics *****
  epoch                    =        1.0
  train_loss               =     9.7239
  train_runtime            = 0:00:15.44
  train_samples_per_second =      1.877
  train_steps_per_second   =      0.065

but the training run is so short that I cannot yet tell whether it is working correctly.
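As a side note, the "Removed shared tensor ... while saving" warning above suggests verifying the checkpoint by reloading it; a minimal sketch, assuming the checkpoint was written to the --output_dir used above:

from transformers import AutoModelForCausalLM

# Placeholder path: the directory written by --output_dir in the command above.
model = AutoModelForCausalLM.from_pretrained(
    "/home/sionic/sigrid/fusellm-test/models/output",
    trust_remote_code=True,   # Orion ships custom modeling code
)
# If this reloads without missing/unexpected key warnings, the saved weights
# are most likely intact despite the shared-tensor message.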

18907305772 commented 5 months ago

I found the same issue in other projects. Maybe this will help: https://github.com/huggingface/transformers/issues/27293#issuecomment-1815681831

sigridjineth commented 5 months ago

@18907305772 okay, I found that even the Orion base model trains fine with your suggestion (using the tokenize_and_patch_dataset.py output, without blending).

It gives no NaN issues, so I suspect that something in FuseLLM is incompatible with Orion, but I don't know the root cause of the problem.
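To narrow down where the NaN first appears during the FuseLLM (distillation) run, one option is a gradient hook; this is a debugging sketch in plain PyTorch, not FuseLLM-specific code:

import torch

def add_nan_hooks(model: torch.nn.Module) -> None:
    """Print the parameters whose gradients become non-finite."""
    for name, param in model.named_parameters():
        if param.requires_grad:
            def _check(grad, name=name):
                if not torch.isfinite(grad).all():
                    print(f"non-finite gradient in {name}")
                return grad
            param.register_hook(_check)

# Alternatively, torch.autograd.set_detect_anomaly(True) points at the
# backward op that produced the NaN (slow; enable it for a few steps only).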

Can you check the jupyter notebook below to debug?

https://drive.google.com/file/d/1ROj4F_FWsdaF6QGlEI2arMnBJ5P2xtWE/view?usp=sharing

Thanks for your help!

18907305772 commented 5 months ago

I suspect that there is something wrong with the transformers version and the safe_serialization parameter. You can use transformers==4.35.1 and safe_serialization=False, following https://github.com/18907305772/FuseLLM/issues/9#issuecomment-1993546418. This issue seems to affect saving the model, but it should not affect the training process.
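If it helps to exercise the saving path in isolation, the same option is also available directly on save_pretrained; a sketch using a tiny placeholder model rather than the actual Orion checkpoint:

# pip install transformers==4.35.1   (version suggested above)
from transformers import AutoModelForCausalLM

# Tiny placeholder model, only to illustrate the parameter; substitute the
# fine-tuned checkpoint in practice.
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")
model.save_pretrained(
    "/tmp/save-check",          # placeholder output directory
    safe_serialization=False,   # write pytorch_model.bin instead of model.safetensors
)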

sigridjineth commented 5 months ago

@18907305772 okay, I am retrying the whole process from scratch, but got this issue:

Deepspeed Command

!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
  --training_mode full \
  --deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
  --model_name_or_path "OrionStarAI/Orion-14B-Base" \
  --output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
  --model_max_length 2048 \
  --logging_steps 1 \
  --save_strategy steps \
  --save_steps 500 \
  --save_total_limit 1 \
  --logging_strategy steps \
  --do_train \
  --bf16 True \
  --tf32 True \
  --warmup_ratio 0.008 \
  --lr_scheduler_type cosine \
  --dataset_name "/home/sionic/sigrid/fusellm-test/datasets/packing/240313_packing_set" \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 32 \
  --num_train_epochs 1 \
  --optim adamw_torch \
  --adam_beta1 0.9 \
  --adam_beta2 0.95 \
  --learning_rate 1e-7 \
  --weight_decay 0.1 \
  --max_grad_norm 1.0 \
  --seed 42 \
  --use_flash_attn True \
  --lm_loss_weight 0.9 \
  --distill_greater_as_gt True \
  --distill_greater_as_gt_type "hard" \
  --dataloader_num_workers 10 \
  --report_to wandb \
  --gradient_checkpointing False \
  --remove_unused_columns False \
  --safe_serialization False

Is there something I have missed during the training process?

[2024-03-13 17:54:42,122] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3362901
Traceback (most recent call last):
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
    train()
  File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
    train_result = trainer.train()
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
    return inner_training_loop(
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2748, in compute_loss
    outputs = model(**inputs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
TypeError: OrionForCausalLM.forward() got an unexpected keyword argument 'per_step_logits'
sigridjineth commented 5 months ago

@18907305772 Changing the base model to llama2-7b gives the same error: TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'per_step_logits'. The issue appears intermittently, which makes it very difficult to debug.

The following is the dataset_info.json that was generated after running packing.py:

{
  "citation": "",
  "description": "",
  "features": {
    "input_ids": {
      "feature": {
        "dtype": "int32",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "attention_mask": {
      "feature": {
        "dtype": "int8",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "labels": {
      "feature": {
        "dtype": "int64",
        "_type": "Value"
      },
      "_type": "Sequence"
    },
    "per_step_logits": {
      "feature": {
        "feature": {
          "dtype": "float16",
          "_type": "Value"
        },
        "_type": "Sequence"
      },
      "_type": "Sequence"
    },
    "per_step_indices": {
      "feature": {
        "feature": {
          "dtype": "int64",
          "_type": "Value"
        },
        "_type": "Sequence"
      },
      "_type": "Sequence"
    },
    "metric_ce": {
      "dtype": "float64",
      "_type": "Value"
    },
    "per_step_aligned_logits_0": {
      "feature": {
        "feature": {
          "dtype": "float64",
          "_type": "Value"
        },
        "_type": "Sequence"
      },
      "_type": "Sequence"
    },
    "per_step_aligned_indices_0": {
      "feature": {
        "feature": {
          "dtype": "int64",
          "_type": "Value"
        },
        "_type": "Sequence"
      },
      "_type": "Sequence"
    },
    "metric_ce_aligned_0": {
      "dtype": "float64",
      "_type": "Value"
    }
  },
  "homepage": "",
  "license": ""
}
18907305772 commented 5 months ago

If you use a dataset with generated representations (per_step_logits), you need to set '--do_distill True' as in the script in the README (this will use DistillTrainer and start FuseLLM training). If you use a tokenized dataset without generated representations (per_step_logits), you need to set '--do_distill False' to start causal language model training (this will use the original Trainer).