Hello @sigridjineth, could you please train OrionStarAI/Orion-14B-Base using the raw dataset and monitor the training loss?
@18907305772 What do you mean by using the raw dataset for the Orion base? I am not the owner of the Orion base, so I don't know what dataset was used during its pre-training :(
I apologize for not describing this clearly. We first need to check the loss when continuing to pre-train OrionStarAI/Orion-14B-Base directly on Raw_koen_v2 (setting '--do_distill False'). This will let us determine whether the NaN loss is due to a problem with FuseLLM.
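If the loss does go NaN during this run, a small callback can stop training as soon as it appears (a minimal sketch, assuming the training loop is a standard Hugging Face Trainer):
import math
from transformers import TrainerCallback

class StopOnNanLoss(TrainerCallback):
    # Stop the run as soon as a non-finite loss value is logged.
    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            print(f"Non-finite loss {loss} at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control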
Here is an example script.
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
deepspeed --master_port=20001 ./src/train.py \
--training_mode full \
--deepspeed ./config/zero_stage2_config.json \
--model_name_or_path "<path_to_llama_2_7b>" \
--output_dir "<path_to_save_fusellm_7b>" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "<path_to_tknzed_minipile>" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn True \
--report_to tensorboard 2>&1 | tee "<path_to_log_file>"
To obtain <path_to_tknzed_minipile>, you need to execute the following script.
python ./src/utils/tokenize_and_patch_dataset.py \
--model_name_or_path "<path_to_llama_2_7b>" \
--dataset "<path_to_minipile>" \
--dataset_save_dir "<path_to_tknzed_minipile>" \
--cache_dir "<path_to_cache_dir>" \
--block_size 2048 \
--preprocessing_num_workers 80 \
--content_key "text"
It is recommended that you customize these scripts according to your specific settings.
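As an optional sanity check (a minimal sketch; the 'input_ids' column name is an assumption about the tokenized output), you can reload the saved dataset and confirm the packed block length:
from datasets import load_from_disk

ds = load_from_disk("<path_to_tknzed_minipile>")
print(ds)  # splits and number of packed examples
print(len(ds["train"][0]["input_ids"]))  # expected to equal --block_size (2048)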
@18907305772 Hey, I have followed your instructions and found that continued pre-training of Orion-14B-Base has no issues at the moment:
{'loss': 2.3307, 'grad_norm': 1.305690641571091, 'learning_rate': 5.636491524084063e-06, 'epoch': 0.0}
0%| | 24/35006 [01:38<39:14:05, 4.04s/it]wandb: WARNING (User provided step: 200 is less than current step: 201. Dropping entry: {'Train/Samples/train_loss': 2.2003941535949707, '_timestamp': 1710246690.0791464}).
wandb: WARNING (User provided step: 210 is less than current step: 211. Dropping entry: {'Train/Samples/train_loss': 2.307175636291504, '_timestamp': 1710246694.1013865}).
wandb: WARNING (User provided step: 220 is less than current step: 221. Dropping entry: {'Train/Samples/train_loss': 2.2487869262695312, '_timestamp': 1710246698.1591682}).
wandb: WARNING (User provided step: 230 is less than current step: 231. Dropping entry: {'Train/Samples/train_loss': 2.2880337238311768, '_timestamp': 1710246702.1747313}).
0%| | 25/35006 [01:42<39:17:04, 4.04s/it]
{'loss': 2.2251, 'grad_norm': 1.2707274783657372, 'learning_rate': 5.7088920680623985e-06, 'epoch': 0.0}
0%| | 25/35006 [01:42<39:17:04, 4.04s/it]
0%| | 26/35006 [01:46<39:15:51, 4.04s/it]
{'loss': 2.2888, 'grad_norm': 1.2836060413004062, 'learning_rate': 5.778452632186889e-06, 'epoch': 0.0}
0%| | 26/35006 [01:46<39:15:51, 4.04s/it]
0%| | 27/35006 [01:50<39:17:07, 4.04s/it]
{'loss': 2.2439, 'grad_norm': 1.2832659312417514, 'learning_rate': 5.845387633966951e-06, 'epoch': 0.0}
0%| | 27/35006 [01:50<39:17:07, 4.04s/it]
The following is the code that I ran, per your suggestions, to debug:
!export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
!python ./FuseLLM/FuseLLM/src/utils/tokenize_and_patch_dataset.py \
--model_name_or_path "OrionStarAI/Orion-14B-Base" \
--dataset "/home/sionic/sigrid/fusellm-test/datasets/Raw_koen_v2" \
--dataset_save_dir "/home/sionic/sigrid/fusellm-test/datasets/patch/Raw_koen_v2" \
--cache_dir "/home/sionic/sigrid/fusellm-test/cache_dir/patch/Raw_koen_v2" \
--block_size 2048 \
--preprocessing_num_workers 80 \
--content_key "text"
from datasets import load_from_disk, DatasetDict
dataset = load_from_disk("/home/sionic/sigrid/fusellm-test/datasets/patch/Raw_koen_v2")
train_valid_split = dataset['train'].train_test_split(test_size=0.1)  # use 10% as the validation set
train_dataset = train_valid_split['train']
valid_dataset = train_valid_split['test']  # the 'test' split from train_test_split serves as the validation set
new_dataset = DatasetDict({
'train': train_dataset,
'valid': valid_dataset
})
dataset_save_dir = "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch"
new_dataset.save_to_disk(dataset_save_dir)
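As a quick check (a small sketch that just reloads what was saved above), you can confirm both splits were written:
from datasets import load_from_disk

reloaded = load_from_disk(dataset_save_dir)
print(reloaded)  # expect a DatasetDict with 'train' and 'valid' splits
print(len(reloaded["train"]), len(reloaded["valid"]))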
!export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# https://blog.csdn.net/weixin_43013480/article/details/135674034
import os
get_ipython().system = os.system
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "OrionStarAI/Orion-14B-Base" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output_small" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn True \
--report_to wandb 2>&1 > ./240312-patch-small.txt 2>&1 &
See the wandb log here.
Well, it seems there is nothing wrong with plain causal language model training. For FuseLLM training, I noticed that you did not include the --do_distill True parameter in your previous training script. Could you please execute the script again?
!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "beomi/llama-2-ko-7b" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-7 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn False \
--do_distill \
--distill_with_ref_model True \
--distill_with_aligned_model_0 True \
--distill_with_aligned_model_1 True \
--distill_loss_type "ce" \
--distill_teacher_temperature 1.0 \
--lm_loss_weight 0.9 \
--distill_greater_as_gt True \
--distill_greater_as_gt_type "hard" \
--dataloader_num_workers 10 \
--report_to wandb \
--remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1
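For reference, --lm_loss_weight 0.9 puts 0.9 of the weight on the plain causal LM loss and the remaining 0.1 on the fusion/distillation loss; roughly (a sketch of the assumed combination, not the exact code in src/train.py):
def combine_losses(lm_loss, distill_loss, lm_loss_weight=0.9):
    # total = lm_loss_weight * L_clm + (1 - lm_loss_weight) * L_fusion
    return lm_loss_weight * lm_loss + (1.0 - lm_loss_weight) * distill_loss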
@18907305772 when running your command, I got this error.
!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "beomi/llama-2-ko-7b" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-7 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn False \
--do_distill \
--distill_with_ref_model True \
--distill_with_aligned_model_0 True \
--distill_with_aligned_model_1 True \
--distill_loss_type "ce" \
--distill_teacher_temperature 1.0 \
--lm_loss_weight 0.9 \
--distill_greater_as_gt True \
--distill_greater_as_gt_type "hard" \
--dataloader_num_workers 10 \
--report_to wandb \
--remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1
03/12/2024 21:08:54 - INFO - utils.common - Training/Evaluation Args: Namespace(model_name_or_path='beomi/llama-2-ko-7b', dataset_name='/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', max_train_samples=None, max_eval_samples=None, max_predict_samples=None, overwrite_cache=False, preprocessing_num_workers=64, output_dir='/home/sionic/sigrid/fusellm-test/models/output', overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<IntervalStrategy.STEPS: 'steps'>, prediction_loss_only=False, per_device_train_batch_size=2, per_device_eval_batch_size=1, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=32, eval_accumulation_steps=None, eval_delay=0, learning_rate=1e-07, weight_decay=0.1, adam_beta1=0.9, adam_beta2=0.95, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.COSINE: 'cosine'>, lr_scheduler_kwargs={}, warmup_ratio=0.008, warmup_steps=0, log_level='passive', log_level_replica='warning', log_on_each_node=True, logging_dir='/home/sionic/sigrid/fusellm-test/models/output/runs/Mar12_21-08-52_iZmj7ir0ircgij46j89st9Z', logging_strategy=<IntervalStrategy.STEPS: 'steps'>, logging_first_step=False, logging_steps=1.0, logging_nan_inf_filter=True, save_strategy=<IntervalStrategy.STEPS: 'steps'>, save_steps=500, save_total_limit=1, save_safetensors=True, save_on_each_node=False, save_only_model=False, no_cuda=False, use_cpu=False, use_mps_device=False, seed=42, data_seed=None, jit_mode_eval=False, use_ipex=False, bf16=True, fp16=False, fp16_opt_level='O1', half_precision_backend='auto', bf16_full_eval=False, fp16_full_eval=False, tf32=True, local_rank=1, ddp_backend=None, tpu_num_cores=None, tpu_metrics_debug=False, debug=[], dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=10, dataloader_prefetch_factor=None, past_index=-1, run_name='/home/sionic/sigrid/fusellm-test/models/output', disable_tqdm=False, remove_unused_columns=False, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, fsdp=[], fsdp_min_num_params=0, fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}, fsdp_transformer_layer_cls_to_wrap=None, accelerator_config=AcceleratorConfig(split_batches=False, dispatch_batches=None, even_batches=True, use_seedable_sampler=True), deepspeed='/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', label_smoothing_factor=0.0, optim='adamw_torch', optim_args=None, adafactor=False, group_by_length=False, length_column_name='length', report_to=['wandb'], ddp_find_unused_parameters=None, ddp_bucket_cap_mb=None, ddp_broadcast_buffers=None, dataloader_pin_memory=True, dataloader_persistent_workers=False, skip_memory_metrics=True, use_legacy_prediction_loop=False, push_to_hub=False, resume_from_checkpoint=None, hub_model_id=None, hub_strategy=<HubStrategy.EVERY_SAVE: 'every_save'>, hub_token=None, hub_private_repo=False, hub_always_push=False, gradient_checkpointing=True, gradient_checkpointing_kwargs=None, include_inputs_for_metrics=False, fp16_backend='auto', push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=None, mp_parameters='', auto_find_batch_size=False, full_determinism=False, torchdynamo=None, ray_scope='last', ddp_timeout=1800, torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, dispatch_batches=None, split_batches=None, include_tokens_per_second=False, 
include_num_input_tokens_seen=False, neftune_noise_alpha=None, sortish_sampler=False, predict_with_generate=False, generation_max_length=None, generation_num_beams=None, generation_config=GenerationConfig {
"do_sample": true,
"max_length": 4096,
"temperature": 0.6,
"top_p": 0.9
}
, training_mode='full', use_flash_attn=False, cache_dir=None, model_max_length=2048, adam8bit=False, double_quant=True, quant_type='nf4', bits=4, lora_r=64, lora_alpha=16, lora_dropout=0.0, max_memory_MB=40000, do_distill=True, distill_with_ref_model=True, distill_with_aligned_model_0=True, distill_with_aligned_model_1=True, distill_loss_type='ce', distill_teacher_temperature=1.0, lm_loss_weight=0.9, distill_greater_as_gt=True, distill_greater_as_gt_type='hard', distill_weighted_as_gt=False, distill_weighted_as_gt_type='hard', distributed_state=Distributed environment: DEEPSPEED Backend: nccl
Num processes: 5
Process index: 1
Local process index: 1
Device: cuda:1
, _n_gpu=1, __cached__setup_devices=device(type='cuda', index=1), deepspeed_plugin=DeepSpeedPlugin(hf_ds_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f6ac88523e0>, gradient_accumulation_steps=32, gradient_clipping=1.0, zero_stage=3, is_train_batch_min=True, offload_optimizer_device='none', offload_param_device='none', offload_optimizer_nvme_path='none', offload_param_nvme_path='none', zero3_init_flag=True, zero3_save_16bit_model=False), hf_deepspeed_config=<transformers.integrations.deepspeed.HfTrainerDeepSpeedConfig object at 0x7f6ac88523e0>)
03/12/2024 21:08:54 - INFO - utils.others - Loading tokenizer.
Traceback (most recent call last):
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
train()
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 40, in train
tokenizer, model = load_tokenizer_and_model(args)
File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/common.py", line 43, in load_tokenizer_and_model
tokenizer, kwargs = get_tokenizer(args.model_name_or_path, args.cache_dir, args.model_max_length)
File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/others.py", line 69, in get_tokenizer
tokenizer = transformers.AutoTokenizer.from_pretrained(
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 825, in from_pretrained
return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2048, in from_pretrained
return cls._from_pretrained(
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2287, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 182, in __init__
self.sp_model = self.get_spm_processor(kwargs.pop("from_slow", False))
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 212, in get_spm_processor
with open(self.vocab_file, "rb") as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType
[The same Training/Evaluation Args dump, "Loading tokenizer." message, and TypeError traceback are printed by the remaining ranks (local ranks 0, 2, 3, and 4); the DeepSpeed launcher then kills all five subprocesses.]
[2024-03-12 21:08:55,923] [ERROR] [launch.py:322:sigkill_handler] ['/home/sionic/.venv/bin/python', '-u', './FuseLLM/FuseLLM/src/train.py', '--local_rank=4', '--training_mode', 'full', '--deepspeed', '/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', '--model_name_or_path', 'beomi/llama-2-ko-7b', '--output_dir', '/home/sionic/sigrid/fusellm-test/models/output', '--model_max_length', '2048', '--logging_steps', '1', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--evaluation_strategy', 'steps', '--per_device_eval_batch_size', '1', '--logging_strategy', 'steps', '--do_train', '--do_eval', '--bf16', 'True', '--tf32', 'True', '--warmup_ratio', '0.008', '--lr_scheduler_type', 'cosine', '--dataset_name', '/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1', '--per_device_train_batch_size', '2', '--gradient_accumulation_steps', '32', '--num_train_epochs', '1', '--eval_steps', '500', '--optim', 'adamw_torch', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--learning_rate', '1e-7', '--weight_decay', '0.1', '--max_grad_norm', '1.0', '--seed', '42', '--gradient_checkpointing', 'True', '--use_flash_attn', 'False', '--do_distill', '--distill_with_ref_model', 'True', '--distill_with_aligned_model_0', 'True', '--distill_with_aligned_model_1', 'True', '--distill_loss_type', 'ce', '--distill_teacher_temperature', '1.0', '--lm_loss_weight', '0.9', '--distill_greater_as_gt', 'True', '--distill_greater_as_gt_type', 'hard', '--dataloader_num_workers', '10', '--report_to', 'wandb', '--remove_unused_columns', 'False'] exits with return code = 1
@18907305772 I changed the command a little, but enabling the distill option still seems to cause a problem.
!export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
# https://blog.csdn.net/weixin_43013480/article/details/135674034
import os
get_ipython().system = os.system
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3,4,5,6,7"
!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "OrionStarAI/Orion-14B-Base" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output_small" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 16 \
--num_train_epochs 1 \
--do_distill True \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-5 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn True \
--report_to wandb 2>&1 > ./240312-patch-small.txt 2>&1 &
0%| | 0/4375 [00:00<?, ?it/s]Traceback (most recent call last):
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
train()
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
train_result = trainer.train()
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
current_batch = next(dataloader_iter)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
base_seq_len = len(features["per_step_logits"][i])
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
return self.data[item]
KeyError: 'per_step_logits'
[The same KeyError: 'per_step_logits' traceback is raised by each of the other ranks, with their outputs interleaved in the log.]
[2024-03-12 21:15:22,632] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244460
[2024-03-12 21:15:22,862] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244461
[2024-03-12 21:15:22,878] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244462
[2024-03-12 21:15:22,878] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244463
[2024-03-12 21:15:22,892] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3244464
[2024-03-12 21:15:22,905] [ERROR] [launch.py:322:sigkill_handler] ['/home/sionic/.venv/bin/python', '-u', './FuseLLM/FuseLLM/src/train.py', '--local_rank=4', '--training_mode', 'full', '--deepspeed', '/home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json', '--model_name_or_path', 'OrionStarAI/Orion-14B-Base', '--output_dir', '/home/sionic/sigrid/fusellm-test/models/output_small', '--model_max_length', '2048', '--logging_steps', '1', '--save_strategy', 'steps', '--save_steps', '500', '--save_total_limit', '1', '--evaluation_strategy', 'steps', '--per_device_eval_batch_size', '1', '--logging_strategy', 'steps', '--do_train', '--do_eval', '--bf16', 'True', '--tf32', 'True', '--warmup_ratio', '0.008', '--lr_scheduler_type', 'cosine', '--dataset_name', '/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch', '--per_device_train_batch_size', '1', '--gradient_accumulation_steps', '16', '--num_train_epochs', '1', '--do_distill', 'True', '--eval_steps', '500', '--optim', 'adamw_torch', '--adam_beta1', '0.9', '--adam_beta2', '0.95', '--learning_rate', '1e-5', '--weight_decay', '0.1', '--max_grad_norm', '1.0', '--seed', '42', '--gradient_checkpointing', 'True', '--use_flash_attn', 'True', '--report_to', 'wandb'] exits with return code = 1
For this error (https://github.com/18907305772/FuseLLM/issues/9#issuecomment-1991623712), you should change the code to use use_fast=True.
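A quick way to confirm the tokenizer side of the fix, independent of the training script (a minimal sketch):
from transformers import AutoTokenizer

# With use_fast=True the tokenizer is built from the fast tokenizer files and does not
# need the slow SentencePiece vocab file that was reported as None in the traceback.
tok = AutoTokenizer.from_pretrained("beomi/llama-2-ko-7b", use_fast=True)
print(type(tok))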
@18907305772
!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "beomi/llama-2-ko-7b" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-7 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn False \
--use_fast=True \
--do_distill \
--distill_with_ref_model True \
--distill_with_aligned_model_0 True \
--distill_with_aligned_model_1 True \
--distill_loss_type "ce" \
--distill_teacher_temperature 1.0 \
--lm_loss_weight 0.9 \
--distill_greater_as_gt True \
--distill_greater_as_gt_type "hard" \
--dataloader_num_workers 10 \
--report_to wandb \
--remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1
Thanks for letting me know. I have manually turned on use_fast by modifying the get_tokenizer method in others.py:
# get tokenizer
def get_tokenizer(model_name_or_path, cache_dir, model_max_length, use_fast):
kwargs = {"use_fast": False, "tokenizer_trust_remote_code": False, "model_trust_remote_code": False}
if "beomi" in model_name_or_path.lower():
kwargs["use_fast"] = True
kwargs["tokenizer_trust_remote_code"] = True
kwargs["model_trust_remote_code"] = True
elif "llama" in model_name_or_path.lower():
kwargs["use_fast"] = False
kwargs["tokenizer_trust_remote_code"] = False
kwargs["model_trust_remote_code"] = False
However, I am still consistently getting a KeyError for per_step_logits, as shown here: https://github.com/18907305772/FuseLLM/issues/9#issuecomment-1991632594
Traceback (most recent call last):
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
train()
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
train_result = trainer.train()
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
current_batch = next(dataloader_iter)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
base_seq_len = len(features["per_step_logits"][i])
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
return self.data[item]
KeyError: 'per_step_logits'
I am getting the same KeyError for per_step_logits when running the script below with both base models, llama2-ko and Orion.
!deepspeed --include localhost:3,4,5,6,7 --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "OrionStarAI/Orion-14B-Base" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240312_dataset_patch" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-7 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn False \
--use_fast=True \
--do_distill \
--distill_with_ref_model True \
--distill_with_aligned_model_0 True \
--distill_with_aligned_model_1 True \
--distill_loss_type "ce" \
--distill_teacher_temperature 1.0 \
--lm_loss_weight 0.9 \
--distill_greater_as_gt True \
--distill_greater_as_gt_type "hard" \
--dataloader_num_workers 10 \
--report_to wandb \
--remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output_orion.log 2>&1
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
base_seq_len = len(features["per_step_logits"][i])
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
return self.data[item]
KeyError: 'per_step_logits'
Traceback (most recent call last):
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
train()
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
train_result = trainer.train()
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1624, in train
return inner_training_loop(
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1928, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "/home/sionic/.venv/lib/python3.10/site-packages/accelerate/data_loader.py", line 452, in __iter__
current_batch = next(dataloader_iter)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
data = self._next_data()
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/_utils.py", line 694, in reraise
raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
return self.collate_fn(data)
File "/home/sionic/sigrid/FuseLLM/FuseLLM/src/utils/data_collator.py", line 236, in __call__
base_seq_len = len(features["per_step_logits"][i])
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 253, in __getitem__
return self.data[item]
KeyError: 'per_step_logits'
Please update the value of the --dataset_name parameter to 240311_dataset_1. This dataset contains per_step_logits, per_step_aligned_logits_0, and per_step_aligned_logits_1.
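Before relaunching, you can quickly confirm those columns are present (a minimal sketch using the Hugging Face datasets API; the path is the one used in the script below, so adjust it to your setup):
from datasets import load_from_disk, DatasetDict

# Sanity check: the FuseLLM data collator indexes into these columns, so training
# will fail with a KeyError if any of them is missing from the saved dataset.
dataset = load_from_disk("/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1")
if isinstance(dataset, DatasetDict):
    dataset = dataset["train"] if "train" in dataset else dataset[next(iter(dataset))]

required = ["per_step_logits", "per_step_aligned_logits_0", "per_step_aligned_logits_1"]
missing = [name for name in required if name not in dataset.column_names]
print("columns:", dataset.column_names)
print("missing distillation columns:", missing or "none")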
Here is the updated script.
!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "beomi/llama-2-ko-7b" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--evaluation_strategy steps \
--per_device_eval_batch_size 1 \
--logging_strategy steps \
--do_train \
--do_eval \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/final/240311_dataset_1" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--num_train_epochs 1 \
--eval_steps 500 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-7 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--gradient_checkpointing True \
--use_flash_attn False \
--do_distill \
--distill_with_ref_model True \
--distill_with_aligned_model_0 True \
--distill_with_aligned_model_1 True \
--distill_loss_type "ce" \
--distill_teacher_temperature 1.0 \
--lm_loss_weight 0.9 \
--distill_greater_as_gt True \
--distill_greater_as_gt_type "hard" \
--dataloader_num_workers 10 \
--report_to wandb \
--remove_unused_columns False > /home/sionic/sigrid/fusellm-test/logs/training_output.log 2>&1
@18907305772 The base model that I have chosen is not llama2-7b but Orion-14B, which is based on the Llama architecture.
I have run your script above with only a small change to the base model, using --model_name_or_path "OrionStarAI/Orion-14B-Base", and got the following:
Removed shared tensor {'model.layers.11.self_attn.k_proj.weight', 'model.layers.18.self_attn.q_proj.weight', 'model.layers.0.self_attn.v_proj.weight', 'model.layers.1.self_attn.q_proj.weight', 'model.layers.9.self_attn.q_proj.weight', 'model.layers.11.self_attn.v_proj.weight', 'model.layers.26.self_attn.q_proj.weight', 'model.layers.26.mlp.gate_proj.weight', 'model.layers.30.self_attn.o_proj.weight', 'model.layers.34.mlp.up_proj.weight', 'model.layers.35.mlp.up_proj.weight', 'model.layers.2.self_attn.k_proj.weight', 'model.layers.20.self_attn.q_proj.weight', 'model.layers.5.self_attn.o_proj.weight', 'model.layers.4.mlp.up_proj.weight', 'model.layers.6.mlp.gate_proj.weight', 'model.layers.24.self_attn.o_proj.weight', 'model.layers.5.self_attn.k_proj.weight', 'model.layers.33.self_attn.v_proj.weight', 'model.layers.29.self_attn.v_proj.weight', 'model.layers.35.mlp.gate_proj.weight', 'model.layers.37.mlp.down_proj.weight', 'model.layers.15.mlp.up_proj.weight', 'model.layers.1.self_attn.v_proj.weight', 'model.layers.39.mlp.down_proj.weight', 'model.layers.3.self_attn.v_proj.weight', 'model.layers.17.self_attn.o_proj.weight', 'model.layers.11.mlp.gate_proj.weight', 'model.layers.0.self_attn.q_proj.weight', 'model.layers.37.self_attn.o_proj.weight', 'model.layers.21.mlp.up_proj.weight', 'model.layers.26.self_attn.k_proj.weight', 'model.layers.36.mlp.up_proj.weight', 'model.layers.38.self_attn.q_proj.weight', 'model.layers.20.self_attn.o_proj.weight', 'model.layers.28.mlp.up_proj.weight', 'model.layers.33.mlp.gate_proj.weight', 'model.layers.16.mlp.gate_proj.weight', 'model.layers.38.self_attn.k_proj.weight', 'model.layers.6.self_attn.k_proj.weight', 'model.layers.32.self_attn.o_proj.weight', 'model.layers.8.self_attn.v_proj.weight', 'model.layers.12.self_attn.k_proj.weight', 'model.layers.26.self_attn.v_proj.weight', 'model.layers.39.mlp.gate_proj.weight', 'model.layers.24.self_attn.k_proj.weight', 'model.layers.32.self_attn.k_proj.weight', 'model.layers.32.self_attn.v_proj.weight', 'model.layers.11.self_attn.q_proj.weight', 'model.layers.13.self_attn.q_proj.weight', 'model.layers.19.self_attn.v_proj.weight', 'model.layers.5.self_attn.q_proj.weight', 'model.layers.38.mlp.down_proj.weight', 'model.layers.21.self_attn.k_proj.weight', 'model.layers.31.self_attn.q_proj.weight', 'model.layers.18.mlp.up_proj.weight', 'model.layers.12.self_attn.o_proj.weight', 'model.layers.26.self_attn.o_proj.weight', 'model.layers.27.mlp.down_proj.weight', 'model.layers.7.self_attn.k_proj.weight', 'model.layers.15.self_attn.o_proj.weight', 'model.layers.2.self_attn.o_proj.weight', 'model.layers.0.mlp.down_proj.weight', 'model.layers.21.self_attn.o_proj.weight', 'model.layers.1.mlp.gate_proj.weight', 'model.layers.7.mlp.up_proj.weight', 'model.layers.27.self_attn.k_proj.weight', 'model.layers.27.mlp.up_proj.weight', 'model.layers.34.mlp.down_proj.weight', 'model.layers.13.self_attn.k_proj.weight', 'model.layers.31.mlp.gate_proj.weight', 'model.layers.29.mlp.down_proj.weight', 'model.layers.30.self_attn.k_proj.weight', 'model.layers.1.self_attn.k_proj.weight', 'model.layers.4.self_attn.q_proj.weight', 'model.layers.36.self_attn.v_proj.weight', 'model.layers.37.mlp.gate_proj.weight', 'model.layers.9.mlp.up_proj.weight', 'model.layers.13.mlp.up_proj.weight', 'model.layers.22.mlp.up_proj.weight', 'model.layers.23.mlp.gate_proj.weight', 'model.layers.23.mlp.down_proj.weight', 'model.layers.4.mlp.down_proj.weight', 'model.layers.15.self_attn.k_proj.weight', 'model.layers.17.mlp.up_proj.weight', 
'model.layers.27.self_attn.o_proj.weight', 'model.layers.28.self_attn.k_proj.weight', 'model.layers.9.mlp.gate_proj.weight', 'model.layers.17.self_attn.k_proj.weight', 'model.layers.9.mlp.down_proj.weight', 'model.layers.2.mlp.up_proj.weight', 'model.layers.4.self_attn.o_proj.weight', 'model.layers.31.mlp.up_proj.weight', 'model.layers.4.mlp.gate_proj.weight', 'model.layers.28.self_attn.v_proj.weight', 'model.layers.3.self_attn.k_proj.weight', 'model.layers.18.mlp.gate_proj.weight', 'model.layers.25.mlp.up_proj.weight', 'model.layers.20.mlp.gate_proj.weight', 'model.layers.14.self_attn.k_proj.weight', 'model.layers.19.self_attn.k_proj.weight', 'model.layers.33.self_attn.k_proj.weight', 'model.layers.5.mlp.down_proj.weight', 'model.layers.30.mlp.down_proj.weight', 'model.layers.34.self_attn.q_proj.weight', 'model.layers.2.self_attn.q_proj.weight', 'model.layers.21.self_attn.q_proj.weight', 'model.layers.7.self_attn.o_proj.weight', 'model.layers.31.self_attn.k_proj.weight', 'model.layers.8.mlp.up_proj.weight', 'model.layers.23.self_attn.q_proj.weight', 'model.layers.23.self_attn.v_proj.weight', 'model.layers.25.self_attn.o_proj.weight', 'model.layers.22.mlp.down_proj.weight', 'model.layers.30.mlp.gate_proj.weight', 'model.layers.21.mlp.down_proj.weight', 'model.layers.6.self_attn.v_proj.weight', 'model.layers.38.mlp.up_proj.weight', 'model.layers.35.self_attn.v_proj.weight', 'model.layers.35.self_attn.k_proj.weight', 'model.layers.8.self_attn.o_proj.weight', 'model.layers.10.self_attn.o_proj.weight', 'model.layers.16.self_attn.o_proj.weight', 'model.layers.8.mlp.down_proj.weight', 'model.layers.35.mlp.down_proj.weight', 'model.layers.17.mlp.down_proj.weight', 'model.layers.34.self_attn.o_proj.weight', 'model.layers.32.self_attn.q_proj.weight', 'model.layers.26.mlp.up_proj.weight', 'model.layers.33.self_attn.o_proj.weight', 'model.layers.39.self_attn.q_proj.weight', 'model.layers.10.mlp.down_proj.weight', 'model.layers.27.mlp.gate_proj.weight', 'model.layers.14.self_attn.o_proj.weight', 'model.layers.15.mlp.down_proj.weight', 'model.layers.5.mlp.gate_proj.weight', 'model.layers.14.mlp.gate_proj.weight', 'model.layers.16.self_attn.q_proj.weight', 'model.layers.7.mlp.down_proj.weight', 'model.layers.26.mlp.down_proj.weight', 'model.layers.33.mlp.up_proj.weight', 'model.layers.14.mlp.up_proj.weight', 'model.layers.1.mlp.down_proj.weight', 'model.layers.20.self_attn.k_proj.weight', 'model.layers.20.mlp.down_proj.weight', 'model.layers.20.self_attn.v_proj.weight', 'model.layers.31.self_attn.v_proj.weight', 'model.layers.34.self_attn.v_proj.weight', 'model.layers.15.self_attn.v_proj.weight', 'model.layers.7.mlp.gate_proj.weight', 'model.layers.19.self_attn.o_proj.weight', 'model.layers.4.self_attn.v_proj.weight', 'model.layers.25.self_attn.k_proj.weight', 'model.layers.30.self_attn.q_proj.weight', 'model.layers.9.self_attn.k_proj.weight', 'model.layers.11.self_attn.o_proj.weight', 'model.layers.7.self_attn.v_proj.weight', 'model.layers.19.mlp.gate_proj.weight', 'model.layers.2.mlp.down_proj.weight', 'model.layers.7.self_attn.q_proj.weight', 'model.layers.8.self_attn.k_proj.weight', 'model.layers.30.self_attn.v_proj.weight', 'model.layers.14.self_attn.q_proj.weight', 'model.layers.24.mlp.up_proj.weight', 'model.layers.31.mlp.down_proj.weight', 'model.layers.28.self_attn.q_proj.weight', 'model.layers.38.self_attn.v_proj.weight', 'model.layers.36.self_attn.o_proj.weight', 'model.layers.21.mlp.gate_proj.weight', 'model.layers.32.mlp.gate_proj.weight', 'model.layers.25.self_attn.q_proj.weight', 
'model.layers.17.mlp.gate_proj.weight', 'model.layers.27.self_attn.q_proj.weight', 'model.layers.6.mlp.down_proj.weight', 'model.layers.38.mlp.gate_proj.weight', 'model.layers.23.self_attn.k_proj.weight', 'model.layers.13.self_attn.o_proj.weight', 'model.layers.22.mlp.gate_proj.weight', 'model.layers.13.mlp.down_proj.weight', 'model.layers.29.mlp.up_proj.weight', 'model.layers.27.self_attn.v_proj.weight', 'model.layers.18.self_attn.v_proj.weight', 'model.layers.12.mlp.gate_proj.weight', 'model.layers.39.mlp.up_proj.weight', 'model.layers.10.self_attn.q_proj.weight', 'model.layers.14.self_attn.v_proj.weight', 'model.layers.39.self_attn.k_proj.weight', 'model.layers.2.mlp.gate_proj.weight', 'model.layers.4.self_attn.k_proj.weight', 'model.layers.19.mlp.up_proj.weight', 'model.layers.16.self_attn.v_proj.weight', 'model.layers.16.self_attn.k_proj.weight', 'model.layers.12.mlp.up_proj.weight', 'model.layers.24.self_attn.q_proj.weight', 'model.layers.1.self_attn.o_proj.weight', 'model.layers.32.mlp.up_proj.weight', 'model.layers.32.mlp.down_proj.weight', 'model.layers.18.self_attn.o_proj.weight', 'model.layers.25.mlp.down_proj.weight', 'model.layers.39.self_attn.o_proj.weight', 'model.layers.12.self_attn.v_proj.weight', 'model.layers.29.self_attn.q_proj.weight', 'model.layers.38.self_attn.o_proj.weight', 'model.layers.17.self_attn.v_proj.weight', 'model.layers.17.self_attn.q_proj.weight', 'model.layers.3.mlp.down_proj.weight', 'model.layers.18.mlp.down_proj.weight', 'model.layers.0.mlp.gate_proj.weight', 'model.layers.1.mlp.up_proj.weight', 'model.layers.34.mlp.gate_proj.weight', 'model.layers.23.self_attn.o_proj.weight', 'model.layers.8.mlp.gate_proj.weight', 'model.layers.22.self_attn.k_proj.weight', 'model.layers.10.mlp.gate_proj.weight', 'model.layers.19.mlp.down_proj.weight', 'model.layers.24.mlp.gate_proj.weight', 'model.layers.33.self_attn.q_proj.weight', 'model.layers.2.self_attn.v_proj.weight', 'model.layers.29.self_attn.o_proj.weight', 'model.layers.37.self_attn.k_proj.weight', 'model.layers.11.mlp.up_proj.weight', 'model.layers.29.self_attn.k_proj.weight', 'model.layers.35.self_attn.q_proj.weight', 'model.layers.10.self_attn.v_proj.weight', 'model.layers.8.self_attn.q_proj.weight', 'model.layers.9.self_attn.v_proj.weight', 'model.layers.13.self_attn.v_proj.weight', 'model.layers.23.mlp.up_proj.weight', 'model.layers.39.self_attn.v_proj.weight', 'model.layers.22.self_attn.v_proj.weight', 'model.layers.34.self_attn.k_proj.weight', 'model.layers.3.self_attn.q_proj.weight', 'model.layers.16.mlp.up_proj.weight', 'model.layers.0.mlp.up_proj.weight', 'model.layers.25.mlp.gate_proj.weight', 'model.layers.6.self_attn.o_proj.weight', 'model.layers.20.mlp.up_proj.weight', 'model.layers.3.mlp.up_proj.weight', 'model.layers.10.mlp.up_proj.weight', 'model.layers.28.self_attn.o_proj.weight', 'model.layers.29.mlp.gate_proj.weight', 'model.layers.10.self_attn.k_proj.weight', 'model.layers.5.self_attn.v_proj.weight', 'model.layers.3.mlp.gate_proj.weight', 'model.layers.6.self_attn.q_proj.weight', 'model.layers.25.self_attn.v_proj.weight', 'model.layers.36.self_attn.q_proj.weight', 'model.layers.6.mlp.up_proj.weight', 'model.layers.36.mlp.gate_proj.weight', 'model.layers.36.mlp.down_proj.weight', 'model.layers.37.self_attn.v_proj.weight', 'model.layers.33.mlp.down_proj.weight', 'model.layers.37.mlp.up_proj.weight', 'model.layers.35.self_attn.o_proj.weight', 'model.layers.5.mlp.up_proj.weight', 'model.layers.15.mlp.gate_proj.weight', 'model.layers.37.self_attn.q_proj.weight', 
'model.layers.19.self_attn.q_proj.weight', 'model.layers.0.self_attn.o_proj.weight', 'model.layers.24.self_attn.v_proj.weight', 'model.layers.28.mlp.gate_proj.weight', 'model.layers.21.self_attn.v_proj.weight', 'model.layers.9.self_attn.o_proj.weight', 'model.layers.11.mlp.down_proj.weight', 'model.layers.30.mlp.up_proj.weight', 'model.layers.3.self_attn.o_proj.weight', 'model.layers.12.self_attn.q_proj.weight', 'model.layers.16.mlp.down_proj.weight', 'model.layers.18.self_attn.k_proj.weight', 'model.layers.31.self_attn.o_proj.weight', 'model.layers.22.self_attn.o_proj.weight', 'model.layers.22.self_attn.q_proj.weight', 'model.layers.0.self_attn.k_proj.weight', 'model.layers.14.mlp.down_proj.weight', 'model.layers.24.mlp.down_proj.weight', 'model.layers.28.mlp.down_proj.weight', 'model.layers.36.self_attn.k_proj.weight', 'model.layers.15.self_attn.q_proj.weight', 'model.layers.13.mlp.gate_proj.weight', 'model.layers.12.mlp.down_proj.weight'} while saving. This should be OK, but check by verifying that you don't receive any warning while reloading
{'loss': 9.7239, 'grad_norm': 92.50677642180415, 'learning_rate': 0.0, 'epoch': 1.0}
{'train_runtime': 15.4474, 'train_samples_per_second': 1.877, 'train_steps_per_second': 0.065, 'train_loss': 9.723894119262695, 'epoch': 1.0}
***** train metrics *****
epoch = 1.0
train_loss = 9.7239
train_runtime = 0:00:15.44
train_samples_per_second = 1.877
train_steps_per_second = 0.065
but the training run is too short for me to tell whether it is working correctly or not.
I found the same issue in other projects. Maybe this will help: https://github.com/huggingface/transformers/issues/27293#issuecomment-1815681831
@18907305772 okay, I found that even the Orion base model trains fine when I follow your suggestion (using the tokenize_and_patch_dataset Python script, i.e. without blending).
It gives no NaN issues for me, so I suspect there is some incompatibility between FuseLLM and Orion, but I don't know the root cause of the problem.
Could you check the Jupyter notebook below to help debug this?
https://drive.google.com/file/d/1ROj4F_FWsdaF6QGlEI2arMnBJ5P2xtWE/view?usp=sharing
Thanks for your help!
I suspect that there is something wrong with the transformers version and the safe_serialization parameter. You can use transformers==4.35.1 and safe_serialization=False, following https://github.com/18907305772/FuseLLM/issues/9#issuecomment-1993546418.
It seems that this issue may affect saving the model, but it should not affect the training process.
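For reference, if you re-save a checkpoint manually, the equivalent of safe_serialization=False looks roughly like this (a minimal sketch; the output_bin directory name is hypothetical, and whether FuseLLM's --safe_serialization flag maps onto exactly this call is an assumption):
from transformers import AutoModelForCausalLM

# Reload the checkpoint and write it back without safetensors, which avoids the
# "Removed shared tensor ... while saving" behavior for tied/shared weights.
model = AutoModelForCausalLM.from_pretrained(
    "/home/sionic/sigrid/fusellm-test/models/output",
    trust_remote_code=True,
)
model.save_pretrained(
    "/home/sionic/sigrid/fusellm-test/models/output_bin",
    safe_serialization=False,  # write pytorch_model.bin instead of model.safetensors
)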
@18907305772 okay, I am retrying the whole process from scratch, but got this issue:
!deepspeed --master_port=20001 ./FuseLLM/FuseLLM/src/train.py \
--training_mode full \
--deepspeed /home/sionic/sigrid/FuseLLM/FuseLLM/config/zero_stage2_config.json \
--model_name_or_path "OrionStarAI/Orion-14B-Base" \
--output_dir "/home/sionic/sigrid/fusellm-test/models/output" \
--model_max_length 2048 \
--logging_steps 1 \
--save_strategy steps \
--save_steps 500 \
--save_total_limit 1 \
--logging_strategy steps \
--do_train \
--bf16 True \
--tf32 True \
--warmup_ratio 0.008 \
--lr_scheduler_type cosine \
--dataset_name "/home/sionic/sigrid/fusellm-test/datasets/packing/240313_packing_set" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 32 \
--num_train_epochs 1 \
--optim adamw_torch \
--adam_beta1 0.9 \
--adam_beta2 0.95 \
--learning_rate 1e-7 \
--weight_decay 0.1 \
--max_grad_norm 1.0 \
--seed 42 \
--use_flash_attn True \
--lm_loss_weight 0.9 \
--distill_greater_as_gt True \
--distill_greater_as_gt_type "hard" \
--dataloader_num_workers 10 \
--report_to wandb \
--gradient_checkpointing False \
--remove_unused_columns False \
--safe_serialization False
Is there something I have missed during the training process?
[2024-03-13 17:54:42,122] [INFO] [launch.py:316:sigkill_handler] Killing subprocess 3362901
Traceback (most recent call last):
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 136, in <module>
train()
File "/home/sionic/sigrid/./FuseLLM/FuseLLM/src/train.py", line 105, in train
train_result = trainer.train()
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1555, in train
return inner_training_loop(
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 1860, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2725, in training_step
loss = self.compute_loss(model, inputs)
File "/home/sionic/.venv/lib/python3.10/site-packages/transformers/trainer.py", line 2748, in compute_loss
outputs = model(**inputs)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
loss = self.module(*inputs, **kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/sionic/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
TypeError: OrionForCausalLM.forward() got an unexpected keyword argument 'per_step_logits'
@18907305772 Changing the base model to llama2-7b produces the same error: TypeError: LlamaForCausalLM.forward() got an unexpected keyword argument 'per_step_logits'
This issue appears intermittently, which makes it VERY difficult to debug.
The following is the dataset_info.json that was generated after running packing.py:
...
{
"citation": "",
"description": "",
"features": {
"input_ids": {
"feature": {
"dtype": "int32",
"_type": "Value"
},
"_type": "Sequence"
},
"attention_mask": {
"feature": {
"dtype": "int8",
"_type": "Value"
},
"_type": "Sequence"
},
"labels": {
"feature": {
"dtype": "int64",
"_type": "Value"
},
"_type": "Sequence"
},
"per_step_logits": {
"feature": {
"feature": {
"dtype": "float16",
"_type": "Value"
},
"_type": "Sequence"
},
"_type": "Sequence"
},
"per_step_indices": {
"feature": {
"feature": {
"dtype": "int64",
"_type": "Value"
},
"_type": "Sequence"
},
"_type": "Sequence"
},
"metric_ce": {
"dtype": "float64",
"_type": "Value"
},
"per_step_aligned_logits_0": {
"feature": {
"feature": {
"dtype": "float64",
"_type": "Value"
},
"_type": "Sequence"
},
"_type": "Sequence"
},
"per_step_aligned_indices_0": {
"feature": {
"feature": {
"dtype": "int64",
"_type": "Value"
},
"_type": "Sequence"
},
"_type": "Sequence"
},
"metric_ce_aligned_0": {
"dtype": "float64",
"_type": "Value"
}
},
"homepage": "",
"license": ""
}
If you use a dataset with generated representations (per_step_logits), you need to set '--do_distill True' as in the script in the README (this will use DistillTrainer and start FuseLLM training). If you use a tokenized dataset without generated representations (per_step_logits), you need to set '--do_distill False' to start causal language model training (this will use the original Trainer).
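This also explains the TypeError above: with --remove_unused_columns False but without --do_distill, the stock Trainer keeps the distillation columns in the batch and ultimately calls model(**inputs), so any key the model's forward() does not accept raises immediately. A toy illustration (not FuseLLM code):
def forward(input_ids=None, attention_mask=None, labels=None):
    # Stand-in for a causal LM forward() that only knows the standard arguments.
    return "ok"

batch = {
    "input_ids": [1, 2, 3],
    "attention_mask": [1, 1, 1],
    "labels": [1, 2, 3],
    "per_step_logits": [[0.1, 0.2]],  # extra distillation column left in the batch
}

try:
    forward(**batch)
except TypeError as err:
    print(err)  # forward() got an unexpected keyword argument 'per_step_logits'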
Dear FuseLLM author,
I am currently attempting to use FuseLLM to fine-tune Korean models, configuring OrionStarAI/Orion-14B-Base as the base model and beomi/OPEN-SOLAR-KO-10.7B and beomi/Yi-Ko-6B as the blending models, using DeepSpeed. However, I am encountering NaN (Not a Number) values for grad_norm and loss during the training process. I suspect that the issue might be related to changing the base model to OrionForCausalLM. I would greatly appreciate your help in resolving this problem.
Problem Description:
When I initiate the training process using DeepSpeed with the OrionForCausalLM model and flash attention turned on, grad_norm is nan from the very beginning.
Even when I turn flash attention off, the first batch completes without nan values, but from the second batch onward I encounter the same nan grad_norm issue, as shown below. As you can see, the grad_norm and loss values become NaN early in the training process. I have tried reducing the learning rate, but the results remain similar. This leads me to suspect that there might be an issue with the dataset or with compatibility between FuseLLM and the OrionForCausalLM model.
Attempted Solutions:
I have attempted the following steps to address the issue:
Request for Assistance:
I would greatly appreciate your guidance on the following aspects:
I would be grateful for any insights or advice you can offer to help me resolve this issue. I am keen on successfully fine-tuning the OrionForCausalLM model using FuseLLM and would appreciate your expertise in overcoming this obstacle.
Here is the link to the Jupyter notebook that I have used on A100 x8: https://drive.google.com/file/d/1woAJvmJNhjF_abtZDOvo8MXXP54KVScr/view?usp=sharing
Thank you in advance for your time and assistance.