O-per / cakd3_Project3


ET5 - Training error while running finetune-t5-ynat.py #29

Closed: seuly1203 closed this issue 2 years ago

seuly1203 commented 2 years ago

Code run in the cell:

!CUDA_VISIBLE_DEVICES=0 python seq2seq_finetune_t5_ynat.py \
--do_train --do_eval --predict_with_generate \
--model_name_or_path /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5 \
--data_dir /content/drive/MyDrive/ET5_test/ynat-v1.1 \
--output_dir /content/drive/MyDrive/ET5_test/output \
--overwrite_output_dir \
--save_steps 100000 \
--per_device_train_batch_size 16 \
--gradient_accumulation_steps 1 \
--num_train_epochs 1.0

Error message:

12/01/2021 05:14:34 - WARNING - __main__ -   Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
12/01/2021 05:14:34 - INFO - __main__ -   Training/evaluation parameters Seq2SeqTrainingArguments(output_dir='/content/drive/MyDrive/ET5_test/output', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=<EvaluationStrategy.NO: 'no'>, prediction_loss_only=False, per_device_train_batch_size=16, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=<SchedulerType.LINEAR: 'linear'>, warmup_steps=0, logging_dir='runs/Dec01_05-14-34_eb535e39a1b5', logging_first_step=False, logging_steps=500, save_steps=100000, save_total_limit=None, no_cuda=False, seed=42, fp16=False, fp16_opt_level='O1', fp16_backend='auto', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=500, dataloader_num_workers=0, past_index=-1, run_name='/content/drive/MyDrive/ET5_test/output', disable_tqdm=False, remove_unused_columns=True, label_names=None, load_best_model_at_end=False, metric_for_best_model=None, greater_is_better=None, ignore_data_skip=False, sharded_ddp=False, deepspeed=None, label_smoothing_factor=0.0, adafactor=False, group_by_length=False, report_to=['tensorboard'], ddp_find_unused_parameters=None, dataloader_pin_memory=True, label_smoothing=0.0, sortish_sampler=False, predict_with_generate=True, encoder_layerdrop=None, decoder_layerdrop=None, dropout=None, attention_dropout=None, lr_scheduler='linear')
[INFO|configuration_utils.py:447] 2021-12-01 05:14:34,267 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:34,267 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|configuration_utils.py:447] 2021-12-01 05:14:34,269 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:34,269 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|tokenization_utils_base.py:1688] 2021-12-01 05:14:34,269 >> Model name '/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5' not found in model shortcut name list (t5-small, t5-base, t5-large, t5-3b, t5-11b). Assuming '/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5' is a path, a model identifier, or url to a directory containing tokenizer files.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,271 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/tokenizer.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,271 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/added_tokens.json. We won't load it.
[INFO|tokenization_utils_base.py:1721] 2021-12-01 05:14:34,272 >> Didn't find file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/special_tokens_map.json. We won't load it.
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/spiece.model
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file None
[INFO|tokenization_utils_base.py:1784] 2021-12-01 05:14:34,273 >> loading file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/tokenizer_config.json
[INFO|modeling_utils.py:1025] 2021-12-01 05:14:34,456 >> loading weights file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-12-01 05:14:42,207 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-12-01 05:14:42,207 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
#####    Reading an input file ...   /content/drive/MyDrive/ET5_test/ynat-v1.1/train.json
#####    Create examples ... : 45678it [00:00, 666217.22it/s]
#####    Get source and target texts ... : 100% 45677/45677 [00:00<00:00, 1432044.61it/s]
#####    Reading an input file ...   /content/drive/MyDrive/ET5_test/ynat-v1.1/val.json
#####    Create examples ... : 9107it [00:00, 694942.72it/s]
#####    Get source and target texts ... : 100% 9106/9106 [00:00<00:00, 1586826.72it/s]
12/01/2021 05:14:48 - INFO - __main__ -   *** Train ***
/usr/local/lib/python3.7/dist-packages/transformers/trainer.py:705: FutureWarning: `model_path` is deprecated and will be removed in a future version. Use `resume_from_checkpoint` instead.
  FutureWarning,
[INFO|trainer.py:724] 2021-12-01 05:14:48,220 >> Loading model from /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5).
[INFO|configuration_utils.py:447] 2021-12-01 05:14:48,222 >> loading configuration file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/config.json
[INFO|configuration_utils.py:485] 2021-12-01 05:14:48,222 >> Model config T5Config {
  "architectures": [
    "T5ForConditionalGeneration"
  ],
  "d_ff": 3072,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 0,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.3.2",
  "use_cache": true,
  "vocab_size": 45100
}

[INFO|modeling_utils.py:1025] 2021-12-01 05:14:48,224 >> loading weights file /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5/pytorch_model.bin
[INFO|modeling_utils.py:1143] 2021-12-01 05:14:55,663 >> All model checkpoint weights were used when initializing T5ForConditionalGeneration.

[INFO|modeling_utils.py:1152] 2021-12-01 05:14:55,663 >> All the weights of T5ForConditionalGeneration were initialized from the model checkpoint at /content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use T5ForConditionalGeneration for predictions without further training.
[INFO|trainer.py:837] 2021-12-01 05:14:56,744 >> ***** Running training *****
[INFO|trainer.py:838] 2021-12-01 05:14:56,744 >>   Num examples = 45676
[INFO|trainer.py:839] 2021-12-01 05:14:56,744 >>   Num Epochs = 1
[INFO|trainer.py:840] 2021-12-01 05:14:56,744 >>   Instantaneous batch size per device = 16
[INFO|trainer.py:841] 2021-12-01 05:14:56,744 >>   Total train batch size (w. parallel, distributed & accumulation) = 16
[INFO|trainer.py:842] 2021-12-01 05:14:56,744 >>   Gradient Accumulation steps = 1
[INFO|trainer.py:843] 2021-12-01 05:14:56,744 >>   Total optimization steps = 2855
  0% 0/2855 [00:00<?, ?it/s]Traceback (most recent call last):
  File "seq2seq_finetune_t5_ynat.py", line 379, in <module>
    main()
  File "seq2seq_finetune_t5_ynat.py", line 316, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 940, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 1320, in training_step
    loss.backward()
  File "/usr/local/lib/python3.7/dist-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/usr/local/lib/python3.7/dist-packages/torch/autograd/__init__.py", line 147, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`
  0% 0/2855 [00:00<?, ?it/s]
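
A general CUDA debugging step (not part of the original report): because CUDA kernels launch asynchronously, the cuBLAS call named in the traceback is often not the op that actually failed. Rerunning with synchronous launches usually makes the traceback point at the real failing operation:

!CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=1 python seq2seq_finetune_t5_ynat.py \
--do_train --do_eval --predict_with_generate \
... (remaining arguments identical to the cell above)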
seuly1203 commented 2 years ago

A similar issue: https://github.com/allenai/allennlp/issues/5064#issue-836334672

seuly1203 commented 2 years ago

Why this CUDA error occurs: either the CUDA version does not match the versions of the libraries in use (e.g. the installed PyTorch build does not support the GPU), or the input data is malformed.
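
A minimal diagnostic sketch for those two causes. The model path and vocab_size=45100 come from the log above; the sample sentence is made up for illustration, and the checks themselves are generic, not code from this project:

import torch
from transformers import T5Tokenizer

# Cause 1: version mismatch. The installed PyTorch wheel must have been
# compiled for this GPU's compute capability.
print(torch.__version__, torch.version.cuda)
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))  # e.g. (8, 6) for an RTX 30-series GPU
    print(torch.cuda.get_arch_list())           # architectures this wheel supports

# Cause 2: malformed input. No token id may reach the embedding table's
# bound; out-of-range ids can surface as cuBLAS/CUDA errors on GPU.
tokenizer = T5Tokenizer.from_pretrained(
    "/content/drive/MyDrive/cakd3_3차프로젝트_2조/Datasets/ETRI_ET5")
ids = tokenizer("뉴스 기사 제목 예시 문장").input_ids  # hypothetical sample input
assert max(ids) < 45100, f"token id {max(ids)} out of range for vocab_size=45100"

If the printed compute capability is missing from the arch list, reinstalling a PyTorch build whose CUDA version matches the GPU is the usual fix; if the assertion fails, the tokenizer and the checkpoint disagree on vocabulary size.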