SawanKumar28 / nile

NILE : Natural Language Inference with Faithful Natural Language Explanations
Apache License 2.0

Error when trying to replicate #3

Closed AnnaSou closed 4 years ago

AnnaSou commented 4 years ago

Hello! I am trying to replicate the code, following all the steps up to bash run_finetune_gpt2m.sh 0 entailment 2. Here is the error I am getting:

Traceback (most recent call last):
  File "finetune_lm.py", line 503, in <module>
    main()
  File "finetune_lm.py", line 462, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "finetune_lm.py", line 128, in train
    outputs = model(inputs, labels=labels)
  File "./envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./envs/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 549, in forward
    inputs_embeds=inputs_embeds)
  File "./envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./envs/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 439, in forward
    inputs_embeds = self.wte(input_ids)
  File "./envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./envs/py36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "./envs/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

Any idea what I might be doing wrong?

Here is the command:

(py36) -bash-4.2$ bash run_finetune_gpt2m.sh 0 entailment 2
CUDA_VISIBLE_DEVICES=0 python finetune_lm.py --cache_dir ./cache --output_dir=./saved_lm/gpt2_m_entailment --per_gpu_train_batch_size 2 --per_gpu_eval_batch_size 2 --model_type=gpt2 --model_name_or_path=gpt2-medium --do_train --block_size 128 --save_steps 6866800 --num_train_epochs 3 --train_data_file=./dataset_snli/entailment/train.tsv --do_eval --eval_data_file=./dataset_snli/entailment/dev.tsv

Thanks!

vibhavagarwal5 commented 4 years ago

Check whether the loss computation of the GPT-2 model uses ignore_index(-1) or ignore_index(-100); newer PyTorch/transformers releases changed this value. That is how I fixed my issue.

SawanKumar28 commented 4 years ago

Thanks @vibhavagarwal5 , that does seem like the issue. @AnnaSou , the code was tested with transformers v2.3.0 where the ignore_index is set to -1. Can you confirm that you are using the same version? If not, you may need to change the ignore_index accordingly (cross_entropy_ignore_index in finetune_lm.py).
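
For reference, a minimal sketch (not the repository's code) of the mismatch being described, assuming the LM loss is built with ignore_index=-1 in transformers v2.3.0 while newer releases use PyTorch's default of -100:

import torch
import torch.nn as nn

# Hedged sketch of the ignore_index mismatch: label positions padded with -1
# are silently skipped when the loss uses ignore_index=-1, but cause an error
# when the loss expects -100 (the newer default).
logits = torch.randn(4, 50259)             # (num_tokens, vocab_size incl. [EXP]/[EOS])
labels = torch.tensor([11, 42, -1, -1])    # padded positions masked with -1

loss_old = nn.CrossEntropyLoss(ignore_index=-1)(logits, labels)    # works: -1 positions are ignored
loss_new = nn.CrossEntropyLoss(ignore_index=-100)(logits, labels)  # raises IndexError: target -1 is out of bounds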

AnnaSou commented 4 years ago

@vibhavagarwal5 @SawanKumar28 thank you all for replies! transformers version:

(py36) -bash-4.2$ pip freeze | grep transformers
transformers==2.3.0

I have installed everything from the requirements.

By default, cross_entropy_ignore_index is -1; I also tried -100 and 0, and nothing works. What else can I check?

Thank you for the help!

SawanKumar28 commented 4 years ago

Which pytorch version are you using?

AnnaSou commented 4 years ago

@SawanKumar28 I think that it was installed from the requirements file.

(py36) -bash-4.2$ python -c "import torch; print(torch.__version__)"
1.6.0

SawanKumar28 commented 4 years ago

I am not able to reproduce the issue. When you make these changes, please delete the cache files named cached_lm_* (in dataset_snli/entailment/ in this case). Could you also share the log from the beginning?

AnnaSou commented 4 years ago

@SawanKumar28 Thank you for the help. I cleaned up everything and started from the beginning, following the instructions on this page. Here is the complete log:

bash run_finetune_gpt2m.sh 0 entailment 2
CUDA_VISIBLE_DEVICES=0 python finetune_lm.py --cache_dir ./cache --output_dir=./saved_lm/gpt2_m_entailment --per_gpu_train_batch_size 2 --per_gpu_eval_batch_size 2 --model_type=gpt2 --model_name_or_path=gpt2-medium --do_train --block_size 128 --save_steps 6866800 --num_train_epochs 3 --train_data_file=./dataset_snli/entailment/train.tsv --do_eval --eval_data_file=./dataset_snli/entailment/dev.tsv
To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
10/26/2020 22:42:47 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
10/26/2020 22:42:47 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json not found in cache or force_download set to True, downloading to /tmp/tmp5u9jqs7m
10/26/2020 22:42:47 - INFO - transformers.file_utils -   copying /tmp/tmp5u9jqs7m to cache at ./cache/98aa65385e18b0efd17acd8bf64dcdf21406bb0c99c801c2d3c9f6bfd1f48f29.250d6dc755ccb17d19c7c1a7677636683aa35f0f6cb5461b3c0587bc091551a0
10/26/2020 22:42:47 - INFO - transformers.file_utils -   creating metadata file for ./cache/98aa65385e18b0efd17acd8bf64dcdf21406bb0c99c801c2d3c9f6bfd1f48f29.250d6dc755ccb17d19c7c1a7677636683aa35f0f6cb5461b3c0587bc091551a0
10/26/2020 22:42:47 - INFO - transformers.file_utils -   removing temp file /tmp/tmp5u9jqs7m
10/26/2020 22:42:47 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json from cache at ./cache/98aa65385e18b0efd17acd8bf64dcdf21406bb0c99c801c2d3c9f6bfd1f48f29.250d6dc755ccb17d19c7c1a7677636683aa35f0f6cb5461b3c0587bc091551a0
10/26/2020 22:42:47 - INFO - transformers.configuration_utils -   Model config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "predict_special_tokens": true,
  "pruned_heads": {},
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torchscript": false,
  "use_bfloat16": false,
  "vocab_size": 50257
}

10/26/2020 22:42:47 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json not found in cache or force_download set to True, downloading to /tmp/tmpnczzt7ar
10/26/2020 22:42:48 - INFO - transformers.file_utils -   copying /tmp/tmpnczzt7ar to cache at ./cache/f20f05d3ae37c4e3cd56764d48e566ea5adeba153dcee6eb82a18822c9c731ec.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
10/26/2020 22:42:48 - INFO - transformers.file_utils -   creating metadata file for ./cache/f20f05d3ae37c4e3cd56764d48e566ea5adeba153dcee6eb82a18822c9c731ec.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
10/26/2020 22:42:48 - INFO - transformers.file_utils -   removing temp file /tmp/tmpnczzt7ar
10/26/2020 22:42:48 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt not found in cache or force_download set to True, downloading to /tmp/tmp4z9_f_2s
10/26/2020 22:42:48 - INFO - transformers.file_utils -   copying /tmp/tmp4z9_f_2s to cache at ./cache/6d882670c55563617571fe0c97df88626fb5033927b40fc18a8acf98dafd4946.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
10/26/2020 22:42:48 - INFO - transformers.file_utils -   creating metadata file for ./cache/6d882670c55563617571fe0c97df88626fb5033927b40fc18a8acf98dafd4946.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
10/26/2020 22:42:48 - INFO - transformers.file_utils -   removing temp file /tmp/tmp4z9_f_2s
10/26/2020 22:42:48 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json from cache at ./cache/f20f05d3ae37c4e3cd56764d48e566ea5adeba153dcee6eb82a18822c9c731ec.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
10/26/2020 22:42:48 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt from cache at ./cache/6d882670c55563617571fe0c97df88626fb5033927b40fc18a8acf98dafd4946.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
#tokens 50257
10/26/2020 22:42:48 - INFO - transformers.tokenization_utils -   Adding [EXP] to the vocabulary
10/26/2020 22:42:48 - INFO - transformers.tokenization_utils -   Adding [EOS] to the vocabulary
#extended tokens 50259
10/26/2020 22:42:48 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin not found in cache or force_download set to True, downloading to /tmp/tmp4k3y65g0
10/26/2020 22:43:52 - INFO - transformers.file_utils -   copying /tmp/tmp4k3y65g0 to cache at ./cache/4b337a4f3b7d3e1518f799e238af607498c02938a3390152aaec7d4dabca5a02.8769029be4f66a5ae1055eefdd1d11621b901d510654266b8681719fff492d6e
10/26/2020 22:43:58 - INFO - transformers.file_utils -   creating metadata file for ./cache/4b337a4f3b7d3e1518f799e238af607498c02938a3390152aaec7d4dabca5a02.8769029be4f66a5ae1055eefdd1d11621b901d510654266b8681719fff492d6e
10/26/2020 22:43:58 - INFO - transformers.file_utils -   removing temp file /tmp/tmp4k3y65g0
10/26/2020 22:43:59 - INFO - transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin from cache at ./cache/4b337a4f3b7d3e1518f799e238af607498c02938a3390152aaec7d4dabca5a02.8769029be4f66a5ae1055eefdd1d11621b901d510654266b8681719fff492d6e
10/26/2020 22:49:44 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=128, cache_dir='./cache', config_name='', data_type='tsv', device=device(type='cpu'), do_eval=True, do_generate=False, do_lower_case=False, do_train=True, eval_all_checkpoints=False, eval_data_file='./dataset_snli/entailment/dev.tsv', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, length=100, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_steps=-1, model_name_or_path='gpt2-medium', model_type='gpt2', n_gpu=0, no_cuda=False, num_train_epochs=3.0, output_dir='./saved_lm/gpt2_m_entailment', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=2, per_gpu_train_batch_size=2, save_steps=6866800, seed=42, server_ip='', server_port='', tokenizer_name='', train_data_file='./dataset_snli/entailment/train.tsv', warmup_steps=0, weight_decay=0.0)
                                                                 input                                             target
pairID                                                                                                                   
3416050480.jpg#4r1e  Premise: A person on a horse jumps over a brok...                 a broken down airplane is outdoors
2267923837.jpg#2r1e  Premise: Children smiling and waving at camera...  The children must be present to see them smili...
3691670743.jpg#0r1e  Premise: A boy is jumping on skateboard in the...  jumping on skateboard is the same as doing tri...
4705552913.jpg#4r1e  Premise: Two blond women are hugging one anoth...  Hugging one another is the same as showing aff...
4804607632.jpg#2r1e  Premise: A few people in a restaurant setting,...  A few pepople in a restaurant setting is simil...
...                                                                ...                                                ...
2267923837.jpg#0r1e  Premise: A group of four kids stand in front o...  The four kids that stand in front of the statu...
3691670743.jpg#2r1e  Premise: a kid doing tricks on a skateboard on...  The kid doing tricks on a skateboard must be s...
539750844.jpg#3r1e   Premise: A dog with a blue collar plays ball o...  In order for the dog to play ball he must be o...
2267923837.jpg#3r1e  Premise: Four dirty and barefooted children. H...  The children are dirty and barefooted so they ...
7979219683.jpg#2r1e  Premise: A man is surfing in a bodysuit in bea...  The man is in a bodysuit and he is surfing on ...

[174944 rows x 2 columns]
Saving features from  ./dataset_snli/entailment/train.tsv  into  ./dataset_snli/entailment/cached_lm_128_train.tsv_annotated
example:  Premise: A person on a horse jumps over a broken down airplane. Hypothesis: A person is outdoors, on a horse. [EXP] a broken down airplane is outdoors
example:  Premise: Children smiling and waving at camera Hypothesis: There are children present [EXP] The children must be present to see them smiling and waving.
example:  Premise: A boy is jumping on skateboard in the middle of a red bridge. Hypothesis: The boy does a skateboarding trick. [EXP] jumping on skateboard is the same as doing trick on skateboard.
example:  Premise: Two blond women are hugging one another. Hypothesis: There are women showing affection. [EXP] Hugging one another is the same as showing affection.
example:  Premise: A few people in a restaurant setting, one of them is drinking orange juice. Hypothesis: The diners are at a restaurant. [EXP] A few pepople in a restaurant setting is similar to saying diners in a restaurant.
Saving  174944  examples
10/26/2020 22:51:35 - INFO - __main__ -   ***** Running training *****
10/26/2020 22:51:35 - INFO - __main__ -     Num examples = 174944
10/26/2020 22:51:35 - INFO - __main__ -     Num Epochs = 3
10/26/2020 22:51:35 - INFO - __main__ -     Instantaneous batch size per GPU = 2
10/26/2020 22:51:35 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 2
10/26/2020 22:51:35 - INFO - __main__ -     Gradient Accumulation steps = 1
10/26/2020 22:51:35 - INFO - __main__ -     Total optimization steps = 262416
Iteration:   0%|                                              | 0/87472 [00:00<?, ?it/s]
Epoch:   0%|                                                      | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "finetune_lm.py", line 503, in <module>
    main()
  File "finetune_lm.py", line 462, in main
    global_step, tr_loss = train(args, train_dataset, model, tokenizer)
  File "finetune_lm.py", line 128, in train
    outputs = model(inputs, labels=labels)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 549, in forward
    inputs_embeds=inputs_embeds)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/transformers/modeling_gpt2.py", line 439, in forward
    inputs_embeds = self.wte(input_ids)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/torch/nn/modules/sparse.py", line 126, in forward
    self.norm_type, self.scale_grad_by_freq, self.sparse)
  File "/fs/clip-scratch/annasout/transformers/nile/py36/lib/python3.6/site-packages/torch/nn/functional.py", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self

SawanKumar28 commented 4 years ago

Compared to the default vocabulary of the gpt2-medium model, the only additional indices used here correspond to the [EXP] token, the [EOS] token, and the ignore index. Regarding the ignore index: for debugging, you could try setting cross_entropy_ignore_index to 0. Don't use that for actual training/testing, though.
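
To see why an out-of-range index produces exactly the traceback above, here is a minimal sketch (assuming the vocabulary size of 50259 reported in the log after adding [EXP] and [EOS]):

import torch
import torch.nn as nn

# nn.Embedding raises "IndexError: index out of range in self" for any id that
# is negative or >= num_embeddings, which is what happens if the ignore index
# (-1) leaks into input_ids.
wte = nn.Embedding(num_embeddings=50259, embedding_dim=8)
ok = wte(torch.tensor([[0, 50257, 50258]]))   # valid ids, including the added [EXP]/[EOS]
bad = wte(torch.tensor([[0, -1, 50258]]))     # raises IndexError: index out of range in self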

AnnaSou commented 4 years ago

@SawanKumar28 I have not changed anything in the code and just ran it as is. So, do you think the problem is with the indices?

SawanKumar28 commented 4 years ago

Yes, it seems to be, although I am not able to reproduce the issue. Perhaps you can share the entire output of pip freeze; I'll check whether I can reproduce it with that.

AnnaSou commented 4 years ago

@SawanKumar28 thank you for the help:

(py36) -bash-4.2$ pip freeze
boto3==1.16.6
botocore==1.19.6
certifi==2020.6.20
chardet==3.0.4
click==7.1.2
dataclasses==0.7
future==0.18.2
idna==2.10
jmespath==0.10.0
joblib==0.17.0
numpy==1.19.2
pandas==1.1.3
protobuf==3.13.0
python-dateutil==2.8.1
pytz==2020.1
regex==2020.10.23
requests==2.24.0
s3transfer==0.3.3
sacremoses==0.0.43
sentencepiece==0.1.94
six==1.15.0
tensorboardX==2.1
torch==1.7.0
tqdm==4.51.0
transformers==2.3.0
typing-extensions==3.7.4.3
urllib3==1.25.11

SawanKumar28 commented 4 years ago

Looks like there is a bug when running on CPU. The line 10/26/2020 22:42:47 - WARNING - __main__ - Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False in your log indicates it is running on CPU. Can you try running on a GPU?

SawanKumar28 commented 4 years ago

I have added a fix for a bug that was overwriting the input tensor when running on CPU, so it should now work on CPU as well.
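
For anyone hitting this later, a hedged illustration of the kind of aliasing that can cause such a bug (not the repository's exact code): on CPU, tensor.to(device) returns the same tensor, so masking the labels in place also overwrites the inputs, which then contain -1 when they reach the embedding layer.

import torch

# Sketch only: on CPU, .to(device) is a no-op that returns the SAME tensor, so
# an in-place edit of `labels` also writes into `inputs`; on GPU, .to("cuda")
# copies and the two tensors stay independent.
batch = torch.tensor([[50257, 11, 50258, 50258]])
device = torch.device("cpu")

inputs = batch.to(device)
labels = batch.to(device)             # aliases `inputs` on CPU
labels[labels == 50258] = -1          # -1 also ends up in `inputs` -> embedding IndexError

labels_safe = batch.clone().to(device)    # cloning first keeps `inputs` intact
labels_safe[labels_safe == 50258] = -1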

AnnaSou commented 4 years ago

Looks like it helped, and I could run it on GPU as well. Thanks for the help!