SawanKumar28 / nile

NILE: Natural Language Inference with Faithful Natural Language Explanations
Apache License 2.0

MNLI dataset issue #4

Closed: AnnaSou closed this issue 2 years ago

AnnaSou commented 3 years ago

Hello, I am trying to run fine-tuning on the MNLI dataset, following these steps:

bash fetch_data.sh  
python prepare_train_test.py --dataset mnli --create_data  --filter_repetitions
bash run_finetune_gpt2m.sh 1 all 1

where run_finetune_gpt2m.sh looks like this:

GPUDEV=$1
DATAROOT=$2
BSZ=$3
cmd="CUDA_VISIBLE_DEVICES="$GPUDEV"  python finetune_lm.py  \
    --cache_dir ./cache \
    --output_dir=./saved_lm/gpt2_m_"$DATAROOT"  \
    --per_gpu_train_batch_size $BSZ \
    --per_gpu_eval_batch_size $BSZ \
    --model_type=gpt2 \
    --model_name_or_path=gpt2-medium \
    --do_train \
    --block_size 128 \
    --save_steps 1000 \
    --num_train_epochs 3 \
    --train_data_file=./dataset_mnli/"$DATAROOT"/train.tsv \
    --do_eval \
    --eval_data_file=./dataset_mnli/"$DATAROOT"/dev.tsv"
echo $cmd
eval $cmd

And I am getting the following error in lm_utils.py:

To use data.metrics please install scikit-learn. See https://scikit-learn.org/stable/index.html
10/28/2020 22:44:24 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
10/28/2020 22:44:24 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-config.json from cache at ./cache/98aa65385e18b0efd17acd8bf64dcdf21406bb0c99c801c2d3c9f6bfd1f48f29.250d6dc755ccb17d19c7c1a7677636683aa35f0f6cb5461b3c0587bc091551a0
10/28/2020 22:44:24 - INFO - transformers.configuration_utils -   Model config {
  "activation_function": "gelu_new",
  "architectures": [
    "GPT2LMHeadModel"
  ],
  "attn_pdrop": 0.1,
  "bos_token_id": 50256,
  "embd_pdrop": 0.1,
  "eos_token_id": 50256,
  "finetuning_task": null,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt2",
  "n_ctx": 1024,
  "n_embd": 1024,
  "n_head": 16,
  "n_layer": 24,
  "n_positions": 1024,
  "n_special": 0,
  "num_labels": 2,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "predict_special_tokens": true,
  "pruned_heads": {},
  "resid_pdrop": 0.1,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "task_specific_params": {
    "text-generation": {
      "do_sample": true,
      "max_length": 50
    }
  },
  "torchscript": false,
  "use_bfloat16": false,
  "vocab_size": 50257
}

10/28/2020 22:44:24 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-vocab.json from cache at ./cache/f20f05d3ae37c4e3cd56764d48e566ea5adeba153dcee6eb82a18822c9c731ec.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
10/28/2020 22:44:24 - INFO - transformers.tokenization_utils -   loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-merges.txt from cache at ./cache/6d882670c55563617571fe0c97df88626fb5033927b40fc18a8acf98dafd4946.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
10/28/2020 22:44:24 - INFO - transformers.tokenization_utils -   Adding [EXP] to the vocabulary
10/28/2020 22:44:24 - INFO - transformers.tokenization_utils -   Adding [EOS] to the vocabulary
10/28/2020 22:44:24 - INFO - transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-medium-pytorch_model.bin from cache at ./cache/4b337a4f3b7d3e1518f799e238af607498c02938a3390152aaec7d4dabca5a02.8769029be4f66a5ae1055eefdd1d11621b901d510654266b8681719fff492d6e
10/28/2020 22:44:39 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=128, cache_dir='./cache', config_name='', data_type='tsv', device=device(type='cpu'), do_eval=True, do_generate=False, do_lower_case=False, do_train=True, eval_all_checkpoints=False, eval_data_file='./dataset_mnli/all/dev.tsv', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, length=100, local_rank=-1, logging_steps=50, max_grad_norm=1.0, max_steps=-1, model_name_or_path='gpt2-medium', model_type='gpt2', n_gpu=0, no_cuda=False, num_train_epochs=3.0, output_dir='./saved_lm/gpt2_m_all', overwrite_cache=False, overwrite_output_dir=False, per_gpu_eval_batch_size=1, per_gpu_train_batch_size=1, save_steps=1000, seed=42, server_ip='', server_port='', tokenizer_name='', train_data_file='./dataset_mnli/all/train.tsv', warmup_steps=0, weight_decay=0.0)
Traceback (most recent call last):
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2895, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 70, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 101, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1675, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1683, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'target'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "finetune_lm.py", line 503, in <module>
    main()
  File "finetune_lm.py", line 457, in main
    block_size=args.block_size, get_annotations=True)
  File "/fs/clip-scratch/annasout/transformers/nile/lm_utils.py", line 48, in __init__
    self.examples = data.apply(create_example, axis=1).to_list()
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py", line 7548, in apply
    return op.get_result()
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py", line 180, in get_result
    return self.apply_standard()
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py", line 271, in apply_standard
    results, res_index = self.apply_series_generator()
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/apply.py", line 300, in apply_series_generator
    results[i] = self.f(v)
  File "/fs/clip-scratch/annasout/transformers/nile/lm_utils.py", line 34, in create_example
    text2 = r['target']
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/series.py", line 882, in __getitem__
    return self._get_value(key)
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/series.py", line 989, in _get_value
    loc = self.index.get_loc(label)
  File "/fs/clip-scratch/annasout/miniconda3/envs/py36/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2897, in get_loc
    raise KeyError(key) from err
KeyError: 'target'
srun: error: materialgpu00: task 0: Exited with exit code 1

Thank you for the help!

SawanKumar28 commented 3 years ago

Explanations are available only in the e-SNLI dataset, not in the MNLI dataset.
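
Concretely, the KeyError: 'target' above is lm_utils.py looking for an explanation column named target in the prepared TSV (see text2 = r['target'] in the traceback); the e-SNLI preparation writes that column, while the prepared MNLI TSV does not have it. A quick sanity check, as a sketch assuming pandas and the ./dataset_mnli/all/train.tsv path shown in the log:

import pandas as pd

# Inspect the prepared TSV that finetune_lm.py was pointed at in the log above.
df = pd.read_csv("./dataset_mnli/all/train.tsv", sep="\t")
print(df.columns.tolist())

# lm_utils.py builds each training example from r['target'], so a missing
# 'target' column makes the apply() call raise the KeyError shown above.
print("has explanations:", "target" in df.columns)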

AnnaSou commented 3 years ago

Thank you for the reply! How would we run fine-tuning on the MNLI dataset (without explanations)? Thanks!

SawanKumar28 commented 3 years ago

We don't train explanation generators on MNLI in the paper. To train on MNLI, you will need target explanations to learn from.
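
If you do want to fine-tune on MNLI anyway, the prepared TSV would need an explanation column for lm_utils.py to read. A purely illustrative sketch follows: the 'target' column name comes from the traceback above, while the explanation strings are placeholders you would have to supply from your own annotation or generation step.

import pandas as pd

# Illustrative only: NILE does not ship MNLI explanations, so the strings below
# are placeholders standing in for explanations you would collect yourself.
df = pd.read_csv("./dataset_mnli/all/train.tsv", sep="\t")

# One explanation per row; replace with real human-written or generated text.
df["target"] = ["<explanation for row %d>" % i for i in range(len(df))]

# Write a new file and point --train_data_file at it in run_finetune_gpt2m.sh.
df.to_csv("./dataset_mnli/all/train_with_explanations.tsv", sep="\t", index=False)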