UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Problems when training models with NLI #642

Open Chrakimnas6 opened 3 years ago

Chrakimnas6 commented 3 years ago

Hi,

Currently I'm using training_nli.py directly to test different pretrained models from Hugging Face. Some models work fine, but I ran into two problems with XLNet and GPT-2.
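
For context, training_nli.py builds the model roughly as in this sketch (paraphrased from the example script; model_name is whatever Hugging Face checkpoint gets swapped in):

from sentence_transformers import SentenceTransformer, models

model_name = 'xlnet-base-cased'  # or 'gpt2', 'bert-base-uncased', ...

# Wrap the Hugging Face model and add mean pooling over the token embeddings
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])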

First, I used 'xlnet-base-cased', and when I train the model it prints:

/usr/local/lib/python3.6/dist-packages/scipy/stats/stats.py:3508: PearsonRConstantInputWarning: An input array is constant; the correlation coefficent is not defined.
  warnings.warn(PearsonRConstantInputWarning())
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2559: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[:, None]
/usr/local/lib/python3.6/dist-packages/numpy/lib/function_base.py:2560: RuntimeWarning: invalid value encountered in true_divide
  c /= stddev[None, :]

As a result, it only generates 'similarity_evaluation_sts-dev_results.csv' in the output directory, and all the values in the CSV are 0.
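
(The scipy warning fires when every predicted similarity score is the same constant, which makes the Pearson correlation undefined. A quick hypothetical check is to encode a few sentences with the trained model and see whether the embeddings have collapsed; the model path below is just an example output path from training_nli.py:)

from sentence_transformers import SentenceTransformer

# Hypothetical path; substitute your model_save_path from training_nli.py
model = SentenceTransformer('output/training_nli_xlnet-base-cased')

sentences = [
    "A man is eating food.",
    "A man is eating a piece of bread.",
    "The girl is carrying a baby.",
]
embeddings = model.encode(sentences)

# If this is ~0, all sentences map to the same vector, every similarity score
# is constant, and the Pearson correlation in the evaluator is undefined.
print("max std across sentences:", embeddings.std(axis=0).max())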

Second, I also used 'gpt2' and it gives me:

Using pad_token, but it is not set yet.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-45-832e7dbc55be> in <module>()
      5           evaluation_steps=1000,
      6           warmup_steps=warmup_steps,
----> 7           output_path=model_save_path
      8           )

8 frames
/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in _get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
   2094         if padding_strategy != PaddingStrategy.DO_NOT_PAD and (not self.pad_token or self.pad_token_id < 0):
   2095             raise ValueError(
-> 2096                 "Asking to pad but the tokenizer does not have a padding token. "
   2097                 "Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` "
   2098                 "or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`."

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as `pad_token` `(tokenizer.pad_token = tokenizer.eos_token e.g.)` or add a new pad token via `tokenizer.add_special_tokens({'pad_token': '[PAD]'})`.

I'm not really sure how to fix these two problems; I'd really appreciate it if someone could help me out. Thanks.
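
(For reference, a minimal sketch of the workaround the ValueError itself suggests; the attribute names assume sentence-transformers' models.Transformer wrapper, which exposes the underlying Hugging Face tokenizer and model as .tokenizer and .auto_model:)

from sentence_transformers import models

word_embedding_model = models.Transformer('gpt2')

# GPT-2 ships without a pad token; reuse the EOS token so batched padding works,
# as the error message recommends
tokenizer = word_embedding_model.tokenizer
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    word_embedding_model.auto_model.config.pad_token_id = tokenizer.eos_token_id

This silences the padding error, though as noted in the reply below it does not guarantee the model will train well.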

nreimers commented 3 years ago

XLNet and (likely) GPT-2 currently don't work, as they use a different padding strategy that is not yet supported by the batching strategy used here.
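
(A concrete illustration of the padding difference, assuming the standard Hugging Face tokenizers: XLNet pads on the left by default, while BERT-style models pad on the right:)

from transformers import AutoTokenizer

xlnet_tokenizer = AutoTokenizer.from_pretrained('xlnet-base-cased')
print(xlnet_tokenizer.padding_side)  # 'left' -- pad tokens go at the start

bert_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(bert_tokenizer.padding_side)   # 'right' -- pad tokens go at the end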

In the upcoming version 0.4.1, tokenization and padding will change, and it is likely that XLNet will then work (and maybe GPT-2 as well; I have never used GPT-2).

However, I did test with XLNet and it did not produce good results; in all my experiments so far it performed quite badly.

Chrakimnas6 commented 3 years ago

Thank you for your reply!