ThilinaRajapakse / simpletransformers

Transformers for Information Retrieval, Text Classification, NER, QA, Language Modelling, Language Generation, T5, Multi-Modal, and Conversational AI
https://simpletransformers.ai/
Apache License 2.0

TypeError when fine-tuning a LanguageModelingModel with sliding window #1491

Open yaeliseli opened 1 year ago

yaeliseli commented 1 year ago

Describe the bug

I want to fine-tune the XLM-RoBERTa model with masked language modeling, so I used the LanguageModelingModel class to create a model and trained it on my data (with the simple dataset type). As my documents are long, I wanted to use the sliding window option, but then I got this error:

  File ".../mlm_sliding_window.py", line 101, in train
    model.train_model(train_data, eval_file=eval_data, args=args)
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 460, in train_model
    train_dataset = self.load_and_cache_examples(train_file, verbose=verbose)
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 1393, in load_and_cache_examples
    return SimpleDataset(
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_utils.py", line 159, in __init__
    self.examples = [encode_sliding_window(line) for line in lines]
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_utils.py", line 159, in <listcomp>
    self.examples = [encode_sliding_window(line) for line in lines]
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_utils.py", line 44, in encode_sliding_window
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 579, in convert_tokens_to_ids
    ids.append(self._convert_token_to_id_with_added_voc(token))
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 588, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
  File ".../venv/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 297, in _convert_token_to_id
    spm_id = self.sp_model.PieceToId(token)
  File ".../venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File ".../venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1039, in _func
    return func(v, n)
  File ".../venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 274, in PieceToId
    return _sentencepiece.SentencePieceProcessor_PieceToId(self, piece)
TypeError: not a string

Process finished with exit code 1

After some research and debugging, I think the problem is here: https://github.com/ThilinaRajapakse/simpletransformers/blob/master/simpletransformers/language_modeling/language_modeling_utils.py#L37

When encoding the data with the sliding window, the special tokens are set to their ids rather than their string forms:

sep_token = tokenizer.sep_token_id
cls_token = tokenizer.cls_token_id
pad_token = tokenizer.pad_token_id

for tokens in token_sets:
    tokens = [cls_token] + tokens + [sep_token]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    padding_length = max_seq_length - len(input_ids)
    input_ids = input_ids + ([pad_token] * padding_length)
    # ...

convert_tokens_to_ids is then applied to this token list, but some of its entries are ids rather than strings, so the error above is raised. I therefore think that lines 37 and 38 should be replaced with:

sep_token = tokenizer.sep_token
cls_token = tokenizer.cls_token

Using pad_token_id is, in my opinion, correct, because it is only applied after the conversion from tokens to ids.
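
Put together, the corrected snippet would look roughly like this (a sketch, assuming the rest of encode_sliding_window stays unchanged):

sep_token = tokenizer.sep_token      # string form, e.g. "</s>"
cls_token = tokenizer.cls_token      # string form, e.g. "<s>"
pad_token = tokenizer.pad_token_id   # id form is fine here, it is only used after the conversion

for tokens in token_sets:
    tokens = [cls_token] + tokens + [sep_token]

    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    padding_length = max_seq_length - len(input_ids)
    input_ids = input_ids + ([pad_token] * padding_length)
    # ...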

What do you think? I tested it in debug mode, and it seemed to work.
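
For reference, the root cause can be reproduced with the tokenizer alone, outside simpletransformers; a minimal sketch (assuming xlm-roberta-base and the slow XLMRobertaTokenizer, as in the traceback):

from transformers import XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
word_tokens = tokenizer.tokenize("hello world")

# Mixing special-token *ids* (ints) into a list of token *strings* triggers the error
try:
    tokenizer.convert_tokens_to_ids([tokenizer.cls_token_id] + word_tokens + [tokenizer.sep_token_id])
except TypeError as e:
    print("TypeError:", e)  # "not a string", as in the traceback above

# Using the string forms of the special tokens works as expected
print(tokenizer.convert_tokens_to_ids([tokenizer.cls_token] + word_tokens + [tokenizer.sep_token]))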

To Reproduce

Run the following code:

from simpletransformers.config.model_args import LanguageModelingArgs
from simpletransformers.language_modeling import LanguageModelingModel
import torch

cuda_available = torch.cuda.is_available()

# Enable the sliding window so long documents are split into overlapping chunks
model_args = LanguageModelingArgs(
    overwrite_output_dir=True,
    no_cache=True,
    reprocess_input_data=True,
    sliding_window=True,
)
model = LanguageModelingModel(
    "xlmroberta", "xlm-roberta-base", args=model_args, use_cuda=cuda_available
)

train_args = {
    "train_batch_size": 4,
    "n_gpu": torch.cuda.device_count(),
    "reprocess_input_data": True,
    "evaluate_during_training": True,
    "evaluate_during_training_verbose": True,
    "use_multiprocessing_for_evaluation": True,
    "eval_batch_size": 8,
}

# Fails with "TypeError: not a string" when sliding_window=True
model.train_model("train.txt", eval_file="eval.txt", args=train_args)
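
For completeness, train.txt and eval.txt can be any plain-text corpora; a minimal sketch to create placeholder files before running the snippet above (the file contents are dummy text, not from the original report):

# Create small placeholder corpora; plain-text files with one document per line should do
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("This is a long training document that the sliding window should split into chunks.\n" * 100)

with open("eval.txt", "w", encoding="utf-8") as f:
    f.write("This is an evaluation document.\n" * 20)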

Expected behavior

A normal fine-tuning run (as happens without the sliding window), completing without error.


stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.