**Describe the bug**

I want to fine-tune the XLMRoberta model with masked language modeling, so I used the `LanguageModelingModel` class to create a model and trained it on my data (with the simple dataset type). As my documents are large, I wanted to use the sliding-window option, but then I got an error:
```
  File ".../mlm_sliding_window.py", line 101, in train
    model.train_model(train_data, eval_file=eval_data, args=args)
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 460, in train_model
    train_dataset = self.load_and_cache_examples(train_file, verbose=verbose)
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_model.py", line 1393, in load_and_cache_examples
    return SimpleDataset(
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_utils.py", line 159, in __init__
    self.examples = [encode_sliding_window(line) for line in lines]
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_utils.py", line 159, in <listcomp>
    self.examples = [encode_sliding_window(line) for line in lines]
  File ".../venv/lib/python3.10/site-packages/simpletransformers/language_modeling/language_modeling_utils.py", line 44, in encode_sliding_window
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 579, in convert_tokens_to_ids
    ids.append(self._convert_token_to_id_with_added_voc(token))
  File ".../venv/lib/python3.10/site-packages/transformers/tokenization_utils.py", line 588, in _convert_token_to_id_with_added_voc
    return self._convert_token_to_id(token)
  File ".../venv/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 297, in _convert_token_to_id
    spm_id = self.sp_model.PieceToId(token)
  File ".../venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1045, in _batched_func
    return _func(self, arg)
  File ".../venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 1039, in _func
    return func(v, n)
  File ".../venv/lib/python3.10/site-packages/sentencepiece/__init__.py", line 274, in PieceToId
    return _sentencepiece.SentencePieceProcessor_PieceToId(self, piece)
TypeError: not a string

Process finished with exit code 1
```
After some research and debugging, I think the problem is here: https://github.com/ThilinaRajapakse/simpletransformers/blob/master/simpletransformers/language_modeling/language_modeling_utils.py#L37

When encoding the data with the sliding window, the special tokens are inserted by their ids. `convert_tokens_to_ids` is then applied to the resulting token list, some of whose entries are ids rather than strings, so the error above is raised. I therefore think that lines 37 and 38 should be changed so that the special tokens are inserted as token strings instead of ids. The `pad_token` id is, in my opinion, correct, because it is used after the conversion from token to id. What do you think? I tested it in debug mode, and it seemed to work.
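To make the failure mode concrete without downloading XLM-R, here is a minimal sketch using a toy stand-in for the sentencepiece-backed tokenizer. `ToyTokenizer`, its vocabulary, and the before/after lines in the comments are illustrative assumptions, not the library's actual code:

```python
# ToyTokenizer stands in for the real XLM-R tokenizer: like sentencepiece's
# PieceToId, its convert_tokens_to_ids only accepts strings.
class ToyTokenizer:
    def __init__(self):
        self.vocab = {"<s>": 0, "</s>": 2, "hello": 5, "world": 6}
        self.cls_token, self.cls_token_id = "<s>", 0
        self.sep_token, self.sep_token_id = "</s>", 2

    def convert_tokens_to_ids(self, tokens):
        for token in tokens:
            if not isinstance(token, str):
                raise TypeError("not a string")  # mirrors the reported error
        return [self.vocab[t] for t in tokens]


tokenizer = ToyTokenizer()
tokens = ["hello", "world"]

# Suspected buggy pattern: special tokens inserted as *ids*, i.e. roughly
#   sep_token = tokenizer.sep_token_id
#   cls_token = tokenizer.cls_token_id
# The mixed list then fails inside convert_tokens_to_ids.
buggy = [tokenizer.cls_token_id] + tokens + [tokenizer.sep_token_id]
try:
    tokenizer.convert_tokens_to_ids(buggy)
except TypeError as e:
    print(e)  # not a string

# Proposed fix: insert the special tokens as *strings*, i.e. roughly
#   sep_token = tokenizer.sep_token
#   cls_token = tokenizer.cls_token
# so convert_tokens_to_ids receives only strings.
fixed = [tokenizer.cls_token] + tokens + [tokenizer.sep_token]
print(tokenizer.convert_tokens_to_ids(fixed))  # [0, 5, 6, 2]
```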
**To Reproduce**

Run the following code:
**Expected behavior**

A usual fine-tuning run (for example, without the sliding window), without error.
**Desktop**