Information
I am trying to train a language adapter with the XLM-RoBERTa model on the OSCAR corpus provided by HuggingFace.
The problem arises when using:
the official example script: /examples/language-modeling/run_mlm.py
I want to perform masked language modeling on multilingual BERT and RoBERTa models using the script provided in the examples. The issue occurs when the script tries to concatenate the already tokenized texts.
To reproduce
Run the script with the command (using --dataset_name) provided in the examples/language-modeling/ section.
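For reference, an invocation along these lines triggers the error; the model, dataset config, and output path here are placeholders for illustration, not the exact original command:

```bash
python run_mlm.py \
    --model_name_or_path xlm-roberta-base \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_als \
    --do_train \
    --output_dir /tmp/oscar-mlm
```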
The problem arises in the group_texts method, which takes the tokenized text as input, as can be seen below (line 313 of the script).
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
TypeError: can only concatenate list (not "int") to list
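The failure is easy to reproduce in isolation: when the batch still contains a numerical column such as id, sum(examples[k], []) ends up adding an int to a list. A minimal sketch (the column names and values are illustrative):

```python
# Batch as returned by Datasets when a numerical "id" column is kept
# alongside the tokenized text (names/values are illustrative).
examples = {
    "input_ids": [[0, 133, 564, 2], [0, 871, 2]],  # lists of ints concatenate fine
    "id": [0, 1],                                   # plain ints do not
}

# Works for list-valued columns ...
concatenated = sum(examples["input_ids"], [])

# ... but raises: TypeError: can only concatenate list (not "int") to list
concatenated = sum(examples["id"], [])
```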
Expected behavior
Maybe the texts should be concatenated first and then tokenized, or there is a better way to concatenate the texts before generating max_length chunks.
Moreover, it would be nice to have a working example of how to train a language adapter using the Datasets library.
The issue was that some datasets from the Datasets library have a numerical column (id in my case) in addition to the text column. To fix this, remove_columns should be used.
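A sketch of that workaround, assuming the usual run_mlm.py-style tokenization map (the tokenize_function name and the OSCAR config are illustrative, not the exact original setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("oscar", "unshuffled_deduplicated_als")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# Dropping all original columns (including the numerical "id") leaves only
# the list-valued tokenizer outputs, so group_texts can concatenate them.
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```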
Environment info
transformers version: 1.1.1
Using distributed or parallel set-up in script?: parallel