Information
I am trying to train a language adapter with the XLM-RoBERTa model on the OSCAR corpus provided by HuggingFace.
The problem arises when using:
the official example script: /examples/language-modeling/run_mlm.py
I want to perform masked language modeling on multilingual BERT and RoBERTa models using the script provided in the examples. The issue occurs when the script tries to concatenate the already tokenized texts.
To reproduce
Run the script with the command (using --dataset_name) provided in the examples/language-modeling/ section.
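For reference, an invocation along these lines triggers the error; the model, dataset config, and output path here are placeholders for illustration, not the exact original command:

```bash
python run_mlm.py \
    --model_name_or_path xlm-roberta-base \
    --dataset_name oscar \
    --dataset_config_name unshuffled_deduplicated_als \
    --do_train \
    --output_dir /tmp/oscar-mlm
```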
The problem arises in the group_texts method, which takes the tokenized text as input, as can be seen below (line 313 of the script).
concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
TypeError: can only concatenate list (not "int") to list
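The failure is easy to reproduce in isolation: when the batch still contains a numerical column such as id, sum(examples[k], []) ends up adding an int to a list. A minimal sketch (the column names and values are illustrative):

```python
# Batch as returned by Datasets when a numerical "id" column is kept
# alongside the tokenized text (names/values are illustrative).
examples = {
    "input_ids": [[0, 133, 564, 2], [0, 871, 2]],  # lists of ints concatenate fine
    "id": [0, 1],                                   # plain ints do not
}

# Works for list-valued columns ...
concatenated = sum(examples["input_ids"], [])

# ... but raises: TypeError: can only concatenate list (not "int") to list
concatenated = sum(examples["id"], [])
```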
Expected behavior
Maybe the texts should be concatenated first and then tokenized, or there is a better way to concatenate the texts before generating max_length chunks.
Moreover, it would be nice to have a working example of how to train a language adapter using the Datasets library.
The issue was that some datasets from the Datasets library have a numerical column (id in my case) in addition to the text column. To fix this, remove_columns should be used.
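A sketch of that workaround, assuming the usual run_mlm.py-style tokenization map (the tokenize_function name and the OSCAR config are illustrative, not the exact original setup):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("oscar", "unshuffled_deduplicated_als")
tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# Dropping all original columns (including the numerical "id") leaves only
# the list-valued tokenizer outputs, so group_texts can concatenate them.
tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=raw_datasets["train"].column_names,
)
```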
Environment info
transformers version: 1.1.1
Using distributed or parallel set-up in script?: parallel