adapter-hub / adapters

A Unified Library for Parameter-Efficient and Modular Transfer Learning
https://docs.adapterhub.ml
Apache License 2.0

Provided language modeling script run_mlm.py does not work with HuggingFace Datasets library #152

Closed sstojanoska closed 3 years ago

sstojanoska commented 3 years ago

Environment info

Information

I am trying to train a language adapter using XLM-RoBERTa model on Oscar corpus provided by the HuggingFace.

The problem arises when using:

I want to perform masked language modeling on BERT and RoBERTa multilingual models using the provided script in the examples. The issue happens when the script tries to concatenate already tokenized texts.

To reproduce

  1. Run the script with the command (using --dataset_name ) provided in the examples/language-modeling/ section.
  2. The problem arises in the group_texts method, which takes already-tokenized text as input, as can be seen below. (line 313)
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    TypeError: can only concatenate list (not "int") to list
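The failure in the line above can be reproduced without the Datasets library at all. A minimal sketch (the dict literals below are hypothetical stand-ins for the batched `examples` mapping that `Dataset.map(..., batched=True)` passes in):

```python
def group_texts(examples):
    # Same concatenation step as in run_mlm.py: flatten each batched
    # column (a list of lists) into one long list.
    return {k: sum(examples[k], []) for k in examples.keys()}

# Tokenized text columns are lists of lists, so flattening works:
print(group_texts({"input_ids": [[1, 2], [3, 4]]}))
# {'input_ids': [1, 2, 3, 4]}

# A numeric column such as Oscar's `id` is a flat list of ints, so
# sum(..., []) attempts `[] + 0` and raises the reported error:
try:
    group_texts({"id": [0, 1], "input_ids": [[1, 2], [3, 4]]})
except TypeError as e:
    print(e)  # can only concatenate list (not "int") to list
```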

Expected behavior

Maybe the texts should first be concatenated and then tokenized, or there is a better way to concatenate the texts before generating max_length chunks. Moreover, it would be nice if there were a working example of how to train a language adapter using Datasets.

sstojanoska commented 3 years ago

The issue was that some datasets from the Datasets library have a numerical column ( id in my case) in addition to the text column. To fix this, remove_columns should be used.
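The effect of that fix can be sketched in plain Python: dropping the numeric column before concatenation is what passing `remove_columns` to `Dataset.map` achieves when tokenizing (the `batch` dict here is a hypothetical stand-in for a real batched example):

```python
def group_texts(examples):
    # Concatenation step from run_mlm.py.
    return {k: sum(examples[k], []) for k in examples.keys()}

# Hypothetical batch with a leftover numeric `id` column:
batch = {"id": [0, 1], "input_ids": [[1, 2], [3, 4]]}

# Dropping the numeric column first, mirroring what
# `dataset.map(tokenize_function, batched=True, remove_columns=["id", "text"])`
# does, leaves only list-valued columns, and concatenation succeeds:
cleaned = {k: v for k, v in batch.items() if k != "id"}
print(group_texts(cleaned))
# {'input_ids': [1, 2, 3, 4]}
```

Passing the dataset's original column names to `remove_columns` ensures only the tokenizer's output columns survive into group_texts.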