deepset-ai / FARM

:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.
https://farm.deepset.ai
Apache License 2.0

BertStyleLMProcessor not working with german bert model #800

Closed · felixvor closed this issue 3 years ago

felixvor commented 3 years ago

Describe the bug
I am trying to finetune a German BERT model on my own text corpus. When I attempt to load the data, the process crashes at assert 103 not in tokens #mask token. I was able to work around the issue by using "bert-base-cased" as the base model, but I need to finetune "bert-base-german-cased". I tried changing other parameters and found that the model name alone causes the problem: I can reproduce the same error with the unmodified lm_finetuning.py from the FARM examples by changing only the model name there.

Error message

>> python lm_finetuning.py 
/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
06/09/2021 17:20:28 - INFO - farm.modeling.prediction_head -   Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .

 __          __  _                            _        
 \ \        / / | |                          | |       
  \ \  /\  / /__| | ___ ___  _ __ ___   ___  | |_ ___  
   \ \/  \/ / _ \ |/ __/ _ \| '_ ` _ \ / _ \ | __/ _ \ 
    \  /\  /  __/ | (_| (_) | | | | | |  __/ | || (_) |
     \/  \/ \___|_|\___\___/|_| |_| |_|\___|  \__\___/ 
  ______      _____  __  __  
 |  ____/\   |  __ \|  \/  |              _.-^-._    .--.
 | |__ /  \  | |__) | \  / |           .-'   _   '-. |__|
 |  __/ /\ \ |  _  /| |\/| |          /     |_|     \|  |
 | | / ____ \| | \ \| |  | |         /               \  |
 |_|/_/    \_\_|  \_\_|  |_|        /|     _____     |\ |
                                     |    |==|==|    |  |
|---||---|---|---|---|---|---|---|---|    |--|--|    |  |
|---||---|---|---|---|---|---|---|---|    |==|==|    |  |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

06/09/2021 17:20:30 - INFO - farm.utils -   Using device: CPU 
06/09/2021 17:20:30 - INFO - farm.utils -   Number of GPUs: 0
06/09/2021 17:20:30 - INFO - farm.utils -   Distributed Training: False
06/09/2021 17:20:30 - INFO - farm.utils -   Automatic Mixed Precision: None
06/09/2021 17:20:31 - INFO - farm.modeling.tokenization -   Loading tokenizer of type 'BertTokenizer'
06/09/2021 17:20:33 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      

06/09/2021 17:20:33 - INFO - farm.data_handler.data_silo -   LOADING TRAIN DATA
06/09/2021 17:20:33 - INFO - farm.data_handler.data_silo -   ==================
06/09/2021 17:20:33 - INFO - farm.data_handler.data_silo -   Loading train set from: ../data/lm_finetune_nips/train.txt 
06/09/2021 17:20:34 - INFO - farm.data_handler.data_silo -   Got ya 3 parallel workers to convert 5241 dictionaries to pytorch datasets (chunksize = 21)...
06/09/2021 17:20:34 - INFO - farm.data_handler.data_silo -    0    0    0 
06/09/2021 17:20:34 - INFO - farm.data_handler.data_silo -   /w\  /w\  /w\
06/09/2021 17:20:34 - INFO - farm.data_handler.data_silo -   /'\  / \  /'\
06/09/2021 17:20:34 - INFO - farm.data_handler.data_silo -       
Preprocessing Dataset ../data/lm_finetune_nips/train.txt:   0%|                                                                                                    | 0/5241 [00:02<?, ? Dicts/s]
multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/data_silo.py", line 132, in _dataset_from_chunk
    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts, indices=indices)
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/processor.py", line 1384, in dataset_from_dicts
    features.append(self._create_labels(sample=sample, vocab_length=vocab_length))
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/processor.py", line 1667, in _create_labels
    input_ids, lm_label_ids = self._mask_random_words(sample.features["input_ids"], vocab_length, token_groups=sample.tokenized["start_of_word"])
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/processor.py", line 1727, in _mask_random_words
    assert 103 not in tokens #mask token
AssertionError
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vagrant/Desktop/hello-world/farm/temp_lm.py", line 103, in <module>
    lm_finetuning()
  File "/home/vagrant/Desktop/hello-world/farm/temp_lm.py", line 54, in lm_finetuning
    data_silo = DataSilo(processor=processor, batch_size=batch_size, max_multiprocessing_chunksize=20)
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/data_silo.py", line 113, in __init__
    self._load_data()
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/data_silo.py", line 223, in _load_data
    self.data["train"], self.tensor_names = self._get_dataset(train_file)
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/site-packages/farm/data_handler/data_silo.py", line 185, in _get_dataset
    for dataset, tensor_names, problematic_samples in results:
  File "/home/vagrant/anaconda3/envs/nlp/lib/python3.9/multiprocessing/pool.py", line 870, in next
    raise value
AssertionError

Expected behavior
Processors and DataSilos should work with any supported BERT model. As long as the data is in the correct format, it should load and training should start.

To Reproduce
Use lm_finetuning.py from the examples folder, replace line 36 lang_model = "bert-base-cased" with lang_model = "bert-base-german-cased", and run it.

Maybe I am missing additional configuration needed to finetune German BERT? I hope you can help :)

System:

felixvor commented 3 years ago

It is because the [MASK] token is on line 6 (index 5) of vocab.txt for bert-base-german-cased, not at index 103 as in bert-base-cased. Maybe I can look into creating a PR for this.
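To illustrate the root cause: a WordPiece token's id is simply its 0-based line number in the model's vocab.txt, so the special-token ids differ between vocabularies. A minimal sketch (the German vocab's first five entries below are placeholders; only the [MASK] position at index 5 is taken from this issue, while the English layout matches bert-base-cased):

```python
def special_token_ids(vocab_lines):
    """Map each BERT special token to its line index in a vocab listing."""
    specials = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}
    return {tok: i for i, tok in enumerate(vocab_lines) if tok in specials}

# bert-base-cased layout: [PAD]=0, unused slots, then [UNK]=100, [CLS]=101,
# [SEP]=102, [MASK]=103 -- which is why the hardcoded 103 works there.
english_head = ["[PAD]"] + [f"[unused{i}]" for i in range(99)] \
               + ["[UNK]", "[CLS]", "[SEP]", "[MASK]"]
print(special_token_ids(english_head))  # -> {... '[MASK]': 103}

# bert-base-german-cased puts [MASK] on line 6 (index 5); the preceding
# entries here are placeholder names, not the real vocab contents.
german_head = ["tok0", "tok1", "tok2", "tok3", "tok4", "[MASK]"]
print(special_token_ids(german_head))  # -> {'[MASK]': 5}
```

So any check that hardcodes 103 (or 101/102) silently assumes the English vocab layout.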

Quick question, line 1710+ of processor.py:

        # 1. Combine tokens to one group (e.g. all subtokens of a word)
        cand_indices = []
        for (i, token) in enumerate(tokens):
            if token == 101 or token == 102 or token == 0:
                continue

Would 101 and 102 be [SEP] and [CLS]?

julian-risch commented 3 years ago

Hi @DieseKartoffel yes, those are the special tokens: 101 is the [CLS] token and 102 is the [SEP] token. I quickly checked that here.

julian-risch commented 3 years ago

> Maybe I can look into creating a PR for this.

That would be great! Please let me know how it goes and if you would like to discuss any steps. You are correct that the different vocabularies and indices of the special tokens 101, 102, and 103 are causing the problem.
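A sketch of the direction such a fix could take: pass the special-token ids in from the tokenizer instead of hardcoding 101/102/103. The function name and parameters below are illustrative, not FARM's actual _mask_random_words signature:

```python
import random

def mask_random_words(tokens, vocab_length, mask_token_id, special_token_ids,
                      masked_lm_prob=0.15):
    """Vocabulary-agnostic BERT-style masking sketch.

    `mask_token_id` and `special_token_ids` come from the tokenizer, so the
    same code works for bert-base-cased ([MASK]=103) and
    bert-base-german-cased ([MASK]=5).
    """
    assert mask_token_id not in tokens  # input must not be pre-masked
    output = list(tokens)
    labels = [-1] * len(tokens)  # -1 = position not masked, ignored by the loss
    # 1. Collect candidate positions, skipping special tokens ([CLS], [SEP], [PAD])
    cand_indices = [i for i, t in enumerate(tokens) if t not in special_token_ids]
    random.shuffle(cand_indices)
    num_to_mask = max(1, int(round(len(cand_indices) * masked_lm_prob)))
    for i in cand_indices[:num_to_mask]:
        labels[i] = tokens[i]  # remember the original id as the MLM label
        r = random.random()
        if r < 0.8:
            output[i] = mask_token_id                    # 80%: replace with [MASK]
        elif r < 0.9:
            output[i] = random.randrange(vocab_length)   # 10%: random token
        # remaining 10%: keep the original token
    return output, labels
```

FARM's actual implementation also groups subtokens of a word (whole-word masking); this sketch only shows how removing the hardcoded ids generalizes the check.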

julian-risch commented 3 years ago

@DieseKartoffel Thank you for your contribution to FARM! 👍 Your changes are merged now.