FARMReader Cannot use inconsistently named model

klys-equinix commented 4 years ago

I am trying to use the https://huggingface.co/henryk/bert-base-multilingual-cased-finetuned-polish-squad2 model in the FARMReader. The model is named rather unfortunately, as its name contains both 'multilingual' and 'polish'. As a result, I get the following error:

ValueError: Could not automatically detect from language model name what language it is.
     Found multiple matches: ['polish', 'multilingual']
     Please init the language model by manually supplying the 'language' as a parameter.

Please add an option to pass the model language as an argument. Here is the full stack trace:

<ipython-input-25-518aa04f83bc> in <module>
      1 document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
----> 2 retriever = FARMReader(model_name_or_path="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
      3                             use_gpu=True, no_ans_boost=-10, context_window_size=500,
      4                             top_k_per_candidate=3, top_k_per_sample=1,
      5                             num_processes=8, max_seq_len=256, doc_stride=128)

~/.local/lib/python3.8/site-packages/haystack/reader/farm.py in __init__(self, model_name_or_path, context_window_size, batch_size, use_gpu, no_ans_boost, top_k_per_candidate, top_k_per_sample, num_processes, max_seq_len, doc_stride)
     85             self.return_no_answers = True
     86         self.top_k_per_candidate = top_k_per_candidate
---> 87         self.inferencer = Inferencer.load(model_name_or_path, batch_size=batch_size, gpu=use_gpu,
     88                                           task_type="question_answering", max_seq_len=max_seq_len,
     89                                           doc_stride=doc_stride, num_processes=num_processes)

~/.local/lib/python3.8/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride, extraction_layer, extraction_strategy, s3e_stats, num_processes, disable_tqdm)
    207                                  "'question_answering', 'embeddings', 'text_classification', 'ner'")
    208 
--> 209             model = AdaptiveModel.convert_from_transformers(model_name_or_path, device, task_type)
    210             config = AutoConfig.from_pretrained(model_name_or_path)
    211             tokenizer = Tokenizer.load(model_name_or_path)

~/.local/lib/python3.8/site-packages/farm/modeling/adaptive_model.py in convert_from_transformers(cls, model_name_or_path, device, task_type, processor)
    578         :return: AdaptiveModel
    579         """
--> 580         lm = LanguageModel.load(model_name_or_path)
    581         #TODO Infer type of head automatically from config
    582 

~/.local/lib/python3.8/site-packages/farm/modeling/language_model.py in load(cls, pretrained_model_name_or_path, n_added_tokens, language_model_class, **kwargs)
    143 
    144             if language_model_class:
--> 145                 language_model = cls.subclasses[language_model_class].load(pretrained_model_name_or_path, **kwargs)
    146             else:
    147                 language_model = None

~/.local/lib/python3.8/site-packages/farm/modeling/language_model.py in load(cls, pretrained_model_name_or_path, language, **kwargs)
    389             # Pytorch-transformer Style
    390             bert.model = BertModel.from_pretrained(str(pretrained_model_name_or_path), **kwargs)
--> 391             bert.language = cls._get_or_infer_language_from_name(language, pretrained_model_name_or_path)
    392         return bert
    393 

~/.local/lib/python3.8/site-packages/farm/modeling/language_model.py in _get_or_infer_language_from_name(cls, language, name)
    218             return language
    219         else:
--> 220             return cls._infer_language_from_name(name)
    221 
    222     @classmethod

~/.local/lib/python3.8/site-packages/farm/modeling/language_model.py in _infer_language_from_name(cls, name)
    241             )
    242         elif len(matches) > 1:
--> 243             raise ValueError(
    244                 "Could not automatically detect from language model name what language it is.\n"
    245                 f"\t Found multiple matches: {matches}\n"

ValueError: Could not automatically detect from language model name what language it is.
     Found multiple matches: ['polish', 'multilingual']
     Please init the language model by manually supplying the 'language' as a parameter.
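
The last two frames of the traceback can be sketched in plain Python. This is a simplified illustration, not FARM's actual code: the language list, the function names, and the English fallback are all assumptions. It shows why a name containing two known languages is ambiguous, and why an explicitly supplied language would avoid the guess entirely.

```python
# Simplified sketch of name-based language detection; NOT FARM's actual code.
# KNOWN_LANGUAGES and both function names are assumptions for illustration.
KNOWN_LANGUAGES = ("german", "english", "chinese", "french",
                   "polish", "spanish", "multilingual")


def infer_language_from_name(name):
    """Guess a language by substring match against the model name."""
    matches = [lang for lang in KNOWN_LANGUAGES if lang in name]
    if len(matches) > 1:
        raise ValueError(f"Found multiple matches: {matches}")
    # Falling back to English when nothing matches is an assumption of this sketch.
    return matches[0] if matches else "english"


def get_or_infer_language(language, name):
    """An explicitly supplied language bypasses the name-based guess."""
    if language is not None:
        return language
    return infer_language_from_name(name)


MODEL = "henryk/bert-base-multilingual-cased-finetuned-polish-squad2"

try:
    get_or_infer_language(None, MODEL)
except ValueError as err:
    print(err)  # the name matches both 'polish' and 'multilingual'

print(get_or_infer_language("polish", MODEL))  # explicit language wins
```

The problem in this issue is that FARMReader and Inferencer.load() expose no parameter that reaches the explicit-language path.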

Timoeller commented 4 years ago

This bug is unfortunate and will require some time to be fixed, because we need to update FARM first.

One solution would be to use a TransformersReader instead of FARMReader with:


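from haystack.reader.transformers import TransformersReader

# In this Haystack version, use_gpu < 0 runs on CPU; use_gpu >= 0 selects a GPU.
reader = TransformersReader(
    model="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
    tokenizer="henryk/bert-base-multilingual-cased-finetuned-polish-squad2",
    use_gpu=-1)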

klys-equinix commented 4 years ago

Thanks for the reply. I have been playing around with the TransformersReader too, but in my experience FARMReader is more useful, so I will wait for the fix. I have also been trying to use FARM directly with:

model = LanguageModel.load("henryk/bert-base-multilingual-cased-finetuned-polish-squad2", language='polish')
nlp = Inferencer.load(model, 
               task_type='question_answering', gpu=True)

That obviously didn't work, because one can't just pass a model object to Inferencer.load(). But Inferencer.load() also does not provide an option to pass a language parameter, which would let you use the model without calling LanguageModel.load yourself. Can you also update Inferencer.load in FARM for that?

brandenchan commented 4 years ago

Hi, in the latest version of FARM the ValueError you were getting is now just a warning. Could you install the latest version of Haystack (which in turn installs a later version of FARM) and try again?
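
As a rough illustration of what "just a warning" could look like, here is a minimal sketch in plain Python. It is not FARM's implementation: the language list, the function name, and in particular the rule for which match is kept (the first one here) are assumptions made for the example.

```python
import warnings

# Hypothetical sketch; NOT FARM's code. The list and the
# "first match wins" rule are assumptions for illustration.
KNOWN_LANGUAGES = ("german", "english", "chinese", "french",
                   "polish", "spanish", "multilingual")


def infer_language_lenient(name):
    """Guess a language from the model name, warning instead of failing."""
    matches = [lang for lang in KNOWN_LANGUAGES if lang in name]
    if len(matches) > 1:
        warnings.warn(
            f"Ambiguous language in model name {name!r}: {matches}; "
            f"using {matches[0]!r}"
        )
    # Falling back to English when nothing matches is also an assumption.
    return matches[0] if matches else "english"


language = infer_language_lenient(
    "henryk/bert-base-multilingual-cased-finetuned-polish-squad2"
)
print(language)
```

With a lenient rule like this, loading proceeds even for ambiguous names, at the cost that the guessed language may not be the one the model was fine-tuned for.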

klys-equinix commented 4 years ago

Hi, I just came back to the project. As of now the problem no longer appears, so I am closing the issue.

deepset-ai / haystack

FARMReader Cannot use inconsistently named model #140