deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

How to use XLM-R as retriever correctly? #506

Closed · khalidbhs closed 4 years ago

khalidbhs commented 4 years ago

I'm trying to use xlm-r-100langs-bert-base-nli-stsb-mean-tokens as a retriever with:

retriever = EmbeddingRetriever(document_store=document_store, embedding_model='xlm-r-100langs-bert-base-nli-stsb-mean-tokens', model_format='sentence_transformers')
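
(For completeness, the same setup spelled out with its import; document_store is assumed to be an already initialized Haystack document store, its creation is omitted here:)

from haystack.retriever.dense import EmbeddingRetriever

# document_store: an already initialized Haystack document store (creation omitted)
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model='xlm-r-100langs-bert-base-nli-stsb-mean-tokens',
    model_format='sentence_transformers',
)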

When I try to embed a text with retriever.embed('test'), it raises this error:

/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in get_extended_attention_mask(self, attention_mask, input_shape, device)
    260             raise ValueError(
    261                 "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
--> 262                     input_shape, attention_mask.shape
    263                 )
    264             )

ValueError: Wrong shape for input_ids (shape torch.Size([4])) or attention_mask (shape torch.Size([4]))

I also tried to use the model from the Hugging Face model hub:

retriever = EmbeddingRetriever(document_store=document_store, embedding_model='sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens', model_format='transformers')

but it raises this error:

TypeError                                 Traceback (most recent call last)

<ipython-input-34-0b021b13e848> in <module>()
      1 from haystack.retriever.dense import EmbeddingRetriever
----> 2 retriever = EmbeddingRetriever(document_store=document_store, embedding_model='sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens', model_format='transformers')

6 frames

/usr/local/lib/python3.6/dist-packages/haystack/retriever/dense.py in __init__(self, document_store, embedding_model, use_gpu, model_format, pooling_strategy, emb_extraction_layer)
    300             self.embedding_model = Inferencer.load(
    301                 embedding_model, task_type="embeddings", extraction_strategy=self.pooling_strategy,
--> 302                 extraction_layer=self.emb_extraction_layer, gpu=use_gpu, batch_size=4, max_seq_len=512, num_processes=0
    303             )
    304 

/usr/local/lib/python3.6/dist-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride, extraction_layer, extraction_strategy, s3e_stats, num_processes, disable_tqdm, tokenizer_class, use_fast, tokenizer_args, dummy_ph, benchmarking)
    271                                        tokenizer_class=tokenizer_class,
    272                                        use_fast=use_fast,
--> 273                                        **tokenizer_args,
    274                                        )
    275 

/usr/local/lib/python3.6/dist-packages/farm/modeling/tokenization.py in load(cls, pretrained_model_name_or_path, tokenizer_class, use_fast, **kwargs)
    131                 ret = BertTokenizerFast.from_pretrained(pretrained_model_name_or_path, **kwargs)
    132             else:
--> 133                 ret = BertTokenizer.from_pretrained(pretrained_model_name_or_path, **kwargs)
    134         elif tokenizer_class == "XLNetTokenizer":
    135             if use_fast:

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in from_pretrained(cls, *inputs, **kwargs)
   1423 
   1424         """
-> 1425         return cls._from_pretrained(*inputs, **kwargs)
   1426 
   1427     @classmethod

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils_base.py in _from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1570         # Instantiate tokenizer.
   1571         try:
-> 1572             tokenizer = cls(*init_inputs, **init_kwargs)
   1573         except OSError:
   1574             raise OSError(

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_bert.py in __init__(self, vocab_file, do_lower_case, do_basic_tokenize, never_split, unk_token, sep_token, pad_token, cls_token, mask_token, tokenize_chinese_chars, strip_accents, **kwargs)
    189         )
    190 
--> 191         if not os.path.isfile(vocab_file):
    192             raise ValueError(
    193                 "Can't find a vocabulary file at path '{}'. To load the vocabulary from a Google pretrained "

/usr/lib/python3.6/genericpath.py in isfile(path)
     28     """Test whether a path is a regular file"""
     29     try:
---> 30         st = os.stat(path)
     31     except OSError:
     32         return False

TypeError: stat: path should be string, bytes, os.PathLike or integer, not NoneType

Any advice on how to use the xlm-r-100langs-bert-base-nli-stsb-mean-tokens model correctly?
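
(For reference, embedding with this model directly through sentence-transformers, outside of Haystack, would look roughly like this:)

from sentence_transformers import SentenceTransformer

# Rough sketch of the direct sentence-transformers path (outside Haystack);
# encode() expects a list of strings and returns one embedding per string.
model = SentenceTransformer('xlm-r-100langs-bert-base-nli-stsb-mean-tokens')
embeddings = model.encode(['test sentence'])
print(embeddings[0].shape)  # expected to be a 768-dimensional vector for this model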

tholor commented 4 years ago

Thanks for reporting this bug @khalidbhs !

  1. For model_format='transformers': This is an edge case where the model name contains "bert", so the BERT tokenizer gets loaded instead of the XLM-R tokenizer (see the rough sketch after this list). We'll fix this.

  2. For model_format='sentence-transformers': It's not yet clear what is happening here, possibly a version issue. Which transformers and sentence-transformers versions are you using?
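
To make the first point concrete, here is a rough, simplified sketch of how the name-based tokenizer selection goes wrong (illustration only, not the actual FARM code):

model_name = "sentence-transformers/xlm-r-100langs-bert-base-nli-stsb-mean-tokens"

# Illustration only, not the actual FARM code: the tokenizer class is guessed
# from substrings of the model name, and this XLM-R model's name contains "bert".
def guess_tokenizer_class(name: str) -> str:
    name = name.lower()
    if "xlm-roberta" in name:   # does NOT match "xlm-r-100langs-..."
        return "XLMRobertaTokenizer"
    if "roberta" in name:       # does not match either
        return "RobertaTokenizer"
    if "bert" in name:          # matches the "-bert-base-" part of the name
        return "BertTokenizer"
    return "BertTokenizer"      # naive fallback

print(guess_tokenizer_class(model_name))  # -> "BertTokenizer", wrong for an XLM-R model

BertTokenizer then looks for a BERT-style vocab.txt, which this SentencePiece-based XLM-R model does not ship, which is presumably where the vocab_file=None / NoneType error in your traceback comes from. Inferring the class from the model config (as transformers' AutoTokenizer does) avoids this.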

khalidbhs commented 4 years ago

Cool, thanks! I used transformers==3.1.0 and sentence-transformers==0.3.4. By the way, here's the full stack trace for model_format='sentence-transformers':

ValueError                                Traceback (most recent call last)

<ipython-input-24-d3ac4f6b76bb> in <module>()
----> 1 retriever.embed('test')

7 frames

/usr/local/lib/python3.6/dist-packages/haystack/retriever/dense.py in embed(self, texts)
    347             # text is single string, sentence-transformers needs a list of strings
    348             # get back list of numpy embedding vectors
--> 349             emb = self.embedding_model.encode(texts)  # type: ignore
    350             emb = [r for r in emb]
    351         return emb

/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py in encode(self, sentences, batch_size, show_progress_bar, output_value, convert_to_numpy, convert_to_tensor, is_pretokenized, device, num_workers)
    150 
    151             with torch.no_grad():
--> 152                 out_features = self.forward(features)
    153                 embeddings = out_features[output_value]
    154 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/container.py in forward(self, input)
    115     def forward(self, input):
    116         for module in self:
--> 117             input = module(input)
    118         return input
    119 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/sentence_transformers/models/Transformer.py in forward(self, features)
     33     def forward(self, features):
     34         """Returns token_embeddings, cls_token"""
---> 35         output_states = self.auto_model(**features)
     36         output_tokens = output_states[0]
     37 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
    720             result = self._slow_forward(*input, **kwargs)
    721         else:
--> 722             result = self.forward(*input, **kwargs)
    723         for hook in itertools.chain(
    724                 _global_forward_hooks.values(),

/usr/local/lib/python3.6/dist-packages/transformers/modeling_bert.py in forward(self, input_ids, attention_mask, token_type_ids, position_ids, head_mask, inputs_embeds, encoder_hidden_states, encoder_attention_mask, output_attentions, output_hidden_states, return_dict)
    802         # We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
    803         # ourselves in which case we just need to make it broadcastable to all heads.
--> 804         extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
    805 
    806         # If a 2D ou 3D attention mask is provided for the cross-attention

/usr/local/lib/python3.6/dist-packages/transformers/modeling_utils.py in get_extended_attention_mask(self, attention_mask, input_shape, device)
    260             raise ValueError(
    261                 "Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(
--> 262                     input_shape, attention_mask.shape
    263                 )
    264             )

ValueError: Wrong shape for input_ids (shape torch.Size([4])) or attention_mask (shape torch.Size([4]))

tholor commented 4 years ago

Fixing this in https://github.com/deepset-ai/FARM/issues/571

bogdankostic commented 4 years ago

Fixed in https://github.com/deepset-ai/FARM/pull/600

tholor commented 4 years ago

This will be available in Haystack within the next few days (after the FARM release), or you can install the latest FARM version from master manually.

khalidbhs commented 4 years ago

Great, I've tried it and it's working now, thanks!

tholor commented 4 years ago

Perfect, thanks for the feedback!