elastic / eland

Python Client and Toolkit for DataFrames, Big Data, Machine Learning and ETL in Elasticsearch
https://eland.readthedocs.io
Apache License 2.0

Support for importing model #687

Closed · mauiroma · closed 2 months ago

mauiroma commented 6 months ago

Using Eland to load the model https://huggingface.co/osiria/minilm-l12-h384-italian-cased, I get this error:

Establishing connection to Elasticsearch
Connected to cluster named 'xxxxxxxx' (version: 8.13.2)
Loading HuggingFace transformer tokenizer and model 'osiria/minilm-l12-h384-italian-cased'
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2256, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/models/xlm_roberta/tokenization_xlm_roberta.py", line 154, in __init__
    self.sp_model.Load(str(vocab_file))
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 961, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.10/site-packages/sentencepiece/__init__.py", line 316, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
OSError: Not found: "None": No such file or directory Error #2

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/eland_import_hub_model", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.10/site-packages/eland/cli/eland_import_hub_model.py", line 269, in main
    tm = TransformerModel(
  File "/usr/local/lib/python3.10/site-packages/eland/ml/pytorch/transformers.py", line 655, in __init__
    self._tokenizer = transformers.AutoTokenizer.from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 768, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2024, in from_pretrained
    return cls._from_pretrained(
  File "/usr/local/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 2258, in _from_pretrained
    raise OSError(
OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

The command I used is:

eland_import_hub_model --cloud-id '$CLOUD_ID'   --es-username $USERNAME --es-password '$PASSWORD'  --es-model-id 'osirialegal'  --hub-model-id osiria/minilm-l12-h384-italian-cased

davidkyle commented 6 months ago

The failure is due to the missing file sentencepiece.bpe.model. The error is easy to reproduce with AutoTokenizer.from_pretrained using the slow tokenizer:

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('osiria/minilm-l12-h384-italian-cased', use_fast=False)

Returns the error:

OSError: Unable to load vocabulary from file. Please check that the provided vocabulary is accessible and not corrupted.

With the fast tokenizer, loading does not error; I'm assuming the fast tokenizer downloads the sentencepiece model automatically.

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained('osiria/minilm-l12-h384-italian-cased', use_fast=True)
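
One way to confirm why one path works and the other fails is to list the files in the Hub repo. This check is my own sketch using huggingface_hub (not something the import script does), and the expected output reflects the file layout implied by the error above:

from huggingface_hub import list_repo_files

files = list_repo_files('osiria/minilm-l12-h384-italian-cased')
print('sentencepiece.bpe.model' in files)  # expect False: the slow tokenizer's vocab file is missing
print('tokenizer.json' in files)           # expect True: the fast tokenizer loads from this file instead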

Eland needs to use the slow tokenizer. One option is to take sentencepiece.bpe.model from the xlm-roberta-base repo and add it to 'osiria/minilm-l12-h384-italian-cased'. To do this, first git clone https://huggingface.co/osiria/minilm-l12-h384-italian-cased (you will need to install Git LFS), then add sentencepiece.bpe.model to the cloned repo.
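
If you would rather avoid Git LFS, here is a sketch of the same workaround using huggingface_hub; the local directory name is my own choice:

from huggingface_hub import hf_hub_download, snapshot_download

# download the model repo to a local directory
snapshot_download('osiria/minilm-l12-h384-italian-cased', local_dir='minilm-italian')

# fetch the missing sentencepiece model from the xlm-roberta-base repo into the same directory
hf_hub_download('xlm-roberta-base', 'sentencepiece.bpe.model', local_dir='minilm-italian')

You should then be able to point --hub-model-id at the local directory.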

federicocesarini1 commented 6 months ago

@davidkyle I followed your instructions, but now I get another error: torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse) IndexError: index out of range in self

Is there another way to make it work?

federicocesarini1 commented 6 months ago

@davidkyle this is the complete error:

Exception has occurred: IndexError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
index out of range in self
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 126, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 830, in forward
    embedding_output = self.embeddings(
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/sentence_transformers/models/Transformer.py", line 98, in forward
    output_states = self.auto_model(**trans_features, return_dict=False)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 387, in forward
    return self._st_model(inputs)[self._output_key]
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/myvenv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 471, in sample_output
    return self._model(*inputs)
  File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 786, in _create_config
    sample_embedding = self._traceable_model.sample_output()
  File "/home/federico/Desktop/work2/eland/eland/ml/pytorch/transformers.py", line 669, in __init__
    self._config = self._create_config(es_version)
  File "/home/federico/Desktop/work2/eland/eland/cli/eland_import_hub_model.py", line 269, in main
    tm = TransformerModel(
  File "/home/federico/Desktop/work2/eland/eland/cli/eland_import_hub_model.py", line 334, in <module>
    main()
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,
IndexError: index out of range in self

davidkyle commented 6 months ago

Thanks for the stack trace.

When Eland is used to import the model, it runs a test evaluation to measure the size of the embedding produced by the model. This is part of the config and is useful when configuring the dims parameter of the dense_vector field mapping in Elasticsearch. The error comes from this test evaluation; in this case it is probably because the inputs to forward(...) are not in the expected format.
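
For reference, here is a minimal sketch of what that test evaluation amounts to; this is not Eland's exact code, and the model name is just a stand-in:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embedding = model.encode('a short test sentence')

# the measured size is what fills the dims parameter of the dense_vector mapping,
# e.g. "embedding": {"type": "dense_vector", "dims": 384}
print(embedding.shape[0])  # 384 for this stand-in model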

I should be able to reproduce this.

davidkyle commented 5 months ago

I looked into this again and there is another issue that prevents this model from being used in Elasticsearch.

Elasticsearch uses the libtorch C++ library to run the NLP models. The models must be converted to the TorchScript format before they can run in libtorch; this conversion is one of the things the Eland script does. Tracing this particular model fails with an error:

RuntimeError: Encountering a dict at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
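
This error is generic to tracing any module whose forward returns a dict; a toy module of my own (nothing to do with this model) hits the same failure:

import torch

class ReturnsDict(torch.nn.Module):
    def forward(self, x):
        return {'doubled': x * 2}  # dict-shaped output

# raises the RuntimeError quoted above, because trace() defaults to strict=True
torch.jit.trace(ReturnsDict(), example_inputs=(torch.ones(2),))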

Here is a Python snippet that reproduces the failed trace operation. I used a local copy of the repository with sentencepiece.bpe.model added, as mentioned above:

from transformers import AutoModel, AutoTokenizer
import torch
# load model & tokenizer
tokenizer = AutoTokenizer.from_pretrained('<directory of the downloaded model to which we added sentencepiece>', use_fast=False)
model = AutoModel.from_pretrained('<directory of the downloaded model to which we added sentencepiece>')
# create sample input
encoded_input = tokenizer("Replace me by any text you'd like.", return_tensors='pt')
trace_inputs = (encoded_input["input_ids"], encoded_input["attention_mask"])
# trace model fails
traced = torch.jit.trace(model, example_inputs=trace_inputs)

Closing this issue: if the model cannot be traced, it cannot be supported.

serenachou commented 2 months ago

@arisonl @davidkyle @dimkots - just FYI, competitors like Weaviate support Jina AI models. Is there a plan to address this for us?

davidkyle commented 2 months ago

The general problem of importing a model without a vocabulary file is tracked in #721. Closing this issue, as there are other problems with this specific model (osiria/minilm-l12-h384-italian-cased) that mean it cannot be imported.