freedmand / semantra

Multi-tool for semantic search
MIT License
2.49k stars 139 forks

Error encountered while using model "sentence-transformers/all-mpnet-base-v2" #31

Closed. ASDpaper closed this issue 1 year ago.

ASDpaper commented 1 year ago

Hello, I am encountering an error while trying to use the model "sentence-transformers/all-mpnet-base-v2" in a script.

(base) PS F:\Download> semantra hamlet.pdf
Traceback (most recent call last):
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_errors.py", line 259, in hf_raise_for_status
    response.raise_for_status()
  File "E:\software\Anaconnda\lib\site-packages\requests\models.py", line 1021, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/tokenizer_config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "E:\software\Anaconnda\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\file_download.py", line 1195, in hf_hub_download
    metadata = get_hf_file_metadata(
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\file_download.py", line 1541, in get_hf_file_metadata
    hf_raise_for_status(r)
  File "E:\software\Anaconnda\lib\site-packages\huggingface_hub\utils\_errors.py", line 291, in hf_raise_for_status
    raise RepositoryNotFoundError(message, response) from e
huggingface_hub.utils._errors.RepositoryNotFoundError: 404 Client Error. (Request ID: Root=1-64599c33-4e9b1c4a0839489551a6eee6)

Repository Not Found for url: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/tokenizer_config.json. Please make sure you specified the correct repo_id and repo_type. If you are trying to access a private or gated repo, make sure you are authenticated.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "E:\software\Anaconnda\lib\runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "E:\software\Anaconnda\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "E:\software\Anaconnda\Scripts\semantra.exe\__main__.py", line 7, in <module>
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "E:\software\Anaconnda\lib\site-packages\click\core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\semantra\semantra.py", line 598, in main
    model: BaseModel = model_config["get_model"]()
  File "E:\software\Anaconnda\lib\site-packages\semantra\models.py", line 334, in <lambda>
    "get_model": lambda: TransformerModel(model_name=mpnet_model_name),
  File "E:\software\Anaconnda\lib\site-packages\semantra\models.py", line 166, in __init__
    self.tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "E:\software\Anaconnda\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 642, in from_pretrained
    tokenizer_config = get_tokenizer_config(pretrained_model_name_or_path, **kwargs)
  File "E:\software\Anaconnda\lib\site-packages\transformers\models\auto\tokenization_auto.py", line 486, in get_tokenizer_config
    resolved_config_file = cached_file(
  File "E:\software\Anaconnda\lib\site-packages\transformers\utils\hub.py", line 424, in cached_file
    raise EnvironmentError(
OSError: sentence-transformers/all-mpnet-base-v2 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True.

I have attempted to resolve the issue by ensuring I am using the correct model identifier and checking my internet access. I have also tried logging in with the Hugging Face CLI before running the script.

However, the error persists. Any assistance in resolving this issue would be greatly appreciated.

Environment information:
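The 404 in the traceback comes from a plain HTTPS request for the tokenizer config file. As a quick sanity check, here is a minimal sketch (a hypothetical helper, not part of Semantra or huggingface_hub) of how that resolve URL is put together, so you can paste it into a browser or curl it to see whether the Hub itself is responding:

```python
def hub_resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build the huggingface.co 'resolve' URL that transformers fetches for a repo file."""
    return f"https://huggingface.co/{repo_id}/resolve/{revision}/{filename}"

# The exact URL that returned 404 in the traceback above:
url = hub_resolve_url("sentence-transformers/all-mpnet-base-v2", "tokenizer_config.json")
print(url)
```

If that URL loads fine in a browser while the script still fails, the problem is more likely local (proxy, DNS, or a transient Hub outage at the time of the run).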

Guruprasad93 commented 1 year ago

Yes! I'm running into the same issue.

Sentence-transformers models don't load through the usual AutoTokenizer.from_pretrained(); they need to be loaded through a different class, SentenceTransformer(model_name).

So I think the code base needs to be updated to incorporate that change. The latest version of the Hugging Face libraries doesn't support loading them the old way.
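For reference, a minimal sketch of the two loading styles being contrasted here (assumes sentence-transformers and transformers are installed; the imports are deferred so nothing is downloaded until the function is actually called):

```python
def load_model(model_name: str, use_sentence_transformers: bool = True):
    """Load an embedding model either via sentence-transformers or raw transformers."""
    if use_sentence_transformers:
        # sentence-transformers wraps the transformer together with its
        # pooling/normalization config and exposes .encode() directly
        from sentence_transformers import SentenceTransformer
        return SentenceTransformer(model_name)
    # Raw transformers path: tokenizer and model are loaded separately,
    # and the caller must implement pooling over token embeddings itself
    from transformers import AutoModel, AutoTokenizer
    return AutoTokenizer.from_pretrained(model_name), AutoModel.from_pretrained(model_name)
```

Calling load_model("sentence-transformers/all-mpnet-base-v2") downloads the weights from the Hub on first use, which is exactly the step that fails in the traceback above when huggingface.co is unreachable.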

WangYenChieh commented 1 year ago

Same here. Looks like it's not an isolated issue.

Guruprasad93 commented 1 year ago

The fix is:

In src/models.py, you can replace this line:

minilm_model_name = "sentence-transformers/all-MiniLM-L6-v2"

with this:

minilm_model_name = "obrizum/all-MiniLM-L6-v2"

and then in your terminal run:

semantra --model minilm

ASDpaper commented 1 year ago

Thank you very much!

freedmand commented 1 year ago

See https://github.com/freedmand/semantra/issues/32#issuecomment-1540149884

This is not an issue with Semantra, but rather with the services that host the models. The default model should work if you try again later, once the status pages for Hugging Face and GitHub show they are operational.

In src/models.py, you can replace this line:

minilm_model_name = "sentence-transformers/all-MiniLM-L6-v2"

with this:

minilm_model_name = "obrizum/all-MiniLM-L6-v2"

For future reference, you can do this without any code changes by passing --transformer-model. See https://github.com/freedmand/semantra/blob/main/docs/guide_models.md#using-custom-models
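Putting that together, a sketch of the no-code-change invocation (the --transformer-model flag is from the model guide linked above; the obrizum repo id is the mirror suggested earlier in this thread, and hamlet.pdf is just the example file from the original report):

```shell
# Point Semantra at a mirror of the MiniLM model without editing src/models.py
semantra --transformer-model obrizum/all-MiniLM-L6-v2 hamlet.pdf
```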

freedmand commented 1 year ago

See also https://github.com/UKPLab/sentence-transformers/issues/1915