UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Please future-proof `clean_up_tokenization_spaces` #2922

Open · PhorstenkampFuzzy opened this issue 2 months ago

PhorstenkampFuzzy commented 2 months ago

This is the FutureWarning we are currently receiving:

```
transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
```

pesuchin commented 2 months ago

This warning occurs when the kwargs passed to AutoTokenizer.from_pretrained do not include a clean_up_tokenization_spaces value: https://github.com/huggingface/transformers/blob/47b096412da9cbeb9351806e9f0eb70a693b2859/src/transformers/tokenization_utils_base.py#L1601-L1607

To prevent the warning from being issued, clean_up_tokenization_spaces needs to be added to all AutoTokenizer.from_pretrained calls used within sentence_transformers. Currently the default value is True, so specifying clean_up_tokenization_spaces=True avoids the warning without changing behavior.
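
For instance, a minimal sketch of that workaround (the model name here is just an illustrative placeholder):

```python
from transformers import AutoTokenizer

# Passing the flag explicitly matches the current default (True) and keeps
# the FutureWarning from firing, without changing tokenizer behavior.
tokenizer = AutoTokenizer.from_pretrained(
    "bert-base-uncased",
    clean_up_tokenization_spaces=True,
)
```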

For example, we can confirm that the warning is being generated in examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py.

```
$ python examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
/Users/username/project/sentence-transformers/.venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1600: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
```

To avoid the warning in this script, clean_up_tokenization_spaces=True needs to be specified via tokenizer_kwargs in train_stsb_tsdae.py, and the three DenoisingAutoEncoderLoss calls inside the library need to be modified accordingly; a hedged sketch of the user-facing part follows.
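
A sketch of the user-facing part, assuming the tokenizer_args parameter of models.Transformer (the AutoTokenizer.from_pretrained calls inside DenoisingAutoEncoderLoss would still need the library-side change):

```python
from sentence_transformers import SentenceTransformer, models

# Forward the flag to the underlying AutoTokenizer via tokenizer_args.
word_embedding_model = models.Transformer(
    "bert-base-uncased",  # illustrative; train_stsb_tsdae.py uses its own model_name
    tokenizer_args={"clean_up_tokenization_spaces": True},
)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
```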

The warning message states that the default value will be changed to False in future versions, but according to the pull request below, clean_up_tokenization_spaces=True seems to be necessary for BERT-based models, so it's unlikely that the effective value will be changed to False for those models.

https://github.com/huggingface/transformers/pull/31938

Therefore, it seems that no action is needed for BERT-based models. However, one concern is that with the current behavior of sentence_transformers, a warning will be issued unless the user explicitly specifies clean_up_tokenization_spaces=True.

I'm not sure about the full scope of impact, so I can't say whether this response is correct, but I believe this approach would avoid the Warning.

ArthurZucker commented 2 months ago

cc @tomaarsen. If you need insight on this, let me know!

tomaarsen commented 2 months ago

@ArthurZucker I'm considering following @pesuchin's recommendation and adding clean_up_tokenization_spaces=True to avoid the warnings, but I'm very wary that hardcoding this option would create incompatibilities if some future transformers models are trained with clean_up_tokenization_spaces=False. If such a model is then loaded into Sentence Transformers (with clean_up_tokenization_spaces=True), the model suddenly becomes a lot worse. (I'm assuming here that the tokenization spaces affect the tokens.)

I think hardcoding clean_up_tokenization_spaces=True would fail because of that. Would love to hear what you think.

cc @itazap

itazap commented 2 months ago

Hey! This is the upcoming PR that deprecates the behavior and switches the default to False: https://github.com/huggingface/transformers/pull/31938. It will keep clean_up_tokenization_spaces=True for models that require it (such as BERT-based ones and some others; see the modified files in the PR).

In terms of future models, the clean_up_tokenization_spaces function itself arbitrarily strips whitespace (post-tokenization), so I would say it is good practice for future models to set it explicitly if that is the intention, or better yet to make it part of the tokenize logic directly. Let me know what you think 😄
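
To make the behavior concrete, a small demo of what the cleanup does on decode (a sketch; the model name is illustrative and the printed outputs are approximate):

```python
from transformers import AutoTokenizer

# Slow tokenizer, so the Python-level cleanup flag is what does the work.
tok = AutoTokenizer.from_pretrained("bert-base-uncased", use_fast=False)
ids = tok("Hello, world! Don't panic.")["input_ids"]

# With cleanup: spaces before punctuation are stripped after detokenization.
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=True))
# roughly: hello, world! don't panic.

# Without cleanup: the raw space-joined tokens are kept.
print(tok.decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False))
# roughly: hello , world ! don ' t panic .
```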

pradip292 commented 2 months ago

Please, how do I solve this error?

itazap commented 2 months ago

@pradip292 What is the error you are experiencing? This warning is expected; it communicates the upcoming deprecation.

pradip292 commented 2 months ago

> What is the error you are experiencing? This warning is expected; it communicates the upcoming deprecation.

After this warning, my Streamlit app automatically stops. The warning is like this:

```
FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
```

itazap commented 2 months ago

@pradip292 Can you paste the error? Perhaps your Streamlit app needs to suppress warnings to render, but the presence of the warning shouldn't result in an error.
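
If the goal is only to keep the warning out of the app's output, one option is Python's standard warnings filter; this is a sketch of a suppression, not a fix for the deprecation itself:

```python
import warnings

# Silence just this FutureWarning; run before transformers is imported/used.
warnings.filterwarnings(
    "ignore",
    message=".*clean_up_tokenization_spaces.*",
    category=FutureWarning,
)
```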

pradip292 commented 2 months ago

> Can you paste the error? Perhaps your Streamlit app needs to suppress warnings to render, but the presence of the warning shouldn't result in an error.

I did that, but I am still facing the same error.

PhorstenkampFuzzy commented 2 months ago

I am getting the warning, but my application is working perfectly fine. Are you sure your problem is related to the warning? @pradip292

pradip292 commented 2 months ago

> I am getting the warning, but my application is working perfectly fine. Are you sure your problem is related to the warning?

I will check and update you after some time.

SDArtz commented 2 months ago

![Screenshot 2024-09-18 at 9 59 14 AM](https://github.com/user-attachments/assets/e834cb71-9228-48a1-8c8d-805249ae31b5)

```
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
  warnings.warn(
```

itazap commented 2 months ago

@SDArtz @pradip292 Thanks for providing the output. This output is an expected warning that we want to display; it is not an error.

pradip292 commented 2 months ago

> Thanks for providing the output. This output is an expected warning that we want to display; it is not an error.

Now I am facing another error. I have tried many options, but it still shows this error. What should I do?

```
raise SSLError(e, request=request)
requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-mpnet-base-v2/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))"), '(Request ID: 52c928fd-ec37-4b18-ab6a-5f11209595a2)')
```

tomaarsen commented 2 months ago

@pradip292 Are you able to browse to this URL: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/config.json ?

It seems that you were (temporarily or otherwise) unable to automatically download this file, which is required to initialize the model. It is unrelated to the clean_up_tokenization_spaces warning.
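
One way to test the download outside the application is a direct huggingface_hub call (huggingface_hub ships as a dependency of sentence-transformers); a minimal sketch:

```python
from huggingface_hub import hf_hub_download

# If this raises the same SSLError, the problem is the network/certificates,
# not sentence-transformers itself.
path = hf_hub_download(
    repo_id="sentence-transformers/all-mpnet-base-v2",
    filename="config.json",
)
print(path)  # local cache path on success
```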

tomaarsen commented 2 months ago

@itazap @ArthurZucker

> Hey! This is the upcoming PR that deprecates the behavior and switches the default to False: huggingface/transformers#31938. It will keep clean_up_tokenization_spaces=True for models that require it (such as BERT-based ones and some others; see the modified files in the PR).
>
> In terms of future models, the clean_up_tokenization_spaces function itself arbitrarily strips whitespace (post-tokenization), so I would say it is good practice for future models to set it explicitly if that is the intention, or better yet to make it part of the tokenize logic directly. Let me know what you think 😄

Thanks for the answer! This sounds like I should indeed defer to transformers and not hardcode anything in Sentence Transformers, under the impression that you will keep it as True for models that require it (and thus prevent any breaking changes). Please correct me if I'm wrong. In that case, should I just wait for a new transformers version where the deprecation is merged and released?

pradip292 commented 2 months ago

> Are you able to browse to this URL: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/config.json ?

Yes. I am facing that issue only on my laptop; my friends tried it on their laptops and it worked. I don't know how to deal with it. I have downloaded all the SSL files and everything, but still no luck. Help me, please.

itazap commented 2 months ago

@tomaarsen Yes, exactly. The deprecation will maintain True for models that require it! 😊
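
For anyone wanting to verify what a given model will use, the effective setting is exposed as a tokenizer attribute; a quick check (model name illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
# True for BERT-based tokenizers that rely on the cleanup step.
print(tok.clean_up_tokenization_spaces)
```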

Calabrone76 commented 2 months ago

> Now I am facing another error. [...] requests.exceptions.SSLError: [...] certificate verify failed: unable to get local issuer certificate

This is probably caused by a firewall (a man in the middle) that replaces SSL certificates. We had similar errors with our company firewall. Check this out.
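
If a corporate proxy does re-sign TLS traffic, pointing the HTTP stack at the company's CA bundle often helps; a sketch under the assumption that your IT team provides such a bundle (the path is a placeholder):

```python
import os

# requests (used by huggingface_hub) honors this variable; set it before
# any download is attempted.
os.environ["REQUESTS_CA_BUNDLE"] = "/path/to/corporate-ca-bundle.pem"

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
```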

keshavvarma commented 1 month ago

> Yes. I am facing that issue only on my laptop [...] I have downloaded all the SSL files and everything, but still no luck. Help me, please.

I am also facing the same issue on my laptop and don't know how to solve it.

pradip292 commented 1 month ago

> I am also facing the same issue on my laptop and don't know how to solve it.

Issue is solved :-)

keshavvarma commented 1 month ago

> Issue is solved :-)

What is the solution? Can you please let me know?

pradip292 commented 1 month ago

> What is the solution? Can you please let me know?

Actually, that model was not working, so I am using another Hugging Face model; there are different models out there. As for the SSL errors, I just needed to download some files that the code uses, which were still missing. I took help from ChatGPT and was able to solve that error. (Sorry for my English, I am a student.)

keshavvarma commented 1 month ago

> Actually, that model was not working, so I am using another Hugging Face model [...] I took help from ChatGPT and was able to solve that error.

Okay, I will try it out, thanks!!

RisingInsight commented 1 month ago

```python
self.tokenizer = BertTokenizer.from_pretrained(pretrained_bert_name, clean_up_tokenization_spaces=False)
```

I simply added clean_up_tokenization_spaces=False directly to the BertTokenizer call.
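
A self-contained version of that snippet (the model name is a placeholder; note that, per the discussion above, BERT-based models may rely on the cleanup being True):

```python
from transformers import BertTokenizer

pretrained_bert_name = "bert-base-uncased"
# Setting the flag explicitly (either value) silences the FutureWarning;
# False also disables the punctuation-space cleanup on decode.
tokenizer = BertTokenizer.from_pretrained(
    pretrained_bert_name,
    clean_up_tokenization_spaces=False,
)
```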

PhorstenkampFuzzy commented 1 month ago

Any news on a solution for the original issue?

keshavvarma commented 1 month ago

Not yet. For the time being, I changed the vector database to FAISS and the Groq model to llama3-8b-8192.