PhorstenkampFuzzy opened 2 months ago
This warning occurs when the kwargs passed to `AutoTokenizer.from_pretrained` do not include a `clean_up_tokenization_spaces` value. https://github.com/huggingface/transformers/blob/47b096412da9cbeb9351806e9f0eb70a693b2859/src/transformers/tokenization_utils_base.py#L1601-L1607
To prevent this warning from being issued, `clean_up_tokenization_spaces` needs to be added to all `AutoTokenizer.from_pretrained` calls used within sentence_transformers. Currently the default value is `True`, so specifying `clean_up_tokenization_spaces=True` avoids the warning.
For example, we can confirm that the warning is being generated in examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py.
$ python examples/unsupervised_learning/TSDAE/train_stsb_tsdae.py
/Users/username/project/sentence-transformers/.venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py:1600: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
To avoid the warning in this process, `clean_up_tokenization_spaces=True` needs to be specified in `tokenizer_kwargs` in train_stsb_tsdae.py, and the three `DenoisingAutoEncoderLoss` call sites need to be modified: passing `clean_up_tokenization_spaces=True` to `AutoTokenizer.from_pretrained`, and passing `clean_up_tokenization_spaces=True` to the `batch_decode` call.
The warning message states that the default value will be changed to `False` in future versions, but according to this pull request, `clean_up_tokenization_spaces=True` seems to be necessary for BERT-based models, so it's unlikely that the default value will actually be changed to `False` for them:
https://github.com/huggingface/transformers/pull/31938
Therefore, it seems that no action is needed for BERT-based models.
However, one concern is that with the current behavior of sentence_transformers, a warning will be issued unless the user explicitly specifies `clean_up_tokenization_spaces=True`. One option would be to set `clean_up_tokenization_spaces=True` by default.
I'm not sure about the full scope of impact, so I can't say whether this response is correct, but I believe this approach would avoid the warning.
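As a stopgap for users who just want to silence this specific FutureWarning in their own scripts (it does not change tokenizer behavior), Python's standard library `warnings` filter can target it by message. The regex below is an assumption matched against the warning text quoted above:

```python
import warnings

# Stopgap: ignore only the clean_up_tokenization_spaces FutureWarning.
# filterwarnings matches this regex against the start of the warning message.
warnings.filterwarnings(
    "ignore",
    message=r".*clean_up_tokenization_spaces.*",
    category=FutureWarning,
)
```

This must run before the tokenizer is loaded; other FutureWarnings are still shown.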
cc @tomaarsen if you need insight on that tell me!
@ArthurZucker I'm considering following @pesuchin's recommendation and adding `clean_up_tokenization_spaces=True` to avoid the warnings, but I'm very wary that hardcoding this option would create incompatibilities if some future transformers models are trained with `clean_up_tokenization_spaces=False`. If such a model is then loaded into Sentence Transformers (with `clean_up_tokenization_spaces=True`), the model is suddenly a lot worse. (I'm assuming here that the tokenization spaces affect tokens.)
I think hardcoding `clean_up_tokenization_spaces=True` would fail because of it. Would love to hear what you think.
cc @itazap
Hey! This is the future PR to deprecate to `False` by default: https://github.com/huggingface/transformers/pull/31938, and it will keep `clean_up_tokenization_spaces=True` for models that require it (such as BERT-based and some others; see the modified files in the PR).
In terms of future models, the `clean_up_tokenization_spaces` function itself arbitrarily strips whitespace (post tokenization), so I would say it is good practice for future models to have it explicitly set if that is the intention, or better yet have it be part of the tokenize logic directly. Let me know what you think 😄
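For reference, the whitespace stripping discussed above behaves roughly like the following simplified, pure-Python replica of the cleanup rules (an illustration, not the actual transformers implementation): it collapses the spaces a decoder emits before punctuation and contractions.

```python
def clean_up_tokenization(text: str) -> str:
    """Collapse extra spaces before punctuation/contractions in decoded text.

    Simplified replica of the cleanup rules, for illustration only.
    """
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" ' ", "' "), (" n't", "n't"), (" 'm", "'m"),
        (" 's", "'s"), (" 've", "'ve"), (" 're", "'re"),
    ]
    for before, after in replacements:
        text = text.replace(before, after)
    return text

print(clean_up_tokenization("do n't strip this , please ."))
# -> don't strip this, please.
```

This illustrates why the flag can matter downstream: with cleanup on, the decoded string is not a space-joined round trip of the tokens.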
Please, how can I solve this error?
@pradip292 what is the error you are experiencing? This warning is expected in order to communicate the future deprecation
After this warning, my Streamlit app automatically stops.
The warning is like this: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
@pradip292 Can you paste the error? Perhaps your streamlit needs to suppress warnings to render but the warning being present shouldn't result in an error
I did that, but I am still facing the same error.
I am getting the warning but my application is working perfectly fine. Are you sure your problem is related to the warning? @pradip292
I will check and update you after some time.
![Screenshot 2024-09-18 at 9 59 14 AM](https://github.com/user-attachments/assets/e834cb71-9228-48a1-8c8d-805249ae31b5)
/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
@SDArtz @pradip292 thanks for providing the output. This output is an expected warning that we want to display; it is not an error.
Now I am facing another error. I have tried many options but it still shows this error. What should I do?
raise SSLError(e, request=request) requests.exceptions.SSLError: (MaxRetryError("HTTPSConnectionPool(host='huggingface.co', port=443): Max retries exceeded with url: /sentence-transformers/all-mpnet-base-v2/resolve/main/config.json (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1000)')))"), '(Request ID: 52c928fd-ec37-4b18-ab6a-5f11209595a2)')
@pradip292 Are you able to browse to this URL: https://huggingface.co/sentence-transformers/all-mpnet-base-v2/resolve/main/config.json ?
It seems that you were (temporarily or otherwise) unable to automatically download this file, which is required to initialize the model. It is unrelated to the `clean_up_tokenization_spaces` warning.
@itazap @ArthurZucker
Thanks for the answer! This sounds like I should indeed defer to `transformers` and not hardcode anything in Sentence Transformers, under the impression that you will keep it as `True` for models for which it's required (and thus prevent any breaking changes). Please correct me if I'm wrong.
In that case, I should just wait for a new `transformers` version where the deprecation is merged and released?
Yes, I am facing that issue on my laptop only. My friends who tried on their laptops got it to work, but I don't know how to deal with it. I have downloaded all the SSL files and everything, but it still fails. Please help.
@tomaarsen yes exactly, the deprecation will maintain `True` for models that require it! 😊
This is probably caused by a firewall (man in the middle) that changes SSL certificates. We had similar errors with our company firewall. Check this out.
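If a corporate proxy is indeed re-signing TLS traffic, one common workaround (assuming you can export the proxy's root certificate; the path below is hypothetical) is to point Python's HTTP stack at that certificate. `requests` honors `REQUESTS_CA_BUNDLE` and the `ssl` module honors `SSL_CERT_FILE`:

```shell
# Hypothetical path -- replace with your company's root CA certificate (PEM).
export REQUESTS_CA_BUNDLE=/path/to/corporate-root-ca.pem
export SSL_CERT_FILE=/path/to/corporate-root-ca.pem
# ...then run your script in this same shell session.
```

Disabling certificate verification entirely would also silence the error, but it is unsafe and not recommended.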
I am also facing the same issue on my laptop and don't know how to solve it.
Issue is solved :-)
What is the solution? Can you please let me know?
Actually, that model is not working, so I am using another Hugging Face model; there are different models out there. As for the SSL errors, I just needed to download some certificate files that the code uses, which were missing. I got help from ChatGPT and was able to solve that error. (Sorry for my English, I am a student.)
Okay, I will try it out, thanks!!
I just added `clean_up_tokenization_spaces=False` directly to the `BertTokenizer` call:
self.tokenizer = BertTokenizer.from_pretrained(pretrained_bert_name, clean_up_tokenization_spaces=False)
Any news on a solution for the original issue?
Not yet. For the time being I changed the vector database to FAISS and the Groq model to llama3-8b-8192.
This is the future warning we are currently receiving:
transformers\tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884