Closed: MohamedISSA98 closed this issue 5 months ago.
@MohamedISSA98 Thanks for your feedback! We will investigate and update as appropriate.
Hello @MohamedISSA98 The preprocessing step of removing punctuation and special characters is a common technique used in natural language processing to help improve the accuracy of models. This is because special characters and punctuation can often be noise in the data, and can make it more difficult for models to accurately predict the meaning of the text.
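To illustrate, the kind of cleanup the tutorial applies looks roughly like the following sketch (this is not the exact function from the documentation, just an example of the idea, which also shows why '\n' and punctuation disappear before embedding):

```python
import re

def normalize_text(s: str) -> str:
    """Rough cleanup in the spirit of the tutorial's preprocessing step."""
    # Replace punctuation and special characters with spaces
    # (keep word characters and whitespace only).
    s = re.sub(r"[^\w\s]", " ", s)
    # Collapse newlines and runs of whitespace into single spaces.
    s = re.sub(r"\s+", " ", s)
    return s.strip()

print(normalize_text("Hello,\nworld!! -- this is a test."))
# -> "Hello world this is a test"
```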
Regarding your question about why the tokenizer doesn't handle unknown tokens, it is important to note that tokenization is just one step in the process of natural language processing. While tokenization can help to break text down into smaller, more manageable pieces, it does not necessarily handle all types of unknown tokens or special characters. This is why additional preprocessing steps may be necessary to help improve the accuracy of models.
It is possible that the LangChain library that you used for indexing is designed to handle special characters and unknown tokens in a different way than the text-embedding-ada model. Without more information about the specific implementation of the LangChain library, it is difficult to say for certain why you did not encounter this issue when using that library.
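For comparison, a typical LangChain embedding call looks roughly like the sketch below. The exact package, class, and parameter names depend on the LangChain version you use; AzureOpenAIEmbeddings, the deployment name, and the API version shown here are assumptions for an Azure deployment, not a statement of how your setup is configured:

```python
# Sketch only: class and parameter names vary across LangChain versions.
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    azure_deployment="text-embedding-ada-002",  # assumed deployment name
    openai_api_version="2023-05-15",            # assumed API version
)  # endpoint and key are read from AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY

vectors = embeddings.embed_documents(["raw text, with punctuation!\nand newlines"])
print(len(vectors[0]))  # 1536 dimensions for ada-002
```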
I hope this helps to clarify the issue. If there are any further questions regarding the documentation, please tag me in your reply and we will be happy to continue the conversation.
@MohamedISSA98 We are going to close this thread. If there are any further questions regarding the documentation, please tag me in your reply and we will be happy to continue the conversation.
@Naveenommi-MSFT Thanks for your reply. I noticed that embeddings are quite sensitive to the special character '\n', which is already removed in the preprocessing function of the documentation. Although removing punctuation as part of preprocessing might help produce more accurate embeddings, keeping it should not cause an error, since the tokenizer can map punctuation to its corresponding token IDs (which I confirmed with tiktoken). The error I'm getting says: "numpy is not installed / base64 optimisation isn't enabled for this model yet". Since I have numpy installed, I suspect the issue is related to the base64 optimisation. When I change the encoding format to 'float', I get an HTML error page.
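A quick check along these lines shows that punctuation and '\n' do map to token IDs with the cl100k_base encoding used by text-embedding-ada-002:

```python
import tiktoken

# cl100k_base is the encoding used by text-embedding-ada-002
enc = tiktoken.get_encoding("cl100k_base")

text = "Hello, world!\nSpecial chars: @#$%&*"
token_ids = enc.encode(text)
print(token_ids)                      # every character maps to some token ID
print(enc.decode(token_ids) == text)  # True: the encoding round-trips losslessly
```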
@MohamedISSA98 The error message suggests that the base64 optimization is not enabled for the model, which could be causing the issue. You can try using a different encoding format or checking the model configuration to see whether there is a problem with the base64 optimization. If you are still getting the error and need further help with this, please log a case here: https://learn.microsoft.com/en-us/azure/azure-portal/supportability/how-to-create-azure-support-request. Thank you.
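For example, with the openai Python client you can request float embeddings explicitly instead of relying on the base64 optimisation. This is only a sketch; the endpoint, key, API version, and deployment name below are placeholders and depend on your Azure OpenAI configuration:

```python
from openai import AzureOpenAI

# Assumed configuration values; replace with your own resource details.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<your-api-key>",
    api_version="2023-05-15",
)

response = client.embeddings.create(
    input=["some text to embed"],
    model="text-embedding-ada-002",  # deployment name in Azure OpenAI
    encoding_format="float",         # request floats instead of the base64 path
)
print(len(response.data[0].embedding))
```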
I'm getting an error on a few strings when trying to embed them with "text-embedding-ada".
I have around 125 texts with a maximum token count of 625. When performing the embedding, I got an error on some texts ('str' object has no attribute 'data') when accessing the embedding with client.embeddings.create(input=[text], model=model).data[0].embedding. In the documentation, I saw that there is a preprocessing step that removes punctuation and special characters. I tried that and it worked. But I don't understand why we should perform this type of preprocessing prior to embedding.
Why doesn't the tokenizer handle the unknown tokens?
I used the same model before for indexing with the LangChain library, using the OpenAIEmbeddings class, and didn't encounter this issue. Could you explain this issue?
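For reference, here is a minimal sketch of the failing call described above, with a small check added to show what actually comes back when the .data attribute is missing. The embed_or_report helper is only for illustration, and the client setup is an assumption based on environment variables:

```python
from openai import AzureOpenAI

# Assumes AZURE_OPENAI_ENDPOINT, AZURE_OPENAI_API_KEY and OPENAI_API_VERSION
# are set in the environment; adjust to your own client setup.
client = AzureOpenAI()
model = "text-embedding-ada-002"  # Azure deployment name (assumed)

def embed_or_report(text: str):
    """Return the embedding, or print what actually came back if the response
    is not the expected object (e.g. an error page returned as a string)."""
    response = client.embeddings.create(input=[text], model=model)
    if not hasattr(response, "data"):
        print(f"Unexpected response of type {type(response)!r}: {response!r}")
        return None
    return response.data[0].embedding

vector = embed_or_report("Example text, with punctuation!\nand a newline")
```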
https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/embeddings?tabs=python-new%2Ccommand-line&pivots=programming-language-python