Open tonybaloney opened 1 year ago
cc @ks6088ts for insights on whether they've seen this with their Japanese users
@tonybaloney @pamelafox
Great suggestion :)
It is true that Japanese sentences aren't always punctuated in the way the script expects.
Currently, TextSplitter has word_breaks settings hard-coded internally.
IMO, it would be better to inject those settings through the constructor, just as LangChain::CharacterTextSplitter does, and add some documentation about customizing them for each language.
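A minimal sketch of what constructor injection could look like. The class and parameter names here are illustrative, not the repo's actual API:

```python
# Illustrative defaults; the real script's lists are longer.
DEFAULT_SENTENCE_ENDINGS = [".", "!", "?"]
DEFAULT_WORD_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]


class TextSplitter:
    def __init__(self, sentence_endings=None, word_breaks=None,
                 max_section_length=1000):
        # Callers can override the break characters per language instead of
        # relying on the hard-coded internal settings.
        self.sentence_endings = sentence_endings or list(DEFAULT_SENTENCE_ENDINGS)
        self.word_breaks = word_breaks or list(DEFAULT_WORD_BREAKS)
        self.max_section_length = max_section_length

    def split_sentences(self, text):
        """Split text into sentences, keeping the ending character attached."""
        sentences, start = [], 0
        for i, ch in enumerate(text):
            if ch in self.sentence_endings:
                sentences.append(text[start:i + 1])
                start = i + 1
        if start < len(text):
            sentences.append(text[start:])
        return sentences


# A Japanese-aware instance just injects 。 as an additional sentence ending.
ja_splitter = TextSplitter(sentence_endings=["。", ".", "!", "?"])
```

Documentation could then list recommended `sentence_endings`/`word_breaks` values per language.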
More work beyond prepdocs is needed to make this app multi-lingual.
https://learn.microsoft.com/en-us/azure/search/search-language-support
We also need to adjust our splitter to be token-based. Currently, if you split a Chinese document at 1,000 characters, you can't even fit three chunks in a single ChatCompletion call.
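A rough sketch of a token-budget splitter. In a real implementation `count_tokens` would wrap an actual tokenizer (e.g. tiktoken's `cl100k_base` encoding); the heuristic below is only a stand-in that assumes CJK characters cost roughly two tokens each, which is why a 1,000-character Chinese chunk can blow past a 1,000-token budget:

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: CJK characters often encode as
    1-2 tokens each, while English averages ~4 characters per token."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return 2 * cjk + other // 4


def split_by_tokens(text: str, max_tokens: int = 1000) -> list[str]:
    """Greedily pack characters into chunks whose estimated token count stays
    within max_tokens. (A production version would prefer sentence breaks.)"""
    chunks, current = [], ""
    for ch in text:
        if current and count_tokens(current + ch) > max_tokens:
            chunks.append(current)
            current = ch
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks
```

With this estimate, a 1,000-character Chinese document splits into ~500-character chunks, illustrating how character-based limits roughly double the effective token cost for CJK text.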
This demo app does work with languages other than English; however, the prepdocs script makes some assumptions about the input characters.
For example, Japanese doesn't always punctuate sentences with an ASCII period; the symbol 。 is more common than `.`. There are also different quote marks, like 「 」, and the triangle brackets ⟨ ⟩ are used. The comma is also a different Unicode character.
We could use this CJK punctuation chart as a starting point and read the encoding of the input file.
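As a starting point, a hedged sketch of a sentence splitter whose boundary set includes the full-width CJK endings mentioned above alongside the ASCII ones (the pattern is illustrative; a complete version would draw on the full CJK punctuation block):

```python
import re

# Zero-width lookbehind keeps the ending character attached to its sentence.
# Covers ASCII endings plus the full-width CJK forms 。！？ (the full-width
# comma 、 is a word break, not a sentence ending, so it is excluded here).
SENTENCE_END = re.compile(r"(?<=[.!?。！？])")


def split_cjk_sentences(text: str) -> list[str]:
    """Split text at ASCII or CJK sentence endings, dropping empty pieces."""
    return [s for s in SENTENCE_END.split(text) if s]
```

Reading the input file's encoding (or detecting the script of its content) could then decide which boundary set to apply.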
I have no knowledge of other languages. Wikipedia suggests that Hebrew and Arabic have some special punctuation, but I don't know how exhaustive this list is: https://en.wikipedia.org/wiki/Category:Punctuation_of_specific_languages