Open tonybaloney opened 1 year ago
cc @ks6088ts for insights on whether they've seen this with their Japanese users
@tonybaloney @pamelafox
Great suggestion :)
It is true that Japanese sentences aren't always punctuated in the way the script expects.
Currently, TextSplitter has word_breaks settings hard-coded internally.
IMO, it would be better to inject those settings through the constructor, just as LangChain::CharacterTextSplitter does, and add some documentation about customizing them for each language.
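A minimal sketch of what constructor injection could look like. The class and parameter names here are illustrative, not the repo's actual API:

```python
# Illustrative defaults; the real script's lists are longer.
DEFAULT_SENTENCE_ENDINGS = [".", "!", "?"]
DEFAULT_WORD_BREAKS = [",", ";", ":", " ", "(", ")", "[", "]", "{", "}", "\t", "\n"]


class TextSplitter:
    def __init__(self, sentence_endings=None, word_breaks=None,
                 max_section_length=1000):
        # Callers can override the break characters per language instead of
        # relying on the hard-coded internal settings.
        self.sentence_endings = sentence_endings or list(DEFAULT_SENTENCE_ENDINGS)
        self.word_breaks = word_breaks or list(DEFAULT_WORD_BREAKS)
        self.max_section_length = max_section_length

    def split_sentences(self, text):
        """Split text into sentences, keeping the ending character attached."""
        sentences, start = [], 0
        for i, ch in enumerate(text):
            if ch in self.sentence_endings:
                sentences.append(text[start:i + 1])
                start = i + 1
        if start < len(text):
            sentences.append(text[start:])
        return sentences


# A Japanese-aware instance just injects 。 as an additional sentence ending.
ja_splitter = TextSplitter(sentence_endings=["。", ".", "!", "?"])
```

Documentation could then list recommended `sentence_endings`/`word_breaks` values per language.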
More work beyond prepdocs is needed to make this app multi-lingual.
https://learn.microsoft.com/en-us/azure/search/search-language-support
We also need to adjust our splitter to be token-based. Currently, if you split a Chinese document at 1,000 characters, you can't even fit three chunks in a single ChatCompletion call.
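A rough sketch of a token-budget splitter. In a real implementation `count_tokens` would wrap an actual tokenizer (e.g. tiktoken's `cl100k_base` encoding); the heuristic below is only a stand-in that assumes CJK characters cost roughly two tokens each, which is why a 1,000-character Chinese chunk can blow past a 1,000-token budget:

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: CJK characters often encode as
    1-2 tokens each, while English averages ~4 characters per token."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    other = len(text) - cjk
    return 2 * cjk + other // 4


def split_by_tokens(text: str, max_tokens: int = 1000) -> list[str]:
    """Greedily pack characters into chunks whose estimated token count stays
    within max_tokens. (A production version would prefer sentence breaks.)"""
    chunks, current = [], ""
    for ch in text:
        if current and count_tokens(current + ch) > max_tokens:
            chunks.append(current)
            current = ch
        else:
            current += ch
    if current:
        chunks.append(current)
    return chunks
```

With this estimate, a 1,000-character Chinese document splits into ~500-character chunks, illustrating how character-based limits roughly double the effective token cost for CJK text.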
This demo app does work with languages other than English; however, the prepdocs script makes some assumptions about the input characters.
For example, Japanese doesn't always punctuate sentences with an ASCII period; the symbol 。 is more common than `.`. There are also different quote marks, like 「 」, and the triangle brackets ⟨ ⟩ are used. The comma is also a different Unicode character.
We could use this CJK punctuation chart as a starting point and read the encoding of the input file.
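As a starting point, a hedged sketch of a sentence splitter whose boundary set includes the full-width CJK endings mentioned above alongside the ASCII ones (the pattern is illustrative; a complete version would draw on the full CJK punctuation block):

```python
import re

# Zero-width lookbehind keeps the ending character attached to its sentence.
# Covers ASCII endings plus the full-width CJK forms 。！？ (the full-width
# comma 、 is a word break, not a sentence ending, so it is excluded here).
SENTENCE_END = re.compile(r"(?<=[.!?。！？])")


def split_cjk_sentences(text: str) -> list[str]:
    """Split text at ASCII or CJK sentence endings, dropping empty pieces."""
    return [s for s in SENTENCE_END.split(text) if s]
```

Reading the input file's encoding (or detecting the script of its content) could then decide which boundary set to apply.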
I have no knowledge of other languages. Wikipedia suggests that Hebrew and Arabic have some special punctuation, but I don't know how exhaustive this list is: https://en.wikipedia.org/wiki/Category:Punctuation_of_specific_languages