chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Querying with certain foreign language data is failing to return correctly with sqlite3 fts5 tokenizer='trigram' #1073

Open wojang-ziumks opened 1 year ago

wojang-ziumks commented 1 year ago

What happened?

For certain Asian languages, e.g. Korean (and potentially Chinese/Japanese), sqlite3's built-in tokenizers don't work well without ICU support.

Testing this on the sqlite file with the fts5 tokenizer=trigram, trigrams correctly catch the data whenever the search term is at least 3 characters long:

SELECT * FROM embedding_fulltext_search WHERE string_value LIKE '%순매수%';

returns: 주요 신흥 4개국 증시 외국인투자자 순매수

but

SELECT * FROM embedding_fulltext_search WHERE string_value LIKE '%순매%';

doesn't return anything.
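The trigram behavior above can be reproduced with a small in-memory sketch using Python's bundled sqlite3 (the trigram tokenizer requires SQLite >= 3.34; whether the two-character LIKE falls back to a full scan or returns nothing depends on the SQLite version, so only the three-character case is asserted here):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# FTS5 table with the trigram tokenizer, mirroring the fulltext index above.
con.execute("CREATE VIRTUAL TABLE t3 USING fts5(x, tokenize = 'trigram')")
con.execute("INSERT INTO t3 VALUES ('주요 신흥 4개국 증시 외국인투자자 순매수')")

# A LIKE pattern of 3+ characters can be satisfied via the trigram index.
rows = con.execute("SELECT x FROM t3 WHERE x LIKE '%순매수%'").fetchall()
print(rows)  # one matching row

# A 2-character pattern cannot be expressed as trigrams; depending on the
# SQLite version this either returns nothing (the behavior reported here)
# or falls back to a full-table scan.
short = con.execute("SELECT x FROM t3 WHERE x LIKE '%순매%'").fetchall()
print(short)
```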

Having tried the fts5 porter and unicode61 tokenizers, they only catch terms separated by whitespace: CREATE VIRTUAL TABLE tascii USING fts5(x, tokenize = 'ascii'); CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61');

SELECT * FROM tuni('신흥'); SELECT * FROM tascii('신흥');

returns 주요 신흥 4개국 증시 외국인투자자 순매수액

correctly but

SELECT * FROM tuni('순매'); SELECT * FROM tascii('순매');

fails to return the same data.

But fortunately, the WHERE x LIKE form of the query, which fails for trigrams, does return correctly here:

SELECT * FROM tuni WHERE x LIKE '%순매%'; SELECT * FROM tascii WHERE x LIKE '%순매%';

returns

주요 신흥 4개국 증시 외국인투자자 순매수액
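A minimal in-memory sketch of the unicode61 behavior described above: a whole whitespace-delimited token matches via MATCH, a sub-token fragment does not, and LIKE still finds it because LIKE ignores tokenization and scans the stored text:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# unicode61 splits on whitespace/punctuation, so each space-separated
# Korean phrase becomes a single token.
con.execute("CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61')")
con.execute("INSERT INTO tuni VALUES ('주요 신흥 4개국 증시 외국인투자자 순매수액')")

# '신흥' is a whole token, so MATCH finds it.
whole = con.execute("SELECT x FROM tuni('신흥')").fetchall()

# '순매' is only a prefix of the token '순매수액'; a plain MATCH misses it.
partial = con.execute("SELECT x FROM tuni('순매')").fetchall()

# LIKE scans the stored text directly, so the substring is still found.
scan = con.execute("SELECT x FROM tuni WHERE x LIKE '%순매%'").fetchall()

print(whole, partial, scan)
```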

A potential fix for queries failing to return the correct data for these languages might be to switch the fts5 tokenizer, and possibly the way it is queried. One concern is that I don't know how this would impact performance or other languages.

The best-case scenario would be to integrate the ICU tokenizer into chromadb's sqlite usage, but that requires a larger effort.
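Short of an ICU build, one partial interim workaround with a whitespace tokenizer is an fts5 prefix query, which matches the leading characters of a token. This is only a sketch of the idea, and it helps only for prefixes, not arbitrary substrings:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE tuni USING fts5(x, tokenize = 'unicode61')")
con.execute("INSERT INTO tuni VALUES ('주요 신흥 4개국 증시 외국인투자자 순매수액')")

# '순매*' matches any token starting with '순매', e.g. '순매수액'.
rows = con.execute("SELECT x FROM tuni WHERE x MATCH '순매*'").fetchall()
print(rows)

# An infix fragment like '매수' still cannot be found this way,
# since no token *starts* with it.
infix = con.execute("SELECT x FROM tuni WHERE x MATCH '매수*'").fetchall()
print(infix)  # []
```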

Versions

chroma 0.4.8 python 3.10.11

Relevant log output

No response

tazarov commented 1 year ago

Thanks, @wojang-ziumks; I think we need to think this through, but generally, adding the unicode61 tokenizer shouldn't have much impact on support for other languages, maybe just a little additional CPU overhead.

Sample for the change:

CREATE VIRTUAL TABLE embedding_fulltext USING fts5(id, string_value,tokenize = 'unicode61');

I can do the change and run some local tests to verify the above.

tazarov commented 1 year ago

@wojang-ziumks, I had a discussion with @HammadB about this. While the suggested solution above works, the issue is that if we add it outright, it will break existing deployments. So now we're considering how to let users pick their own sqlite tokenizer, and possible migration paths.
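Since an fts5 table's tokenizer cannot be changed in place, any migration path would have to rebuild the index. A hedged sketch of what that could look like (table and column names follow the examples in this issue and are assumptions, not Chroma's actual migration code):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Stand-in for the existing index with the old tokenizer.
con.execute(
    "CREATE VIRTUAL TABLE embedding_fulltext "
    "USING fts5(id, string_value, tokenize = 'trigram')"
)
con.execute(
    "INSERT INTO embedding_fulltext VALUES "
    "('1', '주요 신흥 4개국 증시 외국인투자자 순매수')"
)

# Migration: build a replacement table with the new tokenizer,
# copy the rows over, then swap the tables.
con.execute(
    "CREATE VIRTUAL TABLE embedding_fulltext_new "
    "USING fts5(id, string_value, tokenize = 'unicode61')"
)
con.execute(
    "INSERT INTO embedding_fulltext_new "
    "SELECT id, string_value FROM embedding_fulltext"
)
con.execute("DROP TABLE embedding_fulltext")
con.execute("ALTER TABLE embedding_fulltext_new RENAME TO embedding_fulltext")

# Whole-token queries now work under the new tokenizer.
rows = con.execute(
    "SELECT string_value FROM embedding_fulltext('신흥')"
).fetchall()
print(rows)
```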

HammadB commented 1 year ago

@jeffchuber This is the same as supporting custom indices

tazarov commented 12 months ago

Ideally this will be part of #1125

h3clikejava commented 2 weeks ago

It’s been a year, and the issue still exists.