deltachat / deltachat-core-rust

Delta Chat Rust Core library, used by Android/iOS/desktop apps, bindings and bots 📧
https://delta.chat/en/contribute
Other
659 stars 85 forks source link

Search is case sensitive for non-ASCII messages #5052

Closed link2xt closed 3 months ago

link2xt commented 10 months ago

In SQLite search with LIKE is case-insensitive: https://github.com/deltachat/deltachat-core-rust/blob/b779d08d7f7d02d7b9570e4b0cf87ccc2e255517/src/context.rs#L933

This however does not work for non-english letters. Using functions like UPPER also does not help as they only work for ASCII and using a function on txt column will prevent using indexes anyway.

It is a well-known problem: https://shallowdepth.online/posts/2022/01/5-ways-to-implement-case-insensitive-search-in-sqlite-with-full-unicode-support/

The solution is to create a new column txt_normalized defaulting to NULL (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created. When doing a search, search over IFNULL(txt_normalized, txt).

r10s commented 10 months ago

that would double the space needed for stored text, maybe okay when compared to what blobs need.

what would be another consideration for having an extra field for searching is that we could also add other things there - as names of files or webxdc apps - currently, you cannot search for that. appending these information to txt_normalized can make that accessible without degrading performance

that said, there is also FULLTEXT and ICU extensions, that may be alternatives (ICU is mentioned in the linked post to be complicated, however, not sure if that also applied to rust). for FULLTEXT, it is still astonishing how far and fast we can go without

link2xt commented 10 months ago

ICU is mentioned in the linked post to be complicated, however, not sure if that also applied to rust

It is complicated, we currently just use vendored SQLCipher and managing our own version of SQLite with SQLCipher and additional extensions is going to make compiling the core more difficult for everyone. There is an issue in rusqlite repo: https://github.com/rusqlite/rusqlite/issues/531

link2xt commented 9 months ago

This migration passes by the way:

+    if dbversion < 106 {
+        sql.execute_migration(
+            r"CREATE VIRTUAL TABLE search USING fts5(text)",
+            106
+        ).await?;
+    }

So we can use https://www.sqlite.org/fts5.html

iequidoo commented 7 months ago

Tried FTS5, but this approach has its own problems. We should use MATCH instead of LIKE to perform a case-insensitive search, but then we lose a capability of matching against parts of words (if we use the unicode61 tokeniser). We can use the trigram tokeniser then for substring matching, but anyway MATCH should be used for case-insensitive search (it's not stated explicitly in the docs, but i failed to get LIKE working). That means we should escape the search pattern correctly (when using LIKE we just pass %pattern% f.e.). And also queries of fewer than 3 chars wouldn't work:

Substrings consisting of fewer than 3 unicode characters do not match any rows when used with a full-text query.

I think the most simple way that doesn't break the current UX also is what @link2xt suggested initially:

The solution is to create a new column txt_normalized defaulting to NULL (so we do not bump up the database size in a migration) and storing lowercased/normalized text there when the row is created. When doing a search, search over IFNULL(txt_normalized, txt).

Moreover, if a normalised text doesn't differ from the source one or contains only ASCII, we can omit storing it thus reducing the db size.

EDIT: Note that matching against parts of words is critical for CJK langs, i don't know if there are any tokenisers for them.