codenotary / immudb

immudb - immutable database based on zero trust, SQL/Key-Value/Document model, tamperproof, data change history
https://immudb.io
Other
8.52k stars 337 forks source link

Composite index of "STRING" type #1754

Open SimoneLazzaris opened 11 months ago

SimoneLazzaris commented 11 months ago

Creating composing index on document collections is not currently working with STRING types: it let you create the index, but the an error occurs when a document is inserted.

prog8 commented 7 months ago

The problem is related to a way how the index key is created during the upsert. https://github.com/codenotary/immudb/blob/master/embedded/sql/stmt.go#L981-L999

First of all the ingredients of the index are encoded. For varchar type the value is encoded in a way that it first gets bytes from the string, then value is filled with 0 bytes and ad the end there is simply the lenght of the string. With the default settings encoded varchar ends up having 517 bytes. Later when the final key is created all the ingredients are concatenated. If we have 2 string columns used to build the index we end up having more than 517+517 bytes. In next steps there is a check if the key is not exceeding the maximum allowed size which defaults to 1024. And here is "kaboom".

First of all knowing the length limitations the solution could be to build a key from real characters (without filling with zero bytes). But of course this solution wouldn't be correct. This is because in such a way "foo" (from column1), "bar" (from column2) wouldn't be distinguishable from "fo" (from column1), "obar" (from column2). Another solution is to use a separator byte between columns in index and ignore the fill with zero bytes but still a user can really insert long strings (as long as 512 bytes) and we'll hit the same issue.

SimoneLazzaris commented 7 months ago

Good catch. I think that another possible solution could be to use the first n character of the string, then a separator and a strong hash (e.g.: sha256). n should be chosen so that n + 1 (separator) + hash_length = 512. In that way we can guarantee uniqueness and sorting, except for the pathologically long string.