Backlog for "Features" section

amotl commented 3 months ago

About

Coming from GH-53, there are a few backlog items, and there will be more.

Details

[ ] Features: Improve feature pages, which are still a bit thin.
- Origin: https://github.com/crate/cratedb-guide/pull/53#pullrequestreview-1933287728
[ ] Feature / Search: Improve layout
- Origin: https://github.com/crate/cratedb-guide/pull/53#discussion_r1529344885 by @surister
[ ] Feature / Vector: Add example using the Euclidean distance function
- Origin: https://github.com/crate/cratedb-guide/pull/53#discussion_r1685430878 by @amotl
[ ] Feature / Geo: What about querying with "donut" shapes?
- Origin: https://github.com/crate/cratedb-guide/pull/53#pullrequestreview-2160610407 by @seut
[x] Feature / Search: Hybrid Search
- Origin: https://github.com/crate/cratedb-guide/pull/53#pullrequestreview-2191183546
- Worklog: Add article https://cratedb.com/blog/hybrid-search-in-cratedb by @surister
- Done: Fixed with https://github.com/crate/cratedb-guide/pull/106
[x] Feature / Index: Hybrid Index
- Origin: https://cratedb.com/docs/guide/feature/index/
- Worklog: Use singular "Hybrid Index" instead of the plural form by @geragray
- Done: Fixed with c4fba3a711.
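
For the "donut" item above, a minimal sketch of what such a shape looks like may help when writing the docs. It assumes the usual GeoJSON convention that a Polygon's first ring is the exterior and any further rings are holes. The helper names (`point_in_ring`, `point_in_polygon`) and the coordinates are hypothetical, and the planar ray-casting test is only an illustration, not how CrateDB evaluates geo queries.

```python
def point_in_ring(point, ring):
    """Planar ray-casting test: is the point inside the closed ring?"""
    x, y = point
    inside = False
    # Walk over consecutive vertex pairs; the ring repeats its first vertex at the end.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def point_in_polygon(point, polygon):
    """A point matches a donut if it is inside the exterior ring but in no hole."""
    exterior, *holes = polygon["coordinates"]
    if not point_in_ring(point, exterior):
        return False
    return not any(point_in_ring(point, hole) for hole in holes)


# Hypothetical donut: a 10x10 square with a 2x2 hole in the middle.
donut = {
    "type": "Polygon",
    "coordinates": [
        [[0, 0], [10, 0], [10, 10], [0, 10], [0, 0]],  # exterior ring
        [[4, 4], [4, 6], [6, 6], [6, 4], [4, 4]],      # interior ring (hole)
    ],
}

on_the_ring = point_in_polygon((2, 2), donut)
in_the_hole = point_in_polygon((5, 5), donut)
outside = point_in_polygon((11, 5), donut)
```

A point on the "dough" matches, while points in the hole or outside the exterior ring do not.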

amotl commented 3 months ago

Discussion about Indexes

Hi guys, I'd like to better understand CrateDB indexes and have a few questions. Thanks in advance.

[1] https://cratedb.com/blog/indexing-and-storage-in-cratedb
[2] https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/

Questions

1. In [1] we say that "Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values." Is this still accurate in 2024, or do we implement any other data structures?
2. If I understood [1] correctly: given a column with the default index (plain), do we build both an inverted index and a columnar store (doc values) for text values, for example, or are doc values reserved for things like objects/arrays? Put another way: do we use only one index data structure per data type/column, or do we apply more than one and let the optimizer choose between them depending on the user's query?
3. In [1]: "...new documents are added to the existing index, they are added to the next segment ...the system may decide to merge some segments ...adding a new document does not require rebuilding the index structure". Is this the reason why our index-all strategy does not affect insert performance as much as in other databases where you manually build indexes on several columns? Or is there another strategy in place to mitigate the overhead of indexing every column?
4. In [2] they define a converged index as: row (LSM trees) + columnar + search (posting lists). In your opinion, how fair would it be to say that we also 'implement' a converged index?
5. In [1]: "...CrateDB implements Column Store based on Doc Values in Lucene". Does this mean that we just use Lucene's DocValues, or did we write our own implementation based on or inspired by it?

Answers

1. This is still accurate.
2. Everything gets doc values; numeric and geo types also get a BKD index; text and index fields also get postings lists. And yes, which index structure is used depends on the query.
3. Indexing is in general fast because it uses Lucene, which is optimized for fast writes. The segment-based structure means that all index files in a segment are written just once (with the exception of deletes), and segment merging is fast (mostly just concatenation) and done asynchronously. This is an old but still useful way of thinking about how the index is written.
4. We don't use LSM trees; we use BKD trees, which are, I guess, sort of similar. So you could argue that we have something which at least 'looks like' a converged index.
5. We use the Lucene implementations, with one minor change: we use a best-speed rather than a best-compression algorithm for text field doc values.
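
To make the segment idea from the answers above more tangible, here is a toy sketch of an append-only, segment-based inverted index. It is not Lucene's or CrateDB's implementation: the class names (`Segment`, `SegmentedIndex`) are hypothetical, merging runs synchronously here for simplicity, and deletes are left out. It only illustrates that adding documents creates new segments without rewriting existing ones, and that a merge mostly combines postings.

```python
from collections import defaultdict


class Segment:
    """An immutable mini-index: term -> sorted tuple of document ids."""

    def __init__(self, postings):
        # Freeze the postings so a segment is written exactly once.
        self.postings = {term: tuple(sorted(ids)) for term, ids in postings.items()}

    def search(self, term):
        return self.postings.get(term, ())


class SegmentedIndex:
    """Toy index that appends new segments instead of rebuilding old ones."""

    def __init__(self):
        self.segments = []
        self.buffer = defaultdict(set)

    def add(self, doc_id, text):
        # New documents only touch the in-memory buffer.
        for term in text.lower().split():
            self.buffer[term].add(doc_id)

    def flush(self):
        # Write the buffer out as a new segment; existing segments stay untouched.
        if self.buffer:
            self.segments.append(Segment(self.buffer))
            self.buffer = defaultdict(set)

    def merge(self):
        # Merging combines the postings of all segments into one larger segment.
        merged = defaultdict(set)
        for segment in self.segments:
            for term, ids in segment.postings.items():
                merged[term].update(ids)
        self.segments = [Segment(merged)]

    def search(self, term):
        # A query consults every flushed segment and unions the results.
        hits = set()
        for segment in self.segments:
            hits.update(segment.search(term.lower()))
        return sorted(hits)


index = SegmentedIndex()
index.add(1, "fast writes")
index.flush()
index.add(2, "fast merges")
index.flush()
segments_before_merge = len(index.segments)
index.merge()
```

Each `flush()` produces a new segment, so the two inserts above never rewrite earlier data; only `merge()` consolidates them.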

Thoughts

Thanks for that Q&A, @surister and @romseygeek.
This should be slotted into the "feature/storage" section, in one way or another.

References

https://github.com/crate/cratedb-guide/pull/53#discussion_r1685551074

amotl commented 3 months ago

At crate-clients-tools, specifically CI run #9710088632.

WARNING: undefined label: 'guide:metrics'

amotl commented 3 months ago

At the beginning of the page about Hybrid Search, you talk about how vectors alone are not enough, hence we need to mix them with BM25. This is very well written up in the description of https://haystackconf.com/us2023/talk-16/; maybe it can serve as an inspiration?

Thanks!
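
As a possible illustration for that page, here is a minimal sketch of mixing a vector ranking with a BM25 ranking. The comment above does not name a fusion method; reciprocal rank fusion (RRF) is assumed here as one common choice. The function name, the document ids, and the constant `k=60` are illustrative assumptions, not taken from the CrateDB docs.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that rank well in both lists (`doc_b`, `doc_a`) end up ahead of documents found by only one of the two searches.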

amotl commented 3 months ago

My personal immediate favourite backlog items for the All Features at a Glance page would be:

[ ] Improve "Highlights" section: Add info cards about Hybrid Index and Hybrid Search. Add performance details, like the recent blog post by Henrik about it.
[ ] Think about renaming »Document Store« => »Document / JSON«.
[ ] Think about renaming »Relational / JOINs« => »Distributed Joins«.

crate / cratedb-guide

Backlog for "Features" section #101
