crate / cratedb-guide

The CrateDB Guide.
https://cratedb.com/docs/guide/
Apache License 2.0

Backlog for "Features" section #101

Open amotl opened 3 months ago

amotl commented 3 months ago

About

Coming from GH-53, there are a few backlog items, and there will be more.

Details

amotl commented 3 months ago

Discussion about Indexes

Hi guys, I'd like to better understand CrateDB indexes. I have a few questions; thanks in advance.

[1] https://cratedb.com/blog/indexing-and-storage-in-cratedb
[2] https://rockset.com/blog/converged-indexing-the-secret-sauce-behind-rocksets-fast-queries/

Questions

  1. In [1] we say that CrateDB uses "Inverted Indexes for text values, BKD-Trees for numeric values, and Doc Values." Is this still accurate in 2024, or do we implement any other data structures?
  2. If I understood [1] correctly, given a column with the default index (plain), do we build both an inverted index and a columnar store (doc values) for text values, for example, or are doc values reserved for things like objects/arrays? Put another way: do we use only one index data structure per data type/column, or do we build more than one and let the optimizer choose between them depending on the user's query?
  3. In [1]: "...new documents are added to the existing index, they are added to the next segment ...the system may decide to merge some segments ...adding a new document does not require rebuilding the index structure". Is this the reason why our index-all strategy does not affect insert performance as much as in other databases where you manually build indexes on several columns? Or is there any other strategy in place to mitigate the overhead of indexing every column?
  4. In [2], they define a converged index as row (LSM trees) + columnar + search (posting lists). In your opinion, how fair would it be to say that we also 'implement' a converged index?
  5. In [1]: "...CrateDB implements Column Store based on Doc Values in Lucene". Does this mean that we just use Lucene's DocValues, or did we write our own implementation based on or inspired by it?

Answers

  1. This is still accurate.
  2. Everything gets doc values; numeric and geo types also get a BKD index; text and index fields also get postings lists. And yes, which index structure is used depends on the query.
  3. Indexing is fast in general because it uses Lucene, which is optimized for fast writes. The segment-based structure means that all index files in a segment are written just once (with the exception of deletes), and segment merging is fast (mostly just concatenation) and done asynchronously - this is an old but still useful way of thinking about how the index is written.
  4. We don't use LSM trees; we use BKD trees, which are, I guess, sort of similar, so you could argue that we have something that at least 'looks like' a converged index.
  5. We use the Lucene implementations, with one minor change: text field doc values use a best-speed rather than a best-compression algorithm.
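
To make answers 2 and 3 more tangible, here is a minimal toy sketch in Python. It is not CrateDB or Lucene code: the names `index_structures`, `Segment`, and `ToyIndex`, as well as the example type names, are made up for illustration. It only mirrors the two ideas stated above: every column gets doc values plus a type-dependent extra structure, and segments are written once and later merged by (mostly) concatenation.

```python
from dataclasses import dataclass, field

# Illustrative type groups only; these sets are an assumption of this sketch,
# not an exhaustive or authoritative list of CrateDB data types.
NUMERIC_OR_GEO = {"integer", "double", "geo_point"}
TEXT_LIKE = {"text"}


def index_structures(column_type):
    """Return the index structures a column would get, per answer 2."""
    structures = {"doc_values"}  # everything gets doc values
    if column_type in NUMERIC_OR_GEO:
        structures.add("bkd_tree")  # numeric and geo types also get a BKD index
    if column_type in TEXT_LIKE:
        structures.add("postings_list")  # text fields also get postings lists
    return structures


@dataclass(frozen=True)
class Segment:
    """A write-once batch of documents (deletes are ignored in this toy)."""
    docs: tuple


@dataclass
class ToyIndex:
    segments: list = field(default_factory=list)

    def add_batch(self, docs):
        # New documents go into a new segment; existing segments are untouched,
        # so nothing already written has to be rebuilt.
        self.segments.append(Segment(tuple(docs)))

    def merge(self):
        # Merging is modelled as plain concatenation of all segments.
        merged = tuple(doc for segment in self.segments for doc in segment.docs)
        self.segments = [Segment(merged)]


index = ToyIndex()
index.add_batch(["doc-1", "doc-2"])
index.add_batch(["doc-3"])
index.merge()
```

In real Lucene, merges run asynchronously and according to a merge policy; the sketch collapses that into a single synchronous `merge()` call for brevity.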

Thoughts

References

amotl commented 3 months ago

At crate-clients-tools, specifically CI run #9710088632.

WARNING: undefined label: 'guide:metrics'

amotl commented 3 months ago

@surister suggested at https://github.com/crate/cratedb-guide/pull/106#issuecomment-2258388007:

At the beginning of the page about Hybrid Search, you talk about how vectors alone are not enough, hence we need to combine them with BM25. This is very well explained in the description of https://haystackconf.com/us2023/talk-16/; maybe it can serve as inspiration?

Thanks!

amotl commented 3 months ago

My personal immediate favourite backlog items for the All Features at a Glance page would be: