Naive cardinality estimator for a triple pattern using ScyllaDB's estimates

The approach is inspired by the M. Stocker's work.

The cardinality of a triple pattern (i.e. card(t)) is equal to card(t) = card(s) * card(p) * card(o).

The card(s) is based on an estimate from ScyllaDB which maintains "Number of partitions (estimate)" for each table, so we can count the number of distinct subjects by the number of partitions in the S_POC table. Thus card(s) = card(*) / <number of partitions in S_POC> where card(*) is the total number of triples in triplestore.

Same as card(s), the card(o) is based on the estimate from ScyllaDB for the O_SPC table. Thus card(o) = card(*) / <number of partitions in O_SPC>.

The card(p) is fetched from a separate index where each p has the number of triples containing it. Such index is quite small (e.g. for http://tree.datafabric.cc it's just ~680Kb on disk) and easy to maintain (it's a counter which can be incremented and decremented). But still it's only an estimate if insert the same triple 2 times, then the counter is incremented 2 times as well.

If both p and o are bound, then instead of card(o) we use card(p, o). The card(p, o) denotes the number of triples containing both p and o. It's fetched from a separate index where for each p there is a histogram.

The same way, to estimate the card(*) and in future card(c), a separate index is maintained, similarly to the card(p)'s one.

Limitations:

Assumes independence of the subject, predicate and object cardinalities.
Assumes that each subject and object have the same number of relations.
Doesn't use the cardinality of the context if it's bound, i.e. card(c).
Requires to do nodetool flush after any write to ScyllaDB, so it'd recompute the estimates.

Estimates from ScyllaDB are queried through the JMX API which's also used by nodetool.

DataFabricRus / scylla-rdf

Naive cardinality estimator for a triple pattern using ScyllaDB's estimates #1