cayleygraph / cayley

An open-source graph database
https://cayley.io
Apache License 2.0
14.83k stars 1.25k forks source link

Use HyperLogLog to Help the Optimizier #643

Open barakmich opened 6 years ago

barakmich commented 6 years ago

HyperLogLog allows you to, in a fixed number of bits, keep track of the cardinality of a set. It works a little like a Bloom filter...

hll := NewHLL()
hll.Add("foo")
hll.Add("foo") // Duplicates ignored
hll.Add("bar")

x := hll.Size() // 2

The proposal is this: As a new index -- but only for nodes which appear in the predicate field -- we keep some new data

bucket "hll":
  ... 
  "</film/film/starring>": { subject_hll: bytes, object_hll: bytes}
  ...
  ...

And when a new quad comes in:

</en/some_film> </film/film/starring> <_:32423> .

We add the subject and object to the respective HLLs in the index.

pred.subject_hll.Add("</en/some_film>")
pred.object_hll.Add("<_:32423>")

Now, currently, we assume a fanout of 20 * underlying when trying to estimate the Size of a LinksTo. Instead, we can do much better, if we know the predicate:

max(pred.subject_hll.Size() / pred.object_hll.Size(), 1)

Or the reciprocal, depending on the direction of the link.

This means predicates like "name" will have a fan-out of 1 -- people generally only have one name. However, a given name will have a slightly larger fan-in -- maybe 1.5 -- as some names have more than one person (the "John Smith" effect)

dennwc commented 6 years ago

This will be very helpful, since the "name" case is very typical for GraphQL (most properties has one value), Gizmo (Has, Save), etc.

For query optimizer, if it knows that a predicate has exactly one value, it's always better to do lookup/check for that predicate upfront. It can then re-run optimization again to properly estimate fan-out after computing specific nodes in the resulting set.

Even better, since HLL supports merge/intersection, we may try to merge multiple filters at runtime to estimate fan-out for intersections of multiple LinksTo.

The only problem as I see now - it does not support deletion, and modification of such filter might require some tricks with transactions in backend implementation.