dgraph-io / dgraph

The high-performance database for modern applications
https://dgraph.io

[FEATURE]: Optimize use of type(T) and eq(dgraph.type, T) #8587

Closed damonfeldman closed 2 months ago

damonfeldman commented 1 year ago

Have you tried Dgraph before this proposal? and did not find anything similar?

None

What you wanted to do.

Use types freely in queries without a performance penalty.

What you actually did.

Types often slow down queries.

The dgraph.type predicate table is often the largest in the system, because nearly every node has a type, whereas most other predicates are restricted to a single type. For example, with 5M Users and 4M Tasks, all 9M nodes have a dgraph.type, but only the 5M Users have a userName. We should be extra careful about how we process dgraph.type.

In particular

q(func:eq(someID, 0x1234)) @filter(type(T)) { ...}

often (always?) runs slowly. It seems to retrieve both index structures in parallel (the type index and the UIDs matching someID) and intersect them. It would be much more efficient to evaluate the eq() check first and then look up dgraph.type for only the one or very few UIDs it returns.
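
To make the cost difference concrete, here is a toy Python model. It is not Dgraph's actual implementation: it assumes the type index is a sorted UID list intersected with a plain linear merge, and that a per-UID type lookup is a single dictionary read. The names `intersect_sorted` and `filter_by_lookup` are made up for illustration, and the data is scaled down to 5,000 Users and 4,000 Tasks.

```python
def intersect_sorted(a, b):
    """Merge-intersect two sorted UID lists; returns (matches, merge steps)."""
    i = j = steps = 0
    out = []
    while i < len(a) and j < len(b):
        steps += 1
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out, steps


def filter_by_lookup(candidates, type_of, wanted):
    """Check the type of each candidate UID directly; returns (matches, lookups)."""
    out = [uid for uid in candidates if type_of.get(uid) == wanted]
    return out, len(candidates)


# Toy data (assumption): UIDs 1..5000 are Users, 5001..9000 are Tasks.
type_of = {uid: "User" for uid in range(1, 5001)}
type_of.update({uid: "Task" for uid in range(5001, 9001)})
user_index = sorted(uid for uid, t in type_of.items() if t == "User")
eq_matches = [4321]  # the single UID matched by the eq() root function

by_intersection, merge_steps = intersect_sorted(user_index, eq_matches)
by_lookup, lookups = filter_by_lookup(eq_matches, type_of, "User")
```

Both strategies return the same UID, but the merge walks thousands of index entries while the lookup touches one. A real engine can skip ahead with binary search or block-level seeks, so the linear merge overstates the gap; the point is only that the work should scale with the small eq() result, not with the size of the type index.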

Why wasn't it great, with examples.

Ran slow.

Additional information.

Workarounds:

Perform the more restrictive sub-query first and assign the result to a var, then do the type check later, e.g.

x as var(func: eq(someId, "234324"))
q(func: uid(x)) @filter(type(T)) { ... }

should run fast.
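
As a fuller sketch, the two-block form of this workaround could be assembled as a complete query string like this. The helper name `build_type_last_query` is hypothetical, the `uid` field in the result body is just a placeholder selection, and the function only accepts plain alphanumeric tokens so that the sketch does not have to deal with DQL escaping.

```python
def build_type_last_query(predicate, value, type_name):
    """Return a DQL string that resolves eq() first and applies type() last."""
    for part in (predicate, value, type_name):
        # Keep the sketch safe: only plain alphanumeric tokens are accepted.
        if not part.isalnum():
            raise ValueError(f"unsupported token: {part!r}")
    return (
        "{\n"
        f'  x as var(func: eq({predicate}, "{value}"))\n'
        f"  q(func: uid(x)) @filter(type({type_name})) {{\n"
        "    uid\n"
        "  }\n"
        "}"
    )


query = build_type_last_query("someId", "234324", "T")
```

The resulting string can be sent with whichever Dgraph client you already use. For untrusted input, query variables would be a better fit than string building.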

Alternatively: use a predicate that is unique to a particular type to identify that type without calling the type() function. E.g. q(func: has(userName)) @filter(<other conditions>) {...} will run faster, provided only User nodes have the userName predicate and all User nodes reliably have it.

amaster507 commented 1 year ago

Ref: https://discuss.dgraph.io/t/rfc-proposal-for-change-in-type-system/17983/3?u=amaster507

damonfeldman commented 1 year ago

Near duplicate (or duplicate) of #8502

damonfeldman commented 1 year ago

Notes from the Discuss thread. One idea is to use "edge" processing to figure out types instead of using indexes. Another, from mrjain, is to push this decision into query planning, so that a planner weighs edges vs. indices vs. other approaches up front rather than doing the lookups at a later phase of processing. I like the query-planning approach: if we later add new data structures (ontologies, some fast distributed lookup of UID->type other than a posting list, etc.), the query planner can pick them up.
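
The query-planning idea can be sketched as choosing the cheapest strategy from estimated cardinalities. This is a toy model, not a description of Dgraph's planner: the function name, the strategy names, and the `POINT_LOOKUP_COST` constant are all invented for illustration, and the costs are crude unit counts.

```python
# Assumption: one random per-UID read costs about as much as scanning
# 20 sequential index entries. The number is made up.
POINT_LOOKUP_COST = 20


def plan_type_filter(root_estimate, type_index_size):
    """Pick a strategy for applying type(T) to the result of a root function.

    root_estimate: estimated number of UIDs the root function returns.
    type_index_size: number of UIDs in the dgraph.type index entry for T.
    """
    costs = {
        # Read the whole index entry for T and intersect it with the root result.
        "intersect_type_index": type_index_size + root_estimate,
        # Read dgraph.type individually for each UID in the root result.
        "lookup_type_per_uid": POINT_LOOKUP_COST * root_estimate,
    }
    return min(costs, key=costs.get)


selective = plan_type_filter(root_estimate=1, type_index_size=5_000_000)
broad = plan_type_filter(root_estimate=4_000_000, type_index_size=5_000_000)
```

With a single-UID root result the per-UID lookup wins, while a root result of millions of UIDs favors the index intersection. New lookup structures would slot in as additional entries in `costs`.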

github-actions[bot] commented 3 months ago

This issue has been stale for 60 days and will be closed automatically in 7 days. Comment to keep it open.

harshil-goel commented 2 months ago

This has been fixed in v24