HazyResearch / ddlog

Compiler for writing DeepDive applications in a Datalog-like language — ⚠️🚧🛑 REPO MOVED TO DEEPDIVE 👇🏿
https://github.com/HazyResearch/deepdive/tree/master/compiler/ddlog
19 stars 4 forks source link

ddlog introducing unnecessary grounding complexity? #72

Closed zifeishan closed 8 years ago

zifeishan commented 8 years ago

ddlog requires creating a new variable table and join it with other tables every time, thus introducing unnecessary complexity in grounding. Here are some query examples:

In these examples, locations_el is the original variable table, link is the variable table compiled by ddlog.

Factor: "two locations mentioned in a same doc mapped to near locations"

Query compiled by ddlog:

SELECT R0.id AS "link.R0.id" ,
       R1.id AS "link.R1.id"
FROM link R0,
     link R1,
     locations_el R2,
     locations_el R3
WHERE R1.doc_id = R0.doc_id
  AND R2.doc_id = R0.doc_id
  AND R2.mention_id = R0.mention_id
  AND R2.loc_id = R0.loc_id
  AND R3.doc_id = R0.doc_id
  AND R3.mention_id = R1.mention_id
  AND R3.loc_id = R1.loc_id
  AND abs(R2.latitude - R3.latitude) < 20
  AND abs(R2.longitude - R3.longitude) < 20
  AND R0.mention_id < R1.mention_id
  AND earth_distance(R2.latitude, R2.longitude, R3.latitude, R3.longitude) < 200

Ideal query without ddlog:

SELECT R0.id AS "locations_el.R0.id" ,
       R1.id AS "locations_el.R1.id"
FROM locations_el R0,
     locations_el R1
WHERE R1.doc_id = R0.doc_id
  AND R0.mention_id < R1.mention_id
  AND abs(R0.latitude - R1.latitude) < 20
  AND abs(R0.longitude - R1.longitude) < 20
  AND earth_distance(R0.latitude, R0.longitude, R1.latitude, R1.longitude) < 200

Factor: "one mention cannot map to two entities".

original ddlog:

@weight(-10)
link(doc_id, mid, loc_id1) ^ link(doc_id, mid, loc_id2) :-
  locations_el(doc_id, mid, loc_id1, _, _, _, _, _, _, _),
  locations_el(doc_id, mid, loc_id2, _, _, _, _, _, _, _),
  [loc_id1 < loc_id2].

Compiled query in ddlog:

SELECT R0.id AS "link.R0.id" ,
       R1.id AS "link.R1.id"
FROM link R0,
     link R1,
     locations_el R2,
     locations_el R3
WHERE R1.doc_id = R0.doc_id
  AND R1.mention_id = R0.mention_id
  AND R2.doc_id = R0.doc_id
  AND R2.mention_id = R0.mention_id
  AND R2.loc_id = R0.loc_id
  AND R3.doc_id = R0.doc_id
  AND R3.mention_id = R0.mention_id
  AND R3.loc_id = R1.loc_id
  AND R0.loc_id < R1.loc_id

Ideal query without ddlog:

SELECT R0.id AS "locations_el.R0.id" ,
       R1.id AS "locations_el.R1.id"
FROM locations_el R0,
     locations_el R1
WHERE R1.doc_id = R0.doc_id
  AND R1.mention_id = R0.mention_id
  AND R0.loc_id < R1.loc_id

Or even using the new variable table link, ideally this query do not need joining with other tables. Ideally I would write:

@weight(-10)
link(doc_id, mid, loc_id1) ^ link(doc_id, mid, loc_id2) :-
  [loc_id1 < loc_id2].

But it seems not supported since loc_id1 and loc_id2 do not have "bindings".

In summary, I am not sure if these kinds of flexibility is something we want to support, since it could break the design of ddlog. In cases when we need optimized performance, should we always consider using the original conf file?

@netj @alldefector

chrismre commented 8 years ago

No! We should fix this =)

On Fri, Jan 22, 2016 at 12:15 PM Zifei Shan notifications@github.com wrote:

ddlog requires creating a new variable table and join it with other tables every time, thus introducing unnecessary complexity in grounding. Here are some query examples:

In these examples, locations_el is the original variable table, link is the variable table compiled by ddlog. Factor: "two locations mentioned in a same doc mapped to near locations"

Query compiled by ddlog:

SELECT R0.id AS "link.R0.id" , R1.id AS "link.R1.id" FROM link R0, link R1, locations_el R2, locations_el R3 WHERE R1.doc_id = R0.doc_id AND R2.doc_id = R0.doc_id AND R2.mention_id = R0.mention_id AND R2.loc_id = R0.loc_id AND R3.doc_id = R0.doc_id AND R3.mention_id = R1.mention_id AND R3.loc_id = R1.loc_id AND abs(R2.latitude - R3.latitude) < 20 AND abs(R2.longitude - R3.longitude) < 20 AND R0.mention_id < R1.mention_id AND earth_distance(R2.latitude, R2.longitude, R3.latitude, R3.longitude) < 200

Ideal query without ddlog:

SELECT R0.id AS "locations_el.R0.id" , R1.id AS "locations_el.R1.id" FROM locations_el R0, locations_el R1 WHERE R1.doc_id = R0.doc_id AND R0.mention_id < R1.mention_id AND abs(R0.latitude - R1.latitude) < 20 AND abs(R0.longitude - R1.longitude) < 20 AND earth_distance(R0.latitude, R0.longitude, R1.latitude, R1.longitude) < 200

Factor: "one mention cannot map to two entities".

original ddlog:

@weight(-10) link(doc_id, mid, loc_id1) ^ link(doc_id, mid, loc_id2) :- locations_el(doc_id, mid, locid1, , , , , , , ), locations_el(doc_id, mid, locid2, , , , , , , ), [loc_id1 < loc_id2].

Compiled query in ddlog:

SELECT R0.id AS "link.R0.id" , R1.id AS "link.R1.id" FROM link R0, link R1, locations_el R2, locations_el R3 WHERE R1.doc_id = R0.doc_id AND R1.mention_id = R0.mention_id AND R2.doc_id = R0.doc_id AND R2.mention_id = R0.mention_id AND R2.loc_id = R0.loc_id AND R3.doc_id = R0.doc_id AND R3.mention_id = R0.mention_id AND R3.loc_id = R1.loc_id AND R0.loc_id < R1.loc_id

Ideal query without ddlog:

SELECT R0.id AS "locations_el.R0.id" , R1.id AS "locations_el.R1.id" FROM locations_el R0, locations_el R1 WHERE R1.doc_id = R0.doc_id AND R1.mention_id = R0.mention_id AND R0.loc_id < R1.loc_id

Or even using the new variable table link, ideally this query do not need joining with other tables. Ideally I would write:

@weight(-10) link(doc_id, mid, loc_id1) ^ link(doc_id, mid, loc_id2) :- [loc_id1 < loc_id2].

But it seems not supported since loc_id1 and loc_id2 do not have "bindings".

In summary, I am not sure if these kinds of flexibility is something we want to support, since it could break the design of ddlog. In cases when we need optimized performance, should we always consider using the original conf file?

@netj https://github.com/netj @alldefector https://github.com/alldefector

— Reply to this email directly or view it on GitHub https://github.com/HazyResearch/ddlog/issues/72.

netj commented 8 years ago

We should certainly fix the compiler to get rid of this inefficiency. It seems all that needs to be done is to remove the joins with the variable tables, right? Earlier, we had related design questions over mapping DDlog language constructs to the variable tables, scoping rules, supervision rules, and inference rules that give rise to factors in DeepDive. The decisions we made then is probably suboptimal, overloading some constructs with others in a less obvious way, and it seems we should sit down again to carefully revise or add language constructs, so we can compile in a principled way while still keeping what needs to be written minimal.

@zifeishan On the last example, you're assuming the domain (or scope) of link() is somehow determined by another rule, right? I think ideally the rules should collectively give rise to the domain of these variables (or predicates), but it was a much larger departure from the deepdive.conf SQL we were targeting to compile, so we decided to keep the leap small. We ended up treating the supervision rules (with @label(...)) also as scoping rules that fill up the variable table first, then kept the inference rules to create factors only among them to prevent the compiled deepdive.conf from creating dangling factors. We can probably flip the order, ground the factors and variables that are rooted on other tables first, then ground the factors that aren't rooted, derive and ground the variables that appear on any factor, repeat until no change. But finding a clean DAG of processes upfront will be certainly better, so some more static analysis over DDlog should go into the compiler.