blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

Strange behaviour with aggregate function #213

Open johnnymoretti opened 2 years ago

johnnymoretti commented 2 years ago

Hi, I'm using a local instance of Blazegraph 2.1.6 RC and I've noticed that if I submit a query with an aggregate function like GROUP_CONCAT, it hangs unless I add a LIMIT clause.

The following query hangs:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma

This other one does not hang:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma
LIMIT 1000000000

Obviously my dataset is much smaller than 1000000000: it contains 138245 triples.

Any suggestions?

thompsonbry commented 2 years ago

A few thoughts from Michael:

Tough one… best guess: a hash code collision issue, where we are doing materialization at a different time in the LIMIT case (and hash codes over materialized vs. unmaterialized values somehow differ)?

There is logic in AST2BOp that chooses between using a materializing iterator and materializing explicitly via an operator, depending on LIMIT and order-preservation requirements. I'd guess that this choice produces different plan structures for the non-LIMIT vs. the LIMIT case, which triggers the hash problem.

You could likely verify this by comparing the generated plans and then drilling into the actual hash codes being used in the tail segment of the plan for the variant without the LIMIT.
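One way to compare the plans is Blazegraph's explain mode. A minimal sketch for fetching the explain output of the non-LIMIT variant (this assumes the default NanoSparqlServer endpoint path and the documented explain URL parameter; adjust both to your installation):

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Fetches the explain output (an HTML page describing the generated query
// plan) for the non-LIMIT variant of the query from a local Blazegraph.
public class ExplainQuery {
    public static void main(final String[] args) throws Exception {
        final String query =
                "PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>\n"
              + "SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr; separator=\", \") AS ?wrs)\n"
              + "WHERE { ?lemma ontolex:writtenRep ?wr } GROUP BY ?lemma";
        // Default local endpoint; may differ depending on how Blazegraph
        // is deployed (namespace, port, context path).
        final String url = "http://localhost:9999/blazegraph/namespace/kb/sparql"
                + "?explain=true&query="
                + URLEncoder.encode(query, StandardCharsets.UTF_8);
        final HttpResponse<String> response = HttpClient.newHttpClient().send(
                HttpRequest.newBuilder(URI.create(url)).GET().build(),
                HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}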

Bryan

johnnymoretti commented 2 years ago

Thank you for your reply, and I understand the problem now. To correct this behavior and to clarify whether the problem is related to materialization, is there a simple way to force the materialization?

johnnymoretti commented 2 years ago

OK, it is definitely a materialization problem. I set the materializeProjectionInQuery variable in the AST2BOpContext class to false and thus implicitly forced the projection to be materialized outside the plan: this way the queries work. However, the underlying materialization issue remains open. Thanks a lot for your help.

johnnymoretti commented 2 years ago

I have another strange thing here. The following query now works with the previously mentioned trick:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma

but this other one gets stuck again:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr; separator=", ") AS ?wrs)
WHERE {
  ?lemma ontolex:writtenRep ?wr
} GROUP BY ?lemma
ORDER BY ?wrs

To fix this problem, I have to write the query like this:

PREFIX ontolex: <http://www.w3.org/ns/lemon/ontolex#>

SELECT * {
  SELECT ?lemma (GROUP_CONCAT(DISTINCT ?wr; separator=", ") AS ?wrs)
  WHERE {
    ?lemma ontolex:writtenRep ?wr
  } GROUP BY ?lemma
}
ORDER BY ?wrs

I think there is something wrong with the nesting optimization or with the handling of the ORDER BY clause.

Can you please help me?

thompsonbry commented 2 years ago

Adding Michael. Bryan

mschmidt00 commented 2 years ago

Looking at the first question again in more detail, I am attaching the two explains (for a data set with 20k triples, all having distinct subjects and objects but the same predicate). As suggested initially, there is a difference in the plans: the slow version (explain-wo-limit) contains a dedicated ChunkedMaterializationOp at the end, which clearly dominates the runtime.

[attachment: explain-w-limit]

[attachment: explain-wo-limit]

Looking at the code, I guess the problem happens here: https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata-rdf/src/java/com/bigdata/bop/rdf/join/ChunkedMaterializationOp.java#L375

We add the constructed TermIds (all with termId=0L, indicating that they are mocked) to the idToConstMap. I confirmed that all of them collide on the same hash code (which is 0 as well).
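To make the effect concrete, here is a minimal, self-contained illustration; MockTermId is a hypothetical stand-in for a mocked TermId, not the real class:

import java.util.HashMap;
import java.util.Map;

// When every key reports the same hash code, all entries of the map land in
// a single bucket, and each put/lookup degrades from O(1) to a scan of that
// bucket (a linked list, or a balanced tree on newer JDKs). This is what
// makes the tail ChunkedMaterializationOp dominate the runtime.
public class HashCollisionDemo {

    static final class MockTermId {
        final String cachedValue;
        MockTermId(final String v) { this.cachedValue = v; }
        // Mocked ids hash by termId == 0L, so everything collides on 0.
        @Override public int hashCode() { return 0; }
        @Override public boolean equals(final Object o) {
            return o instanceof MockTermId
                    && ((MockTermId) o).cachedValue.equals(cachedValue);
        }
    }

    public static void main(final String[] args) {
        final Map<MockTermId, String> idToConstMap = new HashMap<>();
        final long start = System.nanoTime();
        for (int i = 0; i < 100_000; i++) {
            final MockTermId id = new MockTermId("value-" + i);
            idToConstMap.put(id, id.cachedValue); // all puts hit one bucket
        }
        System.out.printf("100k colliding puts: %d ms%n",
                (System.nanoTime() - start) / 1_000_000);
    }
}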

Possible ideas for fixes would be:

1.) Create a decorator for the idToConstMap (see e.g. https://stackoverflow.com/questions/41461515/hashmap-implementation-that-allows-to-override-hash-and-equals-method).
2.) Wrap the keys into dummy objects that override hashCode(), refining the method to return a hash code different from 0 for TermIds that have termId 0L and a value set (i.e., just the hash code of the value); see the sketch below.
3.) Leave TermIds with termId 0L out of the map in the first place and handle them separately (iiuc, the main reason for using the hash map is de-duplication, which may be of limited value for constructed values anyway -- though I would consider this a less preferred option).
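A hedged sketch of option 2.) follows; all names below are illustrative, and the real TermId/IV API in Blazegraph differs:

import java.util.Objects;

// Hypothetical stand-in for the relevant parts of Blazegraph's TermId.
interface TermIdLike {
    long termId();
    Object cachedValue(); // null if no value has been materialized
}

// Key wrapper: mocked ids (termId == 0L with a cached value) hash by that
// value, so distinct constructed values land in distinct buckets.
final class TermIdKey {
    private final TermIdLike id;
    TermIdKey(final TermIdLike id) { this.id = id; }

    @Override public int hashCode() {
        if (id.termId() == 0L && id.cachedValue() != null)
            return id.cachedValue().hashCode(); // refine hash for mocked ids
        return Long.hashCode(id.termId());
    }

    @Override public boolean equals(final Object o) {
        if (!(o instanceof TermIdKey)) return false;
        final TermIdLike other = ((TermIdKey) o).id;
        if (id.termId() != other.termId()) return false;
        // Real ids are equal iff their termIds match; mocked ids compare
        // by their cached values.
        return id.termId() != 0L
                || Objects.equals(id.cachedValue(), other.cachedValue());
    }
}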

It may also be possible to change hashCode() in TermId directly, i.e., if termId==0L, always compute the hash code based on the cached value. While this would be the "clean" way to fix the issue, it sounds quite risky and may cause problems with other code paths: I cannot come up with a scenario where it would be problematic, and I believe it is conceptually the right change, but it is really difficult to assess the blast radius of such an invasive change, both in terms of correctness and performance.

mschmidt00 commented 2 years ago

Regarding your proposal ("I set the materializeProjectionInQuery variable in the AST2BOpContext class to false and thus implicitly forced the projection to be materialized outside the plan"): yes, I think that should be okay as well, but it is more of a workaround (and may negatively impact the performance of other queries).

mschmidt00 commented 2 years ago

Regarding your second question, here is the explain for the variant with ORDER BY (executed on some dummy data). As you can see, it contains two ChunkedMaterializationOps that are expensive, for the same reasons discussed above (with your change to materializeProjectionInQuery, there would probably be only one):

[attachment: explain-order-by-variant]

In light of this variant, my recommendation would be to fix the root cause (the hash collisions) instead of changing materializeProjectionInQuery.