blazegraph / database

Blazegraph High Performance Graph Database
GNU General Public License v2.0

PipelinedHashIndexAndSolutionSetJoinOp messes up SERVICE + MINUS #107

Open smalyshev opened 5 years ago

smalyshev commented 5 years ago

When running this query with LIMIT, MINUS does not seem to apply properly:

SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "Search";
                    wikibase:endpoint "www.wikidata.org";
                    mwapi:srsearch "zika haswbstatement:P31=Q13442814".
    ?title wikibase:apiOutput mwapi:title.
  }
  BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)
  MINUS { ?item wdt:P921 wd:Q202864. }
} 
# will work fine without LIMIT
LIMIT 10000

The difference is that without LIMIT, the query uses JVMSolutionSetHashJoinOp to perform the join, while with LIMIT it uses PipelinedHashIndexAndSolutionSetJoinOp (per usePipelinedHashJoin in AST2Bop.addSubgroup). That last one seems to not do the MINUS correctly.

smalyshev commented 5 years ago

Digging a bit more into it, in JVMPipelinedHashJoinUtility/acceptAndOutputSolutions this loop:

for (IBindingSet solution : solutions) {

                        // add solutions to the subquery into the hash index.
                        rightSolutions.add(solution);

                        /*
                         * we remove all mappings that generated at least one
                         * result from distinct set (which will be further
                         * processed later on); This is how we discover the set
                         * of distinct projections that did not join.
                         */
                        distinctProjectionBuffer.remove(solution.copy(getJoinVars()));

                        nResultsFromSubqueries.increment();
                    }

This loop is supposed to remove mappings from distinctProjectionBuffer that have matching solutions. In fact, it does not remove anything (distinctProjectionBuffer remains the same after the loop completes). This seems to happen because getJoinVars() is empty here (maybe this is a bug, because there is obviously a join over "item" happening), so it tries to remove an empty binding rather than the binding for "item" as intended. This in turn leads to all keys being registered as "not matched" in distinctProjectionsWithoutSubqueryResult, which means they pass through the MINUS join unaffected, essentially ignoring the MINUS.

It may be that the real bug is whatever is causing joinVars to be empty in this case, but I am not sure how to find what causes it yet, since I am not sure what the meaning of joinVars is. The other (correct) query also has joinVars empty in JVMSolutionSetHashJoinOp, but still manages to work fine and do the correct join.
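The no-op remove() can be illustrated outside Blazegraph with plain collections standing in for IBindingSet (the copy() helper, variable names, and sample bindings below are illustrative, not the real API):

```java
import java.util.*;

public class EmptyProjectionDemo {

    // Mimic IBindingSet.copy(vars): keep only the given variables.
    static Map<String, String> copy(Map<String, String> solution, Set<String> vars) {
        Map<String, String> projected = new HashMap<>();
        for (String v : vars) {
            if (solution.containsKey(v)) projected.put(v, solution.get(v));
        }
        return projected;
    }

    public static void main(String[] args) {
        // distinctProjectionBuffer holds projections keyed on ?item.
        Set<Map<String, String>> distinctProjectionBuffer = new HashSet<>();
        distinctProjectionBuffer.add(Map.of("item", "wd:Q1"));

        Map<String, String> subquerySolution =
                Map.of("item", "wd:Q1", "p", "wdt:P921");

        // With joinVars empty, the projection is the empty binding set,
        // which matches nothing in the buffer -- the remove() is a no-op.
        distinctProjectionBuffer.remove(copy(subquerySolution, Set.of()));
        System.out.println(distinctProjectionBuffer.size()); // still 1

        // With projectInVars = {item}, the matching projection is removed.
        distinctProjectionBuffer.remove(copy(subquerySolution, Set.of("item")));
        System.out.println(distinctProjectionBuffer.size()); // now 0
    }
}
```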

thompsonbry commented 5 years ago

Adding Michael. Bryan


smalyshev commented 5 years ago

@mschmidt00 ?

thompsonbry commented 5 years ago

Michael Schmidt. He wrote the pipelined hash join code.

Bryan

beebs-systap commented 5 years ago

Yes, @mschmidt00.

smalyshev commented 5 years ago

Any thoughts on this one?

smalyshev commented 5 years ago

ping?

thompsonbry commented 5 years ago

I asked Michael to take a look, and he did take a brief look. I suggest the two of you follow up.

It would definitely help to provide plain Blazegraph test cases for bugs whenever possible. That makes it much easier to reproduce and analyze problems. Otherwise, the time spent developing a test case reproducer can be very significant for us.

Bryan


smalyshev commented 5 years ago

I am sorry, I understand it's hard to debug it this way. I can reproduce it with a clean install of WDQS and a small data set - so if you'd like, I can set up a reproduction on Cloud VPS, and you and Michael could have a login there. I haven't found a way to reproduce it without SERVICE, since that produces a different query plan. I'll see whether I can do the same with a federated SERVICE call.

thompsonbry commented 5 years ago

Another technique to capture these problems is to enable the solutions logger in log4j.properties. This shows everything flowing into/out of each operator. That can give you the data you need to write a unit test of a specific operator or a subplan.

Bryan


thompsonbry commented 5 years ago

This section should be in the default log4j.properties file. You just uncomment one line to enable this logger.

    # Solutions trace (tab delimited file). Uncomment the next line to enable.
    # log4j.logger.com.bigdata.bop.engine.SolutionsLog=INFO,solutionsLog
    log4j.additivity.com.bigdata.bop.engine.SolutionsLog=false
    # log4j.appender.solutionsLog=org.apache.log4j.ConsoleAppender
    log4j.appender.solutionsLog=org.apache.log4j.FileAppender
    log4j.appender.solutionsLog.Threshold=ALL
    log4j.appender.solutionsLog.File=solutions.csv
    log4j.appender.solutionsLog.Append=true
    # I find that it is nicer to have this unbuffered since you can see what
    # is going on and to make sure that I have complete rule evaluation logs
    # on shutdown.
    log4j.appender.solutionsLog.BufferedIO=false
    log4j.appender.solutionsLog.layout=org.apache.log4j.PatternLayout
    log4j.appender.solutionsLog.layout.ConversionPattern=SOLUTION:\t%m


mschmidt00 commented 5 years ago

Hey Stas,

it strikes me there's something wrong in the way we're using projectInVars vs. joinVars. When filling the distinctProjectionBuffer, we use projectInVars to project the incoming solutions (see a few lines above the code you've been citing):

            // Take a distinct projection of the join variables.
            final IBindingSet bsetDistinct = chunk[i].copy(projectInVars);

Later, in the loop you identified, we're doing a projection on the joinVars:

                    distinctProjectionBuffer.remove(solution.copy(getJoinVars()));

While there is a (typically strong) correlation between joinVars and projectInVars, they are not necessarily the same (joinVars is a subset of projectInVars). I didn't validate it, but my guess would be that the latter should be changed to:

                    distinctProjectionBuffer.remove(solution.copy(projectInVars));

Could be worth trying that as a fix, but I'd definitely want to carefully validate this, as there might be some unexpected implications/complications/consequences. In any case, it is important that the bindings that produce results via the subquery are removed from the distinctProjectionBuffer, and the fact that this is not happening seems to be the root cause for the MINUS semantics being broken.

Could you please share the values of the following variables at the time when entering the loop you identified (the values I give below would reflect my guesses/expectations) -- if there are too many solutions, a few sample values would help as well:

With this data at hand, I may also be able to come up with a SPARQL standard only reproducer for the problem.

Some background re. your comment on joinVars: for hash joins, the join variables are defined as those variables from the LHS and RHS of the join that overlap and are guaranteed to be bound in both parts. The situation for the query above is that "BIND(IRI(CONCAT(STR(wd:), ?title)) AS ?item)" does not give a strong guarantee that the ?item variable will be bound (in the general case, BIND can fail and produce an unbound variable). I'm pretty sure this is why joinVars is actually empty, and it is expected to be that way.
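Under that definition, the join variables amount to a set intersection over the variables guaranteed bound on each side; a minimal sketch (the variable sets are illustrative for this query, not taken from the optimizer):

```java
import java.util.*;

public class JoinVarsDemo {

    // joinVars = variables that overlap AND are guaranteed bound on both sides.
    static Set<String> joinVars(Set<String> lhsDefinitelyBound,
                                Set<String> rhsDefinitelyBound) {
        Set<String> result = new HashSet<>(lhsDefinitelyBound);
        result.retainAll(rhsDefinitelyBound);
        return result;
    }

    public static void main(String[] args) {
        // LHS: ?title is definitely bound; ?item comes from a BIND that can
        // fail, so it is only maybe-bound and is excluded here.
        Set<String> lhs = Set.of("title");
        // RHS (the MINUS group): ?item is definitely bound.
        Set<String> rhs = Set.of("item");
        System.out.println(joinVars(lhs, rhs)); // [] -- empty, as observed
    }
}
```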

smalyshev commented 5 years ago

Tracing into acceptAndOutputSolutions, I see this:

Changing getJoinVars() to projectInVars changes which solutions are removed, but the end result is still the same, as if MINUS did not work. I'll dig into it with logs and see what's going on.

smalyshev commented 5 years ago

Tried to debug further and I see something strange: when I watch how execution proceeds, the first chunk is processed OK. It forms one bucket, since the key formed by JVMHashIndex.makeKey() has no variables (keyVars is empty). However, when the second chunk comes in, it goes straight to the dontRequireSubqueryEvaluation branch, because it reuses the same bucket. But this bucket is not right for this chunk - it contains the solutions for the previous chunk, and trying to match it against this chunk results in "no match" for every binding set, which of course means they all get into the result and MINUS does not work. This is after applying the projectInVars fix (which probably also needs to be applied to the HTree class for analytic queries, since it has similar code) - that fix makes the first chunk work OK, but subsequent chunks are still wrong. keyVars seems to be derived from joinVars in JVMHashJoinUtility, and as we know, joinVars is empty here. So I wonder if there should be some other code that avoids reusing the same bucket?

For reference, this is the code that looks up buckets in acceptAndOutputSolutions():

               /**
                 *  Find bucket in hash index for that distinct projection
                 * (bucket of solutions with the same join vars from the
                 *  subquery - basically a JOIN).
                 */
                final Bucket b = rightSolutions.getBucket(bsetDistinct);

Since rightSolutions has empty keyVars, getBucket always returns the same bucket. On the first chunk, it's empty, so the code proceeds to regular evaluation and everything is fine. On the second chunk, however, there is the bucket made from the first chunk, which looks completely wrong - that bucket would contain no solutions matching the second chunk. This also implies that having just one bucket after the first run is wrong (there should be 100 distinct buckets), but that does not hurt anything as long as those buckets are not used yet. I think the bug is wherever keyVars ends up empty. Would be glad to hear your opinion on this.
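The single-bucket behavior described above can be sketched with an ordinary map standing in for JVMHashIndex (makeKey() and the sample solutions are illustrative, not the actual implementation):

```java
import java.util.*;

public class OneBucketDemo {

    // Mimic JVMHashIndex.makeKey(): the key is the solution restricted to keyVars.
    static Map<String, String> makeKey(Map<String, String> solution, Set<String> keyVars) {
        Map<String, String> key = new HashMap<>();
        for (String v : keyVars) {
            if (solution.containsKey(v)) key.put(v, solution.get(v));
        }
        return key;
    }

    public static void main(String[] args) {
        Set<String> keyVars = Set.of(); // empty, as observed in the issue

        Map<Map<String, String>, List<Map<String, String>>> index = new HashMap<>();
        // First chunk's solutions land in the bucket for the empty key...
        index.computeIfAbsent(makeKey(Map.of("item", "wd:Q1"), keyVars),
                              k -> new ArrayList<>()).add(Map.of("item", "wd:Q1"));
        // ...and a lookup for an unrelated solution from the second chunk
        // finds that same, non-empty bucket, so getBucket(bsetDistinct) != null
        // and the subquery is (wrongly) skipped.
        List<Map<String, String>> bucket =
                index.get(makeKey(Map.of("item", "wd:Q2"), keyVars));
        System.out.println(bucket != null); // true
        System.out.println(index.size());   // 1 -- a single shared bucket
    }
}
```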

smalyshev commented 5 years ago

Also interesting - there's this code in JVMHashIndex:

        if (keyVars == null) {

            /*
             * A ZERO LENGTH joinVars[] means that all solutions will be in the
             * same hash bucket. This can arise due to poor assignment of join
             * variables or simply because there are no available join variables
             * (full cross product join). Such joins are very expensive.
             */

            throw new IllegalArgumentException();

        }

which implies keyVars should never be empty? But in fact it is empty - just [], not null.
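A minimal stand-alone sketch of the distinction (guardFires() is illustrative, not the actual class): a zero-length array passes the null check, so the quoted guard never fires for empty-but-non-null keyVars.

```java
public class NullVsEmptyDemo {

    // Mirrors the shape of the guard in JVMHashIndex: only null is rejected.
    static boolean guardFires(String[] keyVars) {
        return keyVars == null;
    }

    public static void main(String[] args) {
        System.out.println(guardFires(null));          // true -- rejected
        System.out.println(guardFires(new String[0])); // false -- an empty
        // array slips through, even though the comment warns that zero join
        // variables put every solution into the same hash bucket
    }
}
```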

smalyshev commented 5 years ago

I did a small patch that uses Annotations.PROJECT_IN_VARS instead of HashJoinAnnotations.JOIN_VARS for PipelinedHashIndexAndSolutionSetJoinOp, and it seems to fix the problem (together with the projectInVars fix) for non-analytic queries. The analytic class has the same logic but different code, which requires separate handling. What I did is:

    --- a/bigdata-core/bigdata/src/java/com/bigdata/bop/join/JVMHashJoinUtility.java
    +++ b/bigdata-core/bigdata/src/java/com/bigdata/bop/join/JVMHashJoinUtility.java
    @@ -329,7 +329,7 @@ public class JVMHashJoinUtility implements IHashJoinUtility {
              * Otherwise use the [joinVars].
              */
             final IVariable<?>[] keyVars = filter ? (IVariable<?>[]) op
    -                .getProperty(JoinAnnotations.SELECT) : joinVars;
    +                .getProperty(JoinAnnotations.SELECT) : op.getKeyVars();
    --- a/bigdata-core/bigdata/src/java/com/bigdata/bop/join/PipelinedHashIndexAndSolutionSetJoinOp.java
    +++ b/bigdata-core/bigdata/src/java/com/bigdata/bop/join/PipelinedHashIndexAndSolutionSetJoinOp.java
    @@ -479,4 +479,7 @@ public class PipelinedHashIndexAndSolutionSetJoinOp extends HashIndexOp {
    
         } // ControllerTask
    
    +    public IVariable<?>[] getKeyVars() {
    +        return (IVariable<?>[]) getProperty(Annotations.PROJECT_IN_VARS);
    +    }
     }
    --- a/bigdata-core/bigdata/src/java/com/bigdata/bop/PipelineOp.java
    +++ b/bigdata-core/bigdata/src/java/com/bigdata/bop/PipelineOp.java
    @@ -34,6 +34,7 @@ import java.util.concurrent.FutureTask;
     import com.bigdata.bop.engine.BOpStats;
     import com.bigdata.bop.engine.IChunkMessage;
     import com.bigdata.bop.engine.QueryEngine;
    +import com.bigdata.bop.join.HashJoinAnnotations;
    
     /**
      * Abstract base class for pipeline operators where the data moving along the
    @@ -614,4 +615,7 @@ abstract public class PipelineOp extends BOpBase {
          */
         abstract public FutureTask<Void> eval(BOpContext<IBindingSet> context);
    
    +    public IVariable<?>[] getKeyVars() {
    +        return (IVariable<?>[]) getRequiredProperty(HashJoinAnnotations.JOIN_VARS);
    +    }
     }

no idea whether it's correct...

thompsonbry commented 5 years ago

I suspect that the correct behavior when there are no join variables is to map all solutions into the same bucket - same key.


smalyshev commented 5 years ago

But then the bucket is re-used here:

                /**
                 *  Find bucket in hash index for that distinct projection
                 * (bucket of solutions with the same join vars from the
                 *  subquery - basically a JOIN).
                 */
                final Bucket b = rightSolutions.getBucket(bsetDistinct);

                if (b != null ||
                    distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct)) {
                    /*
                     * Either a match in the bucket or subquery was already
                     * computed for this distinct projection but did not produce
                     * any results. Either way, it will take a fast path that
                     * avoids the subquery.
                     */
                    dontRequireSubqueryEvaluation.add(chunk[i]);

                } else {

and these solutions are emitted:

       if (!dontRequireSubqueryEvaluation.isEmpty()) {
            // compute join for the fast path.
            hashJoinAndEmit(dontRequireSubqueryEvaluation.toArray(
               new IBindingSet[0]), stats, out, joinConstraints, askVar);
        }
       if (distinctProjectionBuffer.isEmpty()) {
            // Nothing to do on the slow code path.
            return naccepted;
        }

without running the subquery, but the results in the bucket are from the previous subquery run with the first chunk's data. And since distinctProjectionBuffer would be empty (the previous chunk cleaned it up and the new chunk hasn't added anything), the subquery won't be evaluated for this chunk. Is that right? Am I missing something here?

smalyshev commented 5 years ago

I think I found a simple reproduction for it. Load the data from here: https://gist.github.com/smalyshev/b45a4dd739592361b413f5234de47dd7

Then run this:

SELECT DISTINCT ?item WHERE {
  hint:Query hint:optimizer "None".
  ?x <label> ?title .
  BIND(IRI(CONCAT("http://www.wikidata.org/entity/", ?title)) AS ?item)
  MINUS { 
    ?item <is> <bad> 
  }
} LIMIT 1000

It returns 199 results. Without the LIMIT, it correctly returns only one result.

thompsonbry commented 5 years ago

Adding Michael.

Bryan

mschmidt00 commented 5 years ago

On it -- confirmed the reproducer works for me, thanks Stas!

mschmidt00 commented 5 years ago

Hi Stas, as I understand it, using the key vars as join variables would not work -- the constraint on join variables is that they must be guaranteed to be bound.

I dug a little deeper and identified the existing problems:

1.) JVMPipelinedHashJoinUtility.java

1a.) distinctProjectionBuffer.remove(solution.copy(getJoinVars())); => getJoinVars() needs to be replaced by projectInVars, as discussed previously.

1b.) In the check

if (b != null || 
                distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct))

the b != null condition is not sufficient; we must also check whether distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct) -- i.e., consider the non-join variables (this is because the set of join variables is empty and the join is performed over the non-join variables).

2.) HTreePipelinedHashJoinUtility.java

2a.) distinctProjectionBuffer.remove(solution.copy(getJoinVars())); => getJoinVars() needs to be replaced by projectInVars, as discussed previously.

2b.) The code

leftSolutionsWithoutMatch.add(leftSolution); // unless proven otherwise

is in the wrong place. The idea of the code was to initialize the array once, in front of the doubly-nested loop (but by mistake the code ended up in an inner loop, where the solution is added to leftSolutionsWithoutMatch over and over again -- one witness is actually sufficient for it to be removed entirely).
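One reading of fix 2b can be sketched with plain strings standing in for solutions (joins() and the sample values are illustrative): seed leftSolutionsWithoutMatch once before the doubly-nested loop, and remove a solution as soon as one witness is found.

```java
import java.util.*;

public class WithoutMatchDemo {

    // Illustrative join test: two "solutions" join when they are equal.
    static boolean joins(String left, String right) {
        return left.equals(right);
    }

    // Corrected placement: seed the collection once, in front of the
    // doubly-nested loop, then remove a solution on its first witness.
    static Set<String> withoutMatch(List<String> leftSolutions,
                                    List<String> rightSolutions) {
        Set<String> leftSolutionsWithoutMatch = new HashSet<>(leftSolutions);
        for (String left : leftSolutions) {
            for (String right : rightSolutions) {
                if (joins(left, right)) {
                    // one witness is sufficient; no repeated add() in here
                    leftSolutionsWithoutMatch.remove(left);
                    break;
                }
            }
        }
        return leftSolutionsWithoutMatch;
    }

    public static void main(String[] args) {
        System.out.println(withoutMatch(List.of("a", "b"), List.of("a", "x")));
        // prints [b]: only "b" has no join partner
    }
}
```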

smalyshev commented 5 years ago

the b != null condition is not sufficient, we must also check whether distinctProjectionsWithoutSubqueryResult.contains(bsetDistinct) -- i.e., consider the non-join variables (this is because the join variables is empty and the join is performed over the non-join variables).

So, should the condition be converted to &&, or did I misunderstand this part?

The code leftSolutionsWithoutMatch.add(leftSolution); // unless proven otherwise is in the wrong place. The idea of the code was to initialize the array once, in front of the doubly-nested loop

But it uses the variable leftSolution, which is initialized inside the loop. So what should be used instead there?

smalyshev commented 5 years ago

@mschmidt00 I've made https://github.com/blazegraph/database/pull/113, which seems to fix the queries - could you please check that I understood you correctly?