erik-overdahl commented 5 years ago

Problem

We have a graph in which locations are connected by services. Services that have common key-values are connected by service paths. We want to find all the service paths we can use to get from Location A to Location Z.

The following query matches services that go directly from A to Z and service paths that take one hop to get from A to Z:

MATCH p=(origin:location)-[:route]->(:service)-[:service_path*0..1]->
        (:service)-[:route]->(destination:location)
WHERE origin.name='A' AND destination.name='Z'
RETURN p;

and runs fine.

But if we expand the search to service paths that may take two hops between A and Z:

MATCH p=(origin:location)-[:route]->(:service)-[:service_path*0..2]->
        (:service)-[:route]->(destination:location)
WHERE origin.name='A' AND destination.name='Z'
RETURN p;

the query runs forever.

It never times out or crashes the server - it just runs continuously.

However, if we make the variable-length part of the query bidirectional:

MATCH p=(origin:location)-[:route]->(:service)-[:service_path*0..2]-
        (:service)-[:route]->(destination:location)
WHERE origin.name='A' AND destination.name='Z'
RETURN p;

The same query that ran forever now executes instantaneously (~30ms on a dev database with default Postgres config).

More info

Behavior in different versions

This problem occurs in AgensGraph 2.2dev (cloned from GitHub). In Agens 2.1.0 , the first query -[:service_path*0..1]-> still works fine, but the broken query -[:service_path*0..2]-> AND the version that works in 2.2dev, -[:service_path*0..2]- result in an error:

ERROR:  btree index keys must be ordered by attribute

This leads us to believe that the problem is related to this commit, which was included as a bug fix in Agens 2.1.1

fix: Make re-scanning inner scan work

VLE threw "btree index keys must be ordered by attribute" because re-scanning inner scan is not properly done in some cases. A regression test is added to check it.

Duplicate paths returned, endlessly

In AgensBrowser v1.0, we are able to get results out using LIMIT on the end of the broken query. The query always returns the maximum number of rows, but the resulting graph is very sparse. Direct paths and paths with one hop show up, but only one longer path appears.

In the result set, the shorter paths are returned in individual records as expected, but first occurrence of a two-hop path is duplicated for the rest of the rows.

If we return some collection along with the paths, like RETURN p, nodes(p) LIMIT 100, the query again runs infinitely.

(Interestingly, this row duplication also occurs for a slightly different query in which we used the bidirectional fix, but the entire expected graph was returned. This may deserve its own post.)

The EXPLAIN plans are identical for the one-hop and two-hop queries

We could not compare EXPLAIN ANALYZE (because you can't ANALYZE a query that never finishes running) but the query plans were exactly identical between all of the queries - those that ran and those that didn't.

Increased logging revealed nothing

We set the logging level for Postgres to DEBUG5, the highest level, and ran the infinitely-running query. The logs showed nothing amiss.

Is this a bug or a problem with our data model?

joefagan commented 5 years ago

Hi Erik,

I'm not sure what you mean by a zero length path in [0..2]. Instead of [0..2] can you please replace with

[*1..2] or [*..2] `

Also perhap you can break up the match statement with something like MATCH (origin:location)-[r:route]->(s_origin:service) WHERE origin.name='A' WITH origin, r, s_origin MATCH p=(s_origin)-[:service_path*..2]->(s_destinaton:service) WITH origin, r, s_origin, s_destination MATCH (s_destination)-[:route]->(destination:location) WHERE destination.name='Z' ...`

erik-overdahl commented 5 years ago

The zero length path in [*0..2] is intended to match the scenario in which there is a direct service between 'A' and 'Z'. We have used this syntax in other queries without issue, and it appears to be a valid Cypher statement: see here and here. Using [*..2] does not return this path.

Although you are correct that the query can be broken out into multiple WITH... MATCH clauses, this is much more verbose, even for this simple query, and is not desirable for more complex queries.

The infinite run time with no time-out and the endless duplication of rows with no errors both seem like bugs.

Moreover, the fact that the query executed when the path was bidirectional, rather than directed, is counter-intuitive to our knowledge of Cypher.

erik-overdahl commented 5 years ago

I ran the query with -[*2..3]-> and got another infinite run-time, but immediate execution with a bidirectional -[*2..3]-. The zero-length path does not seem to be the issue.

EDIT

When I use a zero-length VLE, I can get proper results if the VLE is bidirectional.

MATCH p1=(:service)-[:service_path*0..]->(:service) 
RETURN p1;

[SUCCESS] return 1000 rows (cols=1), graph(nodes=17, edges=1)

MATCH p1=(:service)-[:service_path*0..]-(:service) 
RETURN p1;

[SUCCESS] return 64 rows (cols=1), graph(nodes=32, edges=16)

The zero-length edge should match a node to itself if the vlabels match - so (a:foo)-[*0..]-(b:foo) would return a row where a=b. This doesn't happen the vlabels don't match - in (a:foo)-[*0..]-(b:bar), the zero-length path is meaningless as a will never be the same as b.

I think that what happens with the zero-length path is that something gets messed up with the fact that there is no direction for a reflexive relationship. But I don't know why this would show up as matching the 1-length paths in an infinite loop.

This also doesn't touch the problem of infinite run-times for directed VLEs with length > 0.

erik-overdahl commented 5 years ago

It looks like the use of a VLE in a path that also has other paths breaks for some distances in the VLE.

Output is from AgensBrowser Web v1.0.

This runs fine

MATCH p=(origin:location)-[*0..]->(:service{in_group:1})
WHERE origin.name='A'
RETURN p;

[SUCCESS] return 584 rows (cols=1), graph(nodes=37, edges=90)

This does not - a very limited portion of the graph is returned, with the first path of the second-longest length being duplicated in the rest of the rows.

MATCH p1=(origin:location)-[:route]->(:service{in_group:1})-[:service_path*0..]->(g)
WHERE origin.name='A'
RETURN p1;

[SUCCESS] return 1000 rows (cols=1), graph(nodes=3, edges=2)

But if I remove the direction on the VLE, it runs fine.

MATCH p1=(origin:location)-[:route]->(:service{in_group:1})-[:service_path*0..]->(g)
WHERE origin.name='A'
RETURN p1;

[SUCCESS] return 8 rows (cols=1), graph(nodes=9, edges=8)

Matching a VLE of length >= 1 also runs fine in this example.

MATCH p1=(origin:location)-[:route]->(:service{in_group:1})-[:service_path*1..]->(g)
WHERE origin.name='A'
RETURN p1;

[SUCCESS] return 4 rows (cols=1), graph(nodes=9, edges=8)

However, as you can see from my previous comment, this behavior is not consistent.

Note that, with the way our data is structured, the first query should return vastly more information than the second.

erik-overdahl commented 5 years ago

https://github.com/bitnine-oss/ldbc-snb-agensgraph/blob/master/src/main/java/net/bitnine/ldbcimpl/AGClient.java#L650

Looks like zero-length paths are implemented in AgensGraph

erik-overdahl commented 5 years ago

I have been able to replicate the bug with a completely different graph (which has a substantially greater volume of data), leading me to believe that it is an AgensGraph bug. The AgensGraph version is 2.1.1

In this larger graph, running the -[*0..2]-> query from AgensBrowser v1.0 forces the server to restart with the following error:

An I/O error occurred while sending to the backend.; nested exception is org.postgresql.util.PSQLException: An I/O error occurred while sending to the backend., SQL Code->08006, no graph

I'm currently circumventing the issue with a query that looks like

MATCH zero-length
UNION ALL
MATCH variable-length

but this runs much slower than the bidirectional -[*0..2]- query.

jrgemignani commented 4 years ago

There is a fix in review that should address this issue.

bitnine-oss / agensgraph

Variable-length path query runs forever but executes immediately when edge is bidrectional #493

Problem

More info

Is this a bug or a problem with our data model?