aws / graph-explorer

React-based web application that enables users to visualize both property graph and RDF data and explore connections between data without having to write graph queries.
https://github.com/aws/graph-explorer
Apache License 2.0
314 stars 46 forks source link

[Bug] Neighbor expansion is slow and unreliable #324

Open kmcginnes opened 4 months ago

kmcginnes commented 4 months ago

Community Note

Description

When performing neighbor expansion, the generated queries can be quite complex and resource intensive. Some create 50 or more network requests and can overwhelm the database server where the requests will begin to fail.

To Reproduce

Gremlin

I selected one node, and expanded 1 neighbor.

g.V("Keri Eastman").project("vertices", "edges")
  .by(both().hasLabel("post").dedup().range(0, 1).fold())
  .by(bothE().where(otherV().hasLabel("post")).dedup().fold())

This second query will be repeated for each neighbor being expanded. So if you have 50 neighbors, you would have 50 requests.

g.V("https://example.com/blogs/unlocking-the-power-of-relationships-how-graph-databases-are-revolutionizing-data-management/")
  .both().limit(500).dedup().group().by(label).by(count())

openCypher

Expanding an author with 6 posts:

MATCH (v)-[e]-(tgt:post) 
WHERE ID(v) = "Keri Eastman" 
WITH 
  collect(DISTINCT tgt)[..6] AS vObjects, 
  collect({edge: e, sourceType: labels(v), targetType: labels(tgt)})[..6] AS eObjects 
RETURN vObjects, eObjects
MATCH (v)-[e]-(tgt:post) 
WHERE ID(v) = "Keri Eastman" 
  AND ID(e) IN ["5E2E0417-FE9B-43E9-9D80-3D4F83AD90AC","92072F3D-7B73-4E58-8E4E-237B9866601F","2C7DABC9-074F-4BC9-B742-F178699FE5D9","B962DF33-D8C6-42DC-BE3D-4221F1189399","2CFFB4AF-F4E1-471A-AEDE-2EBE8EFA5C4C","B3D9257D-AE4C-4F38-B816-76ADBE6CBA8C"] 
WITH 
  collect(DISTINCT tgt)[..6] AS vObjects, 
  collect({edge: e, sourceType: labels(v), targetType: labels(tgt)})[..6] AS eObjects 
RETURN vObjects, eObjects

One request per neighbor, so 6 in this case:

MATCH (v) -[e]- (t) 
WHERE ID(v) = "https://example.com/blogs/unlocking-the-power-of-relationships-how-graph-databases-are-revolutionizing-data-management/" 
RETURN labels(t) AS vertexLabel, count(DISTINCT t) AS count 
LIMIT 500

SPARQL

Since the RDF data was different than PG data in the same Neptune instance, this is a single Airpot node expanding a single Country node. This resulted in 3 queries.

SELECT ?subject ?pred ?value ?subjectClass ?pToSubject ?pFromSubject {
  ?subject a     ?subjectClass;
           ?pred ?value {
    SELECT DISTINCT ?subject ?pToSubject ?pFromSubject {
      BIND(<http://kelvinlawrence.net/air-routes/resource/303> AS ?argument)
      VALUES ?subjectClass { <http://kelvinlawrence.net/air-routes/class/Country> }
      {
        ?argument ?pToSubject ?subject.
        ?subject a         ?subjectClass;
                 ?sPred    ?sValue .

      }
      UNION
      {
        ?subject ?pFromSubject ?argument;
                 a         ?subjectClass;
                 ?sPred    ?sValue .

      }
    }
    LIMIT 75 OFFSET 0
  }
  FILTER(isLiteral(?value))
}
SELECT ?subject ?subjectClass ?predToSubject ?predFromSubject {
  BIND(<http://kelvinlawrence.net/air-routes/resource/303> AS ?argument)
  VALUES ?subject { <http://kelvinlawrence.net/air-routes/resource/3505>}
  { 
    ?argument ?predToSubject ?subject.
    ?subject a ?subjectClass.
  }
  UNION
  { 
    ?subject ?predFromSubject ?argument;
             a                ?subjectClass.
  }
}
SELECT ?class (COUNT(?class) AS ?count) {
  ?subject a ?class {
    SELECT DISTINCT ?subject ?class {
      ?subject a ?class .
      { ?subject ?p <http://kelvinlawrence.net/air-routes/resource/3505> }
      UNION
      { <http://kelvinlawrence.net/air-routes/resource/3505> ?p ?subject }
    }

  }
}
GROUP BY ?class

Expected behavior

Each language should be optimized. The query seems to be retrieving data for multiple parts of the app (graph and sidebar). We can split those tasks to further optimize, deferring the side bar work until needed.

Related issues

Tasks

kmcginnes commented 4 months ago

Gremlin also seems to have issues when limiting the neighbors to expand in the side bar.

This is the Gremlin generated to find 4 movies (limited to a max of 4) connected to Sean Connery:

g.V(\"nm0000125\").
    project(\"vertices\", \"edges\").
      by(both().hasLabel(\"movie\").dedup().range(0, 4).fold()).
      by(bothE().where(otherV().hasLabel(\"movie\")).dedup().fold())

Which returns the following and includes a lot of extra edges (75 in all) as there is no limit placed on the second by

{
    'vertices': [v[tt0094226], v[tt0102034], v[tt0058150], v[tt0081633]], 
    'edges': [
        e[tt0094226-actor-nm0000125][tt0094226-actor->nm0000125], 
        e[tt0102034-actor-nm0000125][tt0102034-actor->nm0000125], 
        e[tt0058150-actor-nm0000125][tt0058150-actor->nm0000125], 
        e[tt0081633-actor-nm0000125][tt0081633-actor->nm0000125], 
        e[tt0107969-actor-nm0000125][tt0107969-actor->nm0000125], 
        e[tt0070948-actor-nm0000125][tt0070948-actor->nm0000125], 
        e[tt0145734-actor-nm0000125][tt0145734-actor->nm0000125], 
        e[tt0075784-actor-nm0000125][tt0075784-actor->nm0000125], 
        e[tt0097328-actor-nm0000125][tt0097328-actor->nm0000125], 
        e[tt0117500-actor-nm0000125][tt0117500-actor->nm0000125], 
        e[tt0113071-actor-nm0000125][tt0113071-actor->nm0000125], 
        e[tt0058754-actor-nm0000125][tt0058754-actor->nm0000125], 
        e[tt0066995-actor-nm0000125][tt0066995-actor->nm0000125], 
        e[tt0057076-actor-nm0000125][tt0057076-actor->nm0000125], 
        e[tt0055262-actor-nm0000125][tt0055262-actor->nm0000125], 
        e[tt0073796-actor-nm0000125][tt0073796-actor->nm0000125], 
        e[tt0084750-actor-nm0000125][tt0084750-actor->nm0000125], 
        e[tt0071877-actor-nm0000125][tt0071877-actor->nm0000125], 
        e[tt0051364-actor-nm0000125][tt0051364-actor->nm0000125], 
        e[tt0091605-actor-nm0000125][tt0091605-actor->nm0000125], 
        e[tt0084920-actor-nm0000125][tt0084920-actor->nm0000125], 
        e[tt0074962-actor-nm0000125][tt0074962-actor->nm0000125], 
        e[tt0066767-actor-nm0000125][tt0066767-actor->nm0000125], 
        e[tt0052722-actor-nm0000125][tt0052722-actor->nm0000125], 
        e[tt0099810-actor-nm0000125][tt0099810-actor->nm0000125], 
        e[tt0067315-actor-nm0000125][tt0067315-actor->nm0000125], 
        e[tt0104839-actor-nm0000125][tt0104839-actor->nm0000125], 
        e[tt0075147-actor-nm0000125][tt0075147-actor->nm0000125], 
        e[tt0079550-actor-nm0000125][tt0079550-actor->nm0000125], 
        e[tt0109920-actor-nm0000125][tt0109920-actor->nm0000125], 
        e[tt0118661-actor-nm0000125][tt0118661-actor->nm0000125], 
        e[tt0091203-actor-nm0000125][tt0091203-actor->nm0000125], 
        e[tt0083947-actor-nm0000125][tt0083947-actor->nm0000125], 
        e[tt0097576-actor-nm0000125][tt0097576-actor->nm0000125], 
        e[tt0082869-actor-nm0000125][tt0082869-actor->nm0000125], 
        e[tt0055928-actor-nm0000125][tt0055928-actor->nm0000125], 
        e[tt0058329-actor-nm0000125][tt0058329-actor->nm0000125], 
        e[tt0116136-actor-nm0000125][tt0116136-actor->nm0000125], 
        e[tt0181536-actor-nm0000125][tt0181536-actor->nm0000125], 
        e[tt0113501-actor-nm0000125][tt0113501-actor->nm0000125], 
        e[tt0073341-actor-nm0000125][tt0073341-actor->nm0000125], 
        e[tt0137494-actor-nm0000125][tt0137494-actor->nm0000125], 
        e[tt0086006-actor-nm0000125][tt0086006-actor->nm0000125], 
        e[tt0070468-actor-nm0000125][tt0070468-actor->nm0000125], 
        e[tt0062512-actor-nm0000125][tt0062512-actor->nm0000125], 
        e[tt0084011-actor-nm0000125][tt0084011-actor->nm0000125], 
        e[tt0054898-actor-nm0000125][tt0054898-actor->nm0000125], 
        e[tt0095897-actor-nm0000125][tt0095897-actor->nm0000125], 
        e[tt0073906-actor-nm0000125][tt0073906-actor->nm0000125], 
        e[tt0311429-actor-nm0000125][tt0311429-actor->nm0000125], 
        e[tt0066090-actor-nm0000125][tt0066090-actor->nm0000125], 
        e[tt0060414-actor-nm0000125][tt0060414-actor->nm0000125], 
        e[tt0063592-actor-nm0000125][tt0063592-actor->nm0000125], 
        e[tt0100530-actor-nm0000125][tt0100530-actor->nm0000125], 
        e[tt0079013-actor-nm0000125][tt0079013-actor->nm0000125], 
        e[tt0059800-actor-nm0000125][tt0059800-actor->nm0000125], 
        e[tt0059274-actor-nm0000125][tt0059274-actor->nm0000125], 
        e[tt0079240-actor-nm0000125][tt0079240-actor->nm0000125], 
        e[tt1527819-actor-nm0000125][tt1527819-actor->nm0000125]
    ]
}

here is how it could also be written (simpler and no extra edges)

g.V("nm0000125").
    bothE().as('e').
    otherV().
    hasLabel('movie').as('n').
    limit(4).
    select('n','e')

which returns

{'n': v[tt0094226], 'e': e[tt0094226-actor-nm0000125][tt0094226-actor->nm0000125]}
{'n': v[tt0102034], 'e': e[tt0102034-actor-nm0000125][tt0102034-actor->nm0000125]}
{'n': v[tt0081633], 'e': e[tt0081633-actor-nm0000125][tt0081633-actor->nm0000125]}
{'n': v[tt0058150], 'e': e[tt0058150-actor-nm0000125][tt0058150-actor->nm0000125]}