Closed mohamedawnallah closed 8 months ago
I think this is a good idea, so if you want, you can implement it. I think we looked at this before and decided to keep it simple, but if you measured an actual difference in performance, then it's well worth the change.
It would be good to have this as optional parameters so that we do not have to update all crawlers at once.
I've just attempted this optimization several times across various crawlers, including the atlas_probes.py
crawler, and it consistently appears to be a false positive in terms of optimization.
You mean it does not consistently makes relationship creation faster? I tried it with your Atlas measurement crawler and it seemed to get faster, but that might also be because some nodes were cached during the second repetition...
Technically, it should at least not hurt to have this functionality, although I discussed with Romain and we think that the current query should already be fast. We select on exact node ids, i.e., this should be a constant time lookup. But of course we don't know exactly how neo4j organizes the nodes, so maybe a label could still help.
You mean it does not consistently make relationship creation faster?
Yes. Maybe something I missed while doing my verification test if it optimizes or not is that I removed all nodes and relationships before starting the crawler this way when we're loading the crawler only the nodes and relationships related to the crawler (e.g: AtlasProbe
, ...etc) exist in the neo4j which makes the search space for both cases equal.
Technically, it should at least not hurt to have this functionality, although I discussed with Romain and we think that the current query should already be fast. We select on exact node ids, i.e., this should be a constant time lookup. But of course, we don't know exactly how neo4j organizes the nodes, so maybe a label could still help.
Yes, you are alright. It doesn't hurt and may be using a label to optimize the query 👍
I'm gonna submit a PR for the atlas_probes
crawler after adding node labels to batch_add_links
and will try to do correct benchmarking to optimize the query more.
Hey,
when reviewing your PR I actually found that there is a PROFILE
command, which not only gives the query plan, but the number of database accesses as well. I tried for an example query with and without node labels:
PROFILE MATCH (x), (y)
WHERE ID(x) = 731
AND ID(y) = 1055
CREATE (x)-[r:TEST]->(y)
RETURN x, r, y
Planner COST
Runtime SLOTTED
Runtime version 5.0
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| Operator | Id | Details | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| +ProduceResults | 0 | x, r, y | 1 | 1 | 5 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +Create | 1 | (x)-[r:TEST]->(y) | 1 | 1 | 2 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +CartesianProduct | 2 | | 1 | 1 | 0 | 0/0 |
| |\ +----+----------------------------+----------------+------+---------+------------------------+
| | +NodeByIdSeek | 3 | y WHERE id(y) = $autoint_1 | 1 | 1 | 1 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +NodeByIdSeek | 4 | x WHERE id(x) = $autoint_0 | 1 | 1 | 1 | 0/0 |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
Total database accesses: 9, total allocated memory: 64
PROFILE MATCH (x:AS), (y:AS)
WHERE ID(x) = 731
AND ID(y) = 1055
CREATE (x)-[r:TEST]->(y)
RETURN x, r, y
Planner COST
Runtime SLOTTED
Runtime version 5.0
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| Operator | Id | Details | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| +ProduceResults | 0 | x, r, y | 0 | 1 | 5 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +Create | 1 | (x)-[r:TEST]->(y) | 0 | 1 | 2 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +CartesianProduct | 2 | | 0 | 1 | 0 | 0/0 |
| |\ +----+----------------------------+----------------+------+---------+------------------------+
| | +Filter | 3 | y:AS | 0 | 1 | 1 | 0/0 |
| | | +----+----------------------------+----------------+------+---------+------------------------+
| | +NodeByIdSeek | 4 | y WHERE id(y) = $autoint_1 | 0 | 1 | 1 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +Filter | 5 | x:AS | 0 | 1 | 1 | 0/0 |
| | +----+----------------------------+----------------+------+---------+------------------------+
| +NodeByIdSeek | 6 | x WHERE id(x) = $autoint_0 | 1 | 1 | 1 | 0/0 |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
Total database accesses: 11, total allocated memory: 64
As you can see, the NodeByIdSeek
is actually performed first and (by design) only yields one result, i.e., now I think that adding labels does not have any chance to increase performance... Sorry I should have checked this a bit sooner.
It is good to know this, Thanks! Also I see the first query without the node labels seems to be more optimized as it accesses the database fewer times (9 accesses) compared to the second query (11 accesses). This difference is mainly due to the filtering step on labels in the second query.
I am gonna close the associated PRs since we no longer need these changes.
Description
The current method
batch_add_links
uses a generic node matching approach. There's a chance to optimize queries by specifying specific node labels within the MATCH clause.Proposed Solution
The change proposed is to use
MATCH (x: {node_label_1}), (y: {node_label_2}))
instead ofMATCH (x), (y)
bypassingnode_label_1
andnode_label_2
as parameters to thebatch_add_links
method.Importance of the Change
This seemingly small modification has a significant impact due to several reasons:
SHOW Indices
can be used (tested on5.15.0
). Other versions may have slightly different syntax or functionality. Additionally, usingEXPLAIN
orPROFILE
with the query shows how the query engine plans and executes them.atlas_probes.py
crawler, it was about 25% faster, and this difference becomes more pronounced at larger scales.Considered Alternatives
I'm not sure about alternative approaches at the moment.
Additional Context
Why does the node's label affect the query performance significantly in Neo4j? (Stack Overflow thread)