InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0
39 stars 16 forks source link

Enhancement: Optimize Query Performance by Specifying Node Labels in `batch_add_links` Method #102

Closed mohamedawnallah closed 8 months ago

mohamedawnallah commented 9 months ago

Description

The current method batch_add_links uses a generic node matching approach. There's a chance to optimize queries by specifying specific node labels within the MATCH clause.

def batch_add_links(self, type, links, action='create'):
...
create_query = f"""WITH $batch AS batch
UNWIND batch AS link
    MATCH (x), (y) 
    WHERE ID(x) = link.src_id AND ID(y) = link.dst_id
    CREATE (x)-[l:{type}]->(y)
    WITH l, link
    UNWIND link.props AS prop
        SET l += prop """

if action == 'merge':
    create_query = f"""WITH $batch AS batch
    UNWIND batch AS link
        MATCH (x), (y)
        WHERE ID(x) = link.src_id AND ID(y) = link.dst_id
        MERGE (x)-[l:{type}]-(y)
        WITH l,  link
        UNWIND link.props AS prop
            SET l += prop """
...

Proposed Solution

The change proposed is to use MATCH (x: {node_label_1}), (y: {node_label_2})) instead of MATCH (x), (y) bypassing node_label_1 and node_label_2 as parameters to the batch_add_links method.

Importance of the Change

This seemingly small modification has a significant impact due to several reasons:

Considered Alternatives

I'm not sure about alternative approaches at the moment.

Additional Context

Why does the node's label affect the query performance significantly in Neo4j? (Stack Overflow thread)

m-appel commented 9 months ago

I think this is a good idea, so if you want, you can implement it. I think we looked at this before and decided to keep it simple, but if you measured an actual difference in performance, then it's well worth the change.

It would be good to have this as optional parameters so that we do not have to update all crawlers at once.

mohamedawnallah commented 9 months ago

I've just attempted this optimization several times across various crawlers, including the atlas_probes.py crawler, and it consistently appears to be a false positive in terms of optimization.

m-appel commented 9 months ago

You mean it does not consistently makes relationship creation faster? I tried it with your Atlas measurement crawler and it seemed to get faster, but that might also be because some nodes were cached during the second repetition...

Technically, it should at least not hurt to have this functionality, although I discussed with Romain and we think that the current query should already be fast. We select on exact node ids, i.e., this should be a constant time lookup. But of course we don't know exactly how neo4j organizes the nodes, so maybe a label could still help.

mohamedawnallah commented 9 months ago

You mean it does not consistently make relationship creation faster?

Yes. Maybe something I missed while doing my verification test if it optimizes or not is that I removed all nodes and relationships before starting the crawler this way when we're loading the crawler only the nodes and relationships related to the crawler (e.g: AtlasProbe, ...etc) exist in the neo4j which makes the search space for both cases equal.

Technically, it should at least not hurt to have this functionality, although I discussed with Romain and we think that the current query should already be fast. We select on exact node ids, i.e., this should be a constant time lookup. But of course, we don't know exactly how neo4j organizes the nodes, so maybe a label could still help.

Yes, you are alright. It doesn't hurt and may be using a label to optimize the query 👍

mohamedawnallah commented 9 months ago

I'm gonna submit a PR for the atlas_probes crawler after adding node labels to batch_add_links and will try to do correct benchmarking to optimize the query more.

m-appel commented 8 months ago

Hey, when reviewing your PR I actually found that there is a PROFILE command, which not only gives the query plan, but the number of database accesses as well. I tried for an example query with and without node labels:

PROFILE MATCH (x), (y)
WHERE ID(x) = 731
AND ID(y) = 1055
CREATE (x)-[r:TEST]->(y)
RETURN x, r, y
Planner COST

Runtime SLOTTED

Runtime version 5.0

+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| Operator          | Id | Details                    | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| +ProduceResults   |  0 | x, r, y                    |              1 |    1 |       5 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +Create           |  1 | (x)-[r:TEST]->(y)          |              1 |    1 |       2 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +CartesianProduct |  2 |                            |              1 |    1 |       0 |                    0/0 |
| |\                +----+----------------------------+----------------+------+---------+------------------------+
| | +NodeByIdSeek   |  3 | y WHERE id(y) = $autoint_1 |              1 |    1 |       1 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +NodeByIdSeek     |  4 | x WHERE id(x) = $autoint_0 |              1 |    1 |       1 |                    0/0 |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+

Total database accesses: 9, total allocated memory: 64
PROFILE MATCH (x:AS), (y:AS)
WHERE ID(x) = 731
AND ID(y) = 1055
CREATE (x)-[r:TEST]->(y)
RETURN x, r, y
Planner COST

Runtime SLOTTED

Runtime version 5.0

+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| Operator          | Id | Details                    | Estimated Rows | Rows | DB Hits | Page Cache Hits/Misses |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+
| +ProduceResults   |  0 | x, r, y                    |              0 |    1 |       5 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +Create           |  1 | (x)-[r:TEST]->(y)          |              0 |    1 |       2 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +CartesianProduct |  2 |                            |              0 |    1 |       0 |                    0/0 |
| |\                +----+----------------------------+----------------+------+---------+------------------------+
| | +Filter         |  3 | y:AS                       |              0 |    1 |       1 |                    0/0 |
| | |               +----+----------------------------+----------------+------+---------+------------------------+
| | +NodeByIdSeek   |  4 | y WHERE id(y) = $autoint_1 |              0 |    1 |       1 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +Filter           |  5 | x:AS                       |              0 |    1 |       1 |                    0/0 |
| |                 +----+----------------------------+----------------+------+---------+------------------------+
| +NodeByIdSeek     |  6 | x WHERE id(x) = $autoint_0 |              1 |    1 |       1 |                    0/0 |
+-------------------+----+----------------------------+----------------+------+---------+------------------------+

Total database accesses: 11, total allocated memory: 64

As you can see, the NodeByIdSeek is actually performed first and (by design) only yields one result, i.e., now I think that adding labels does not have any chance to increase performance... Sorry I should have checked this a bit sooner.

mohamedawnallah commented 8 months ago

It is good to know this, Thanks! Also I see the first query without the node labels seems to be more optimized as it accesses the database fewer times (9 accesses) compared to the second query (11 accesses). This difference is mainly due to the filtering step on labels in the second query.

I am gonna close the associated PRs since we no longer need these changes.