InternetHealthReport / internet-yellow-pages

A knowledge graph for Internet resources
GNU General Public License v3.0

Enhancement: Ensure Idempotence for `batch_create_nodes` Method in IYP Data Pipeline #89

Closed by mohamedawnallah 9 months ago

mohamedawnallah commented 9 months ago

Description

The batch_create_nodes method currently lacks idempotence.

Steps to Reproduce

  1. Download: Obtain any database dump from: https://exp1.iijlab.net/wip/iyp/dumps/
  2. Database Loading: Use either the iyp-loader Docker container or the neo4j-admin utility to load the dump into Neo4j.
  3. Run Pipeline: Execute a specific data pipeline (e.g., atlas_probes.py). It succeeds initially.
  4. Rerun Pipeline: Re-run the same pipeline; it fails and displays an error similar to:
    neo4j.exceptions.ConstraintError: {code: Neo.ClientError.Schema.ConstraintValidationFailed} {message: Node(9222548) already exists with label `AtlasProbe` and property `id` = 1}

Expected Behavior

The batch_create_nodes operation within the data pipeline should be idempotent, allowing successful execution regardless of whether the same or new data is used.
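
For reference, the failure can be reproduced outside the pipeline with the official Python neo4j driver. This is a minimal standalone sketch, not the iyp code; the connection details are placeholders and the constraint uses Neo4j >= 4.4 syntax:

from neo4j import GraphDatabase

driver = GraphDatabase.driver('neo4j://localhost:7687',
                              auth=('neo4j', 'password'))

with driver.session() as session:
    # Uniqueness constraint comparable to the one in the IYP database.
    session.run('CREATE CONSTRAINT IF NOT EXISTS FOR (n:AtlasProbe) '
                'REQUIRE n.id IS UNIQUE').consume()
    # First run succeeds.
    session.run('CREATE (n:AtlasProbe {id: 1})').consume()
    # Second run raises neo4j.exceptions.ConstraintError, because CREATE
    # always makes a new node. A MERGE on the same pattern would match
    # the existing node instead and is therefore idempotent.
    session.run('CREATE (n:AtlasProbe {id: 1})').consume()

driver.close()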

Recordings

Recordings Link

Additional Context

In the context of data pipelines, an operation is idempotent if applying it repeatedly yields the same result as applying it once. For a pipeline this means a re-run neither fails nor produces duplicate data.

References: Idempotence in Computer Science [Wikipedia]

romain-fontugne commented 9 months ago

I think the idempotent version of that method is https://github.com/InternetHealthReport/internet-yellow-pages/blob/b358b0630bcb6f3df8d99f7141fdbd66ff61766d/iyp/__init__.py#L185. Do we actually have a good reason to use https://github.com/InternetHealthReport/internet-yellow-pages/blob/b358b0630bcb6f3df8d99f7141fdbd66ff61766d/iyp/__init__.py#L291 instead of https://github.com/InternetHealthReport/internet-yellow-pages/blob/b358b0630bcb6f3df8d99f7141fdbd66ff61766d/iyp/__init__.py#L185? @m-appel

mohamedawnallah commented 9 months ago

I don’t think batch_create_nodes is an alternative to batch_get_nodes. batch_create_nodes creates a script’s main node label (e.g., AtlasProbe), whereas batch_get_nodes fetches a subset of other node labels (e.g., IP, Prefix) whose properties intersect (e.g., the autonomous system number), so that relationships can be created with the script’s main node label (e.g., AtlasProbe).

The main problem happens when we create the same node type in Neo4j (e.g., AtlasProbe) multiple times. Let’s go through a typical scenario (one of several possible ones):

  1. A transient failure occurs while running a script (e.g., atlas_probes.py) and an error appears in the log file.
  2. We re-run the data pipeline manually (assuming the node label was not deleted from the database first); the pipeline then fails with the error above.

romain-fontugne commented 9 months ago

Ok, I see the difference now. Basically, batch_create_nodes allows creating nodes with multiple properties, while batch_get_nodes creates nodes with only a single property.
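
To make the contrast concrete, the queries issued by the two methods boil down to roughly the following (a paraphrase, not the exact iyp code):

// batch_create_nodes: CREATE with the full property map, not idempotent
UNWIND $batch AS probe
CREATE (n:AtlasProbe)
SET n += probe

// batch_get_nodes: MERGE on a single property, idempotent
UNWIND $batch AS id
MERGE (n:AtlasProbe {id: id})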

I would argue against changing the CREATE statement to MERGE in batch_create_nodes, and beyond that I would recommend removing batch_create_nodes entirely. Imagine we have two scripts that create AtlasProbe nodes. Currently they cannot both use batch_create_nodes because of the problem you mentioned. If we change the statement to MERGE and they both use batch_create_nodes, it will give unexpected results: if the two scripts add different properties, the MERGE statement will try to make new nodes (although a node with the same id may already exist). If there is a constraint in the database it will throw the same error, and if there is no constraint we will end up with two nodes that may represent the same thing. Both are undesirable.

I propose we change the atlas crawler to use batch_get_nodes to create/get nodes and batch_add_properties to add the extra properties. This looks less efficient but in practice I don't think it would take much longer.
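
A rough sketch of that two-step flow in the crawler; the argument names and return value here are assumptions, not the actual iyp signatures:

probes = [{'id': 1, 'asn_v4': 2497},
          {'id': 2, 'asn_v4': 701}]

# Step 1: get/create the nodes idempotently, MERGEing on the identifier only.
# Assumed to return a mapping from property value to Neo4j node ID.
probe_ids = iyp.batch_get_nodes('AtlasProbe', 'id',
                                set(p['id'] for p in probes))

# Step 2: attach the remaining properties to the node IDs from step 1.
iyp.batch_add_properties([(probe_ids[p['id']], {'asn_v4': p['asn_v4']})
                          for p in probes])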

And I want to stress that this change is important to make sure we don't have conflicts between crawlers.

m-appel commented 9 months ago

I was actually working on this yesterday, but I could not find a solution that I like. We currently have three functions that can create nodes:

  1. get_node: This function gets or creates a single node with multiple properties. The catch here is that we automatically extract the properties that have constraints, MERGE on them and add/overwrite the remaining properties.
  2. batch_get_nodes: As already mentioned, this function can only get/create nodes based on a single property. Although we do not enforce it, the expectation is that the single property is unique (this is how we use it, even though not all cases where we use it actually have a constraint, e.g., the url property of the URL node).
  3. batch_create_nodes: This function strictly creates multiple nodes, each of which can have multiple properties. I created this when I wrote the AtlasProbe crawler since batch_add_properties did not exist then, but I agree it's a bit dangerous.

What irked me a bit is that we can get nodes based on multiple properties with get_node, but not with any batch function. Is our assumption that every node has exactly one identifying property? There are a few examples of get_node where we create nodes with multiple properties; none of them are constrained, so I guess we treat the combination of them as the identifier? (see [0], [1], [2]) Currently, we cannot fetch these in batches without assuming that one of the properties is unique. (This is not a problem at the moment, since these are very few nodes, but still.)

So for the batch function we can either use the two-step approach that Romain proposed, or implement something similar to get_node, where the constrained property is automatically extracted, the query MERGEs on it, and the other properties are SET. But I don't think this would work with multiple constrained properties.

In the single-node case we can do

f"""MERGE (a:{label} {dict2str(constraint_prop)})
    SET a += {dict2str(prop)}
    RETURN ID(a)"""

so we just insert the constrained properties as a Cypher map. But I think we cannot do this dynamically in an UNWIND. If we have a single constrained property it would work, for example:

label = 'AtlasProbe'
constraint_prop = 'id'
batch = [{'id': 1, 'asn_v4': 2497},
         {'id': 2, 'asn_v4': 701}]

query = f"""UNWIND $batch AS batch
            MERGE (a:{label} {{{constraint_prop}: batch.{constraint_prop}}})
            SET a += batch"""
tx.run(query, batch=batch)

here the constraint_prop is set manually, but could also be automatically inferred like in get_node.

So, either two steps, or something like this.

EDIT: This is what Mohamed proposed in his PR :) I guess the question is: Does each node have exactly one identifying property?

EDIT EDIT: I think there is a way to achieve this with multiple properties, I'll try something.

romain-fontugne commented 9 months ago

What I like in the 'two steps' approach is that the identifier is more explicit in the crawlers' code. The other approach makes crawlers simpler to write, but requires updating the iyp library every time we add a new node type, which is really not clean...

For the two-step approach we could also 'officialise' that nodes have a single identifier per node type, and we could create constraints in batch_get_nodes to enforce it. Then we could remove some of the explicit constraints we have in __init__.py; they would be derived when the nodes are created. But then we should also enforce a parameter for the identifier property/value in get_node.
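
Deriving the constraint inside batch_get_nodes could be a single statement per label, along these lines (Neo4j >= 4.4 syntax; a sketch, not implemented code):

# Sketch: create the uniqueness constraint for the identifier property
# the first time nodes of this label are created.
query = (f'CREATE CONSTRAINT IF NOT EXISTS '
         f'FOR (n:{label}) REQUIRE n.{id_prop} IS UNIQUE')
tx.run(query)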

m-appel commented 9 months ago

Good timing, and yes I was thinking about something like this (making the signatures of both functions similar). Instead of doing an implicit MERGE on the constrained properties, let's add an "id property" parameter for both functions.

def get_node(label, id_prop, properties)

where label is the node label, id_prop is a list of properties (subset of keys of properties) that should be used as the identifier, and properties is just a dictionary including all properties added to the node. For example:

get_node('AtlasProbe', ['id'], {'id': 1, 'asn_v4': 2497})

would result in this query:

MERGE (a:AtlasProbe {id: 1})
SET a += {id: 1, asn_v4: 2497}

Similarly, for batch_get_nodes we can just give a list of properties:

properties = [{'id': 1, 'asn_v4': 2497},
              {'id': 2, 'asn_v4': 64496}]
batch_get_nodes('AtlasProbe', ['id'], properties)

would result in

UNWIND $props AS prop
MERGE (a:AtlasProbe {id: prop.id})
SET a += prop

For our current use case where we only work on one property, this would introduce a bit of overhead, as the list of properties would not only contain the values, but also the key for every entry. Alternatively, we can keep the current version as batch_get_nodes_by_single_prop or something, and only use this new version when necessary.
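
For reference, the multi-property MERGE map can be built dynamically from the id_prop list, in the same f-string style as above (a sketch; the final implementation may differ):

label = 'AtlasProbe'
id_prop = ['id']  # works the same with several identifying properties
properties = [{'id': 1, 'asn_v4': 2497},
              {'id': 2, 'asn_v4': 64496}]

# Builds e.g. 'id: prop.id' (or 'name: prop.name, country: prop.country').
merge_map = ', '.join(f'{p}: prop.{p}' for p in id_prop)

query = f"""UNWIND $props AS prop
            MERGE (a:{label} {{{merge_map}}})
            SET a += prop"""
tx.run(query, props=properties)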

romain-fontugne commented 9 months ago

Yes, that looks good to me 👍

romain-fontugne commented 9 months ago

Thanks a lot @mohamedawnallah for bringing this up. @m-appel fixed that in #91, this is a very good improvement. Thank you both!