VirtualFlyBrain / VFB_neo4j

A python package for writing schema-compliant content to VFB neo4J DBs
Apache License 2.0

Duplicate nodes in pdb.v4 #253

Open dosumis opened 2 years ago

dosumis commented 2 years ago

Does anyone know why we now have two nodes with the same iri in pdb.v4?

MATCH (c)
WHERE c.iri="http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp0118721"
RETURN c

There is only one node for this in the old pdb.
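
To check whether this is an isolated case, something like the following could list every iri shared by more than one node. It is only a sketch: the Cypher string has not been run against any VFB database, and `DUPLICATE_IRI_QUERY` and `find_duplicate_iris` are hypothetical names, not part of VFB_neo4j. The Python helper does the same grouping on rows that have already been exported as `(node id, iri)` pairs.

```python
from collections import defaultdict

# Cypher sketch (untested against the VFB DBs): every iri carried by 2+ nodes.
DUPLICATE_IRI_QUERY = (
    "MATCH (c) WHERE c.iri IS NOT NULL "
    "WITH c.iri AS iri, collect(id(c)) AS ids "
    "WHERE size(ids) > 1 "
    "RETURN iri, ids"
)


def find_duplicate_iris(rows):
    """Group (node_id, iri) pairs by iri; keep iris that have more than one node."""
    by_iri = defaultdict(list)
    for node_id, iri in rows:
        by_iri[iri].append(node_id)
    return {iri: sorted(ids) for iri, ids in by_iri.items() if len(ids) > 1}
```

Feeding it the two node ids reported in this issue, both paired with the same iri, would return that iri mapped to both ids.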

v4 has:

<id>: 545644
description: The sum of all cells at the intersection between the expression patterns of P{R74G01-GAL4.DBD} and P{R41G07-p65.AD}.
iri: http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp0118721
label: P{R74G01-GAL4.DBD} ∩ P{R41G07-p65.AD} expression pattern
short_form: VFBexp_FBtp0122237FBtp0118721
synonyms: SS00796
uniqueFacets: Expression_pattern,Split

<id>: 178844
curie: VFBexp:FBtp0122237FBtp0118721
description: The sum of all cells at the intersection between the expression patterns of P{R74G01-GAL4.DBD} and P{R41G07-p65.AD}.
has_exact_synonym: {"annotations":{},"value":"GMR_SS00796"},{"annotations":{},"value":"SS00796"},{"annotations":{},"value":"JRC_SS00796"}
iri: http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp0118721
label: P{R74G01-GAL4.DBD} ∩ P{R41G07-p65.AD} expression pattern
label_rdfs: P{R74G01-GAL4.DBD} ∩ P{R41G07-p65.AD} expression pattern
qsl: P1R74G014GAL44DBD1_8_P1R41G074p654AD1_expression_pattern
short_form: VFBexp_FBtp0122237FBtp0118721
sl: P1R74G014GAL44DBD1_8_P1R41G074p654AD1_expression_pattern
uniqueFacets: Expression_pattern

KB has

<id>:462066
description:The sum of all cells at the intersection between the expression patterns of P{R74G01-GAL4.DBD} and P{R41G07-p65.AD}.
iri:http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp0118721
label:P{R74G01-GAL4.DBD} ∩ P{R41G07-p65.AD} expression pattern
short_form:VFBexp_FBtp0122237FBtp0118721
synonyms:SS00796,GMR_SS00796,JRC_SS00796
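
A quick way to see which of the two pdb.v4 nodes resembles the KB node is to compare property names. The sketch below uses only the property names listed in the records above (values are left out, and they do differ in places, e.g. `synonyms`); `property_key_diff` is a hypothetical helper.

```python
def property_key_diff(a, b):
    """Return (keys only in a, keys only in b, shared keys) as sorted lists."""
    keys_a, keys_b = set(a), set(b)
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a), sorted(keys_a & keys_b)


# Property names copied from the records in this issue (values omitted).
PDB_NODE_545644 = {"description", "iri", "label", "short_form",
                   "synonyms", "uniqueFacets"}
PDB_NODE_178844 = {"curie", "description", "has_exact_synonym", "iri", "label",
                   "label_rdfs", "qsl", "short_form", "sl", "uniqueFacets"}
KB_NODE_462066 = {"description", "iri", "label", "short_form", "synonyms"}

only_pdb, only_kb, shared = property_key_diff(PDB_NODE_545644, KB_NODE_462066)
print(only_pdb, only_kb, shared)
```

By property names alone, node 545644 differs from the KB node only in having `uniqueFacets`, whereas node 178844 carries the additional `curie`, `has_exact_synonym`, `label_rdfs`, `qsl` and `sl` properties.
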
dosumis commented 2 years ago

The simpler one looks like it was introduced during side loading of expression data, which would fit with a failure to merge.

Code: feature_tools.FeatureMover.gen_split_ep_feat has:

self.ni.add_node(labels=['Class'],
                 IRI=iri,
                 attribute_dict=ad)

https://github.com/VirtualFlyBrain/VFB_neo4j/blob/master/src/uk/ac/ebi/vfb/neo4j/flybase2neo/feature_tools.py#L413

--->

statement = "MERGE (n:%s { iri: '%s' }) set n.short_form = '%s'" ...

https://github.com/VirtualFlyBrain/VFB_neo4j/blob/master/src/uk/ac/ebi/vfb/neo4j/KB_tools.py#L497

So MERGE should work fine as long as the target node has the :Class neo label and the iris match. From inspection of the DBs, this seems to be the case.
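
To make that concrete, here is a sketch that fills in only the fragment of the statement quoted above. The rest of the real statement is elided in this thread and is not reproduced. The function name and the argument order (label, iri, short_form) are assumptions for illustration, not the actual KB_tools code.

```python
def merge_statement_fragment(label, iri, short_form):
    """Fill in only the quoted fragment of the MERGE statement (assumed arg order)."""
    template = "MERGE (n:%s { iri: '%s' }) set n.short_form = '%s'"
    return template % (label, iri, short_form)


print(merge_statement_fragment(
    "Class",
    "http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp0118721",
    "VFBexp_FBtp0122237FBtp0118721",
))
```

MERGE matches on the whole pattern, so an existing node is reused only if it carries the given label and its `iri` is exactly equal to the given string. Any difference in either, even a single character, results in a new node.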

Testing merge behavior against pdb-dev (also on v4):

MERGE (c:Class {iri: 'http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp011872'} ) SET c.fu = 'bar'

Adds yet another class. But queries between these classes look broken:

[image]

Very confusing. Could there be some character encoding issue or indexing bug?
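
One cheap check for encoding or typing differences is to compare the iri strings character by character outside the database. The sketch below does that for the two iris as they appear in this thread: the one in the original MATCH query and the one in the test MERGE. As transcribed here, the test iri appears to be one character shorter (it lacks the final `1`), which on its own would make MERGE create a new node. Whether that reflects what was actually run or is just a copy/paste slip in the issue text is not clear. The helper names are hypothetical.

```python
import unicodedata


def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One string is a prefix of the other (or they are identical).
    return None if len(a) == len(b) else min(len(a), len(b))


def describe(s):
    """Summarise properties that often hide encoding problems."""
    return {
        "length": len(s),
        "is_ascii": s.isascii(),
        "is_nfc": unicodedata.is_normalized("NFC", s),
    }


# Strings copied verbatim from this issue.
ORIGINAL_IRI = "http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp0118721"
TEST_IRI = "http://virtualflybrain.org/reports/VFBexp_FBtp0122237FBtp011872"

print(describe(ORIGINAL_IRI), describe(TEST_IRI))
print(first_difference(ORIGINAL_IRI, TEST_IRI))
```

The same two helpers could be run on the `iri` values returned for the two duplicate nodes, to rule out invisible characters or differing Unicode normalisation.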

Robbie1977 commented 2 years ago

There is no duplicate showing at http://pdbl.p2.virtualflybrain.org/browser/ but there is one in http://pdbsl.p2.virtualflybrain.org/browser/, so it must be introduced after the generic pipeline has loaded, as part of the sideloading.

I've initially ruled out the first step (https://github.com/VirtualFlyBrain/pipeline/blob/pipeline2/process.sh), as nothing like what we are looking for is added:

[rancher@parsley jenkins-LoadPDB2-175]$ cat add_refs_for_anat.out | grep VFBexp
Processing chunk of 40 of 90 starting with: b'OPTIONAL MATCH (s:Class { short_form:\'VFBexp_FBtp0084107\' }) OPTIONAL MATCH (o:Individual { short_form:\'Unattributed\' }) FOREACH (a IN CASE WHEN s IS NOT NULL THEN [s] ELSE [] END | FOREACH (b IN CASE WHEN o IS NOT NULL THEN [o] ELSE [] END | MERGE (a)-[re:has_reference]->(b) SET re.type = \'Annotation\' SET re.typ = "syn" SET re.scope = "has_exact_synonym" SET re.value = [\'Erm-GAL4 expression pattern\'] SET re.label = \'has_reference\' SET re.short_form = \'references\' SET re.iri = \'http://purl.org/dc/terms/references\' )) RETURN { `VFBexp_FBtp0084107`: count(s), `Unattributed`: count(o) } as match_count'
Processing chunk of 43 of 90 starting with: b'OPTIONAL MATCH (s:Class { short_form:\'VFBexp_FBtp0061640\' }) OPTIONAL MATCH (o:Individual { short_form:\'Unattributed\' }) FOREACH (a IN CASE WHEN s IS NOT NULL THEN [s] ELSE [] END | FOREACH (b IN CASE WHEN o IS NOT NULL THEN [o] ELSE [] END | MERGE (a)-[re:has_reference]->(b) SET re.type = \'Annotation\' SET re.typ = "syn" SET re.scope = "has_exact_synonym" SET re.value = [\'Ktl-GAL4 expression pattern\'] SET re.label = \'has_reference\' SET re.short_form = \'references\' SET re.iri = \'http://purl.org/dc/terms/references\' )) RETURN { `VFBexp_FBtp0061640`: count(s), `Unattributed`: count(o) } as match_count'
Processing chunk of 44 of 90 starting with: b'OPTIONAL MATCH (s:Class { short_form:\'VFBexp_FBtp0060060\' }) OPTIONAL MATCH (o:Individual { short_form:\'Unattributed\' }) FOREACH (a IN CASE WHEN s IS NOT NULL THEN [s] ELSE [] END | FOREACH (b IN CASE WHEN o IS NOT NULL THEN [o] ELSE [] END | MERGE (a)-[re:has_reference]->(b) SET re.type = \'Annotation\' SET re.typ = "syn" SET re.scope = "has_exact_synonym" SET re.value = [\'P{GMR40H02-GAL4} expression pattern\'] SET re.label = \'has_reference\' SET re.short_form = \'references\' SET re.iri = \'http://purl.org/dc/terms/references\' )) RETURN { `VFBexp_FBtp0060060`: count(s), `Unattributed`: count(o) } as match_count'
Processing chunk of 50 of 90 starting with: b'OPTIONAL MATCH (s:Class { short_form:\'VFBexp_FBtp0062535\' }) OPTIONAL MATCH (o:Individual { short_form:\'Unattributed\' }) FOREACH (a IN CASE WHEN s IS NOT NULL THEN [s] ELSE [] END | FOREACH (b IN CASE WHEN o IS NOT NULL THEN [o] ELSE [] END | MERGE (a)-[re:has_reference]->(b) SET re.type = \'Annotation\' SET re.typ = "syn" SET re.scope = "has_exact_synonym" SET re.value = [\'P{GMR73C07-GAL4} expression pattern\'] SET re.label = \'has_reference\' SET re.short_form = \'references\' SET re.iri = \'http://purl.org/dc/terms/references\' )) RETURN { `VFBexp_FBtp0062535`: count(s), `Unattributed`: count(o) } as match_count'
Processing chunk of 65 of 90 starting with: b'OPTIONAL MATCH (s:Class { short_form:\'VFBexp_FBtp0122331FBtp0118760\' }) OPTIONAL MATCH (o:Individual { short_form:\'Unattributed\' }) FOREACH (a IN CASE WHEN s IS NOT NULL THEN [s] ELSE [] END | FOREACH (b IN CASE WHEN o IS NOT NULL THEN [o] ELSE [] END | MERGE (a)-[re:has_reference]->(b) SET re.type = \'Annotation\' SET re.typ = "syn" SET re.scope = "has_exact_synonym" SET re.value = [\'LH1614\'] SET re.label = \'has_reference\' SET re.short_form = \'references\' SET re.iri = \'http://purl.org/dc/terms/references\' )) RETURN { `VFBexp_FBtp0122331FBtp0118760`: count(s), `Unattributed`: count(o) } as match_count'
[rancher@parsley jenkins-LoadPDB2-175]$ cat expand_xrefs.out | grep VFBexp
[rancher@parsley jenkins-LoadPDB2-175]$ cat import_pub_data.out | grep VFBexp
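
For the remaining stages, it may be easier to search the output files for the specific short_form rather than for `VFBexp` in general. The sketch below does that. `lines_mentioning` is a hypothetical helper and the file names are the ones used in the commands above. Note that the log lines above only show the statement each chunk starts with, so a miss here does not prove the node was untouched by that step.

```python
from pathlib import Path


def lines_mentioning(paths, needle):
    """Yield (file name, line number, line) for every line containing needle."""
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield Path(path).name, number, line.rstrip("\n")


# File names taken from the commands above; short_form from this issue.
OUTPUT_FILES = ["add_refs_for_anat.out", "expand_xrefs.out", "import_pub_data.out"]
TARGET_SHORT_FORM = "VFBexp_FBtp0122237FBtp0118721"
```

Run from the job directory, `list(lines_mentioning(OUTPUT_FILES, TARGET_SHORT_FORM))` would return any matching lines together with the file and line number they came from.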

Starting on the next stages...