VirtualFlyBrain / VFB_neo4j

A Python package for writing schema-compliant content to VFB Neo4j DBs
Apache License 2.0

Duplicate creation by pattern.add_anatomy_image_set #90

Open Robbie1977 opened 6 years ago

Robbie1977 commented 6 years ago

It might be an idea to flag duplicate labels (at minimum) before the creation of new nodes.

It's too easy at the moment to create duplicate image sets.
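
A minimal sketch of the kind of pre-creation check suggested above. It is not part of the VFB_neo4j API: the function name `find_duplicate_labels` is hypothetical, and it assumes the labels already in the KB have been fetched into a plain list beforehand.

```python
from collections import Counter


def find_duplicate_labels(existing_labels, new_labels):
    """Return labels from new_labels that would create a duplicate.

    A label is flagged if it already exists in the KB (existing_labels)
    or if it occurs more than once within the batch being loaded.
    Hypothetical helper, not an existing VFB_neo4j function.
    """
    existing = set(existing_labels)
    batch_counts = Counter(new_labels)
    clashes = []
    # Counter keeps first-seen order, so clashes follow the input order.
    for label, count in batch_counts.items():
        if label in existing or count > 1:
            clashes.append(label)
    return clashes
```

A loader could call this before creating any nodes and refuse to proceed (or just warn) when the returned list is non-empty.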

dosumis commented 6 years ago

I thought there was already a uniqueness constraint in the KB to prevent this, but it seems not.

I'm trying to add one but am finding some duplicates, e.g.:



{ "iri": "http://virtualflybrain.org/reports/VFB_00101081",
  "short_form": "VFB_00101081",
  "label": "L1 CNS neuron A00c_A8l?" }

{ "iri": "http://virtualflybrain.org/reports/VFB_00101086",
  "short_form": "VFB_00101086",
  "label": "L1 CNS neuron A00c_A8l?" }

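
For illustration, duplicates like the pair above could be listed by grouping nodes on their label. This is only a sketch: the Cypher string assumes the nodes carry an `Individual` node label, which is an assumption rather than something confirmed in this thread, and `group_duplicates` is a hypothetical helper that works on records already pulled from the KB.

```python
from collections import defaultdict

# Assumed Cypher for finding duplicate labels directly in the KB.
# The `Individual` node label is an assumption, not confirmed here.
DUPLICATE_LABEL_QUERY = """
MATCH (n:Individual)
WITH n.label AS label, collect(n.short_form) AS short_forms
WHERE size(short_forms) > 1
RETURN label, short_forms
"""


def group_duplicates(nodes):
    """Map each label shared by more than one node to its sorted short_forms."""
    by_label = defaultdict(list)
    for node in nodes:
        by_label[node["label"]].append(node["short_form"])
    return {
        label: sorted(ids) for label, ids in by_label.items() if len(ids) > 1
    }


nodes = [
    {"iri": "http://virtualflybrain.org/reports/VFB_00101081",
     "short_form": "VFB_00101081",
     "label": "L1 CNS neuron A00c_A8l?"},
    {"iri": "http://virtualflybrain.org/reports/VFB_00101086",
     "short_form": "VFB_00101086",
     "label": "L1 CNS neuron A00c_A8l?"},
]
duplicates = group_duplicates(nodes)
```

Any label appearing in `duplicates` would have to be resolved before a uniqueness constraint on `label` could be added.
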
dosumis commented 6 years ago

We have a problem!

image

I get the same figure when I specify different VFB IDs:

image

I don't know if these are hooked up to public datasets (will check).

Suggested fix: a Python script picks the highest-numbered ID in each set of duplicates to obsolete. Obsoletion consists of:

- setting is_obsolete: True
- extending the label with ' - OBSOLETE'
- adding a comment pointing to the valid node

TBD: should we add an edge pointing to the valid node?
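
A rough sketch of how the suggested fix might look as a planning step, before anything is written to the KB. The function names and the comment wording are hypothetical. It also assumes that when more than two nodes share a label, the lowest ID is kept and all others are obsoleted; the thread only states that the highest is picked.

```python
def numeric_id(short_form):
    """Extract the numeric part of a short_form such as 'VFB_00101086'."""
    return int(short_form.rsplit("_", 1)[1])


def plan_obsoletion(label, short_forms):
    """Return property updates for the duplicates to obsolete.

    Assumption: the lowest-numbered node is the valid one and every
    higher-numbered node sharing the label is obsoleted.
    """
    ordered = sorted(short_forms, key=numeric_id)
    keep, to_obsolete = ordered[0], ordered[1:]
    return [
        {
            "short_form": short_form,
            "is_obsolete": True,
            "label": label + " - OBSOLETE",
            # Comment wording is illustrative only.
            "comment": "Duplicate of " + keep,
        }
        for short_form in to_obsolete
    ]


plan = plan_obsoletion(
    "L1 CNS neuron A00c_A8l?", ["VFB_00101086", "VFB_00101081"]
)
```

Producing a plan first, rather than writing straight to the KB, leaves room to review each set by hand, which matters while the question of adding an edge to the valid node is still open.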

Robbie1977 commented 6 years ago

As these are all recent additions, we could simply fix the loader script and clear out the duplicates. We'd need to check each set to ensure we aren't losing any edges.
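
One way the edge check could be sketched, assuming the edges of each node have already been fetched as (relation, target) pairs. The function name and the relation and target names in the example are made up for illustration.

```python
def edges_at_risk(kept_edges, duplicate_edges):
    """Return edges present on the duplicate but missing from the kept node.

    Each edge is a (relation, target) tuple. An empty result means that
    deleting the duplicate would not lose any information.
    """
    return sorted(set(duplicate_edges) - set(kept_edges))


# Illustrative data only; these relation and target names are invented.
kept = [("rel_a", "target_1"), ("rel_b", "target_2")]
duplicate = [("rel_a", "target_1"), ("rel_c", "target_3")]
at_risk = edges_at_risk(kept, duplicate)
```

Here `at_risk` would contain the single edge that exists only on the duplicate, so that set would need attention before the duplicate is removed.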

dosumis commented 6 years ago

> We'd need to check each set to ensure we aren't losing any edges.

The Travis cron job checks everything linked to a dataset - and everything is passing on that except for a minority of Dickson (Result: inds_in_datset: 5627 ; Compliant with pattern: 5387). This suggests to me that it is safe to obsolete every duplicate anatomical individual connected to a dataset + its connected channel, nominating the highest (anatomical individual) ID for obsoletion.

I've been running more analyses. I think the problem is smaller than my original post suggests (most datasets only have a handful of affected individuals, the two exceptions being "Dickson2017" and "Eichler2017").

Still puzzling over some of the query results though. Will post more once I have a clearer idea what's going on.