INCF / neuroshapes

Open schemas for F.A.I.R. neuroscience data
https://incf.github.io/neuroshapes/
Creative Commons Attribution 4.0 International

Shouldn't a Person email be unique? #196

Open wizmer opened 5 years ago

wizmer commented 5 years ago

Hi,

Due to the asynchronous nature of Nexus, I have mistakenly created multiple instances of a Person that all represent the same actual person. This comes from a bad design on my side, but I was wondering whether it would be doable and beneficial to constrain the uniqueness of the email address directly in the schema?

Thanks

MFSY commented 5 years ago

Hi @wizmer, it is certainly beneficial to be able to constrain the uniqueness of a given property (for example email), but this is really hard to do with SHACL and at scale. Unfortunately, I don't think we have that capacity right now. But if you are using pyxus to interact with the Nexus API, there is an option that helps make sure an entity is not submitted twice.

The workaround is to add a schema:identifier property to the person instance and use the person's email as its value. The find_by_identifier function can then be used to check, through search, whether a person with a given email value is already registered. You can generalize this with find_by_field if you want to use another field as the identifier.
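As a rough illustration of this check-before-create pattern, here is a minimal sketch. It does not use the real pyxus API: `InMemoryPersonStore`, its `create` method, and `get_or_create_person` are hypothetical stand-ins, and lowercasing the email before using it as the identifier is an assumption made for the example.

```python
import uuid
from typing import Optional


class InMemoryPersonStore:
    """Hypothetical stand-in for a Nexus-backed repository (not the pyxus API)."""

    def __init__(self) -> None:
        self.instances = {}

    def find_by_identifier(self, identifier: str) -> Optional[dict]:
        # Linear scan standing in for a search query on schema:identifier.
        for person in self.instances.values():
            if person["schema:identifier"] == identifier:
                return person
        return None

    def create(self, person: dict) -> dict:
        # Each new instance gets its own generated UUID, as a real store would do.
        stored = {"@id": str(uuid.uuid4()), **person}
        self.instances[stored["@id"]] = stored
        return stored


def get_or_create_person(store: InMemoryPersonStore, email: str, name: str) -> dict:
    """Return the registered person for this email, creating one only if none exists."""
    # Assumption: emails are compared case-insensitively and without padding.
    identifier = email.strip().lower()
    existing = store.find_by_identifier(identifier)
    if existing is not None:
        return existing
    return store.create(
        {
            "schema:name": name,
            "schema:email": email,
            "schema:identifier": identifier,
        }
    )


store = InMemoryPersonStore()
first = get_or_create_person(store, "Jane.Doe@example.org", "Jane Doe")
second = get_or_create_person(store, " jane.doe@example.org ", "J. Doe")
```

In this sketch the second call returns the instance created by the first one, because the lookup is synchronous. With an asynchronously updated search index the lookup can miss a very recent creation.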

I also suggest using ORCID iDs as person identifiers whenever possible (at least for people working in a research context).
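If ORCID iDs are used as identifiers, it may be worth validating them on the client before submission. The sketch below checks the final character of an ORCID iD against the ISO 7064 MOD 11-2 checksum that ORCID documents for its identifiers. The function names are made up for this example and are not part of pyxus or Nexus.

```python
def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the length, the digits and the check character of a hyphenated ORCID iD."""
    compact = orcid.replace("-", "").upper()
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15]
    if not all(ch in "0123456789" for ch in base):
        return False
    return orcid_check_digit(base) == check
```

This only confirms that the string is well formed. It does not tell you whether the iD is actually registered or belongs to the person in question.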

Maybe @olinux can weigh in and help.

olinux commented 5 years ago

Hi everyone, here's my two cents: since we have/had the same problem, we've introduced the schema:identifier in all our instances (as a hash). It represents a value that defines an entity, so that different entities can be clearly distinguished regardless of their generated UUID. The availability and strength of these keys depend on your data. So if an email is distinctive enough for your data (e.g. the probability is low enough that an e-mail address is re-used after a person has left the organization and then appears with another meaning in your data), you can use it as a key that is checked for existence before uploading.

Since indices in elasticsearch as well as blazegraph are updated asynchronously, it could still happen that the same entity is uploaded twice within a short time: the check for existence is based on either elasticsearch or blazegraph and could report an instance as non-existent while the first message of its creation is still in the queue. If you don't want to introduce complex and time-expensive synchronization mechanisms (such as client-side locks), you will not be able to eliminate that risk as far as we can tell, and you need to introduce cleanup mechanisms.

If you're interested in more detail about what we're doing to prevent these things from happening, feel free to ask - so we can give you a short tour.
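To make the two ideas above more concrete (a hashed key and a cleanup pass), here is a small sketch. It is not the actual code described in this comment: the SHA-256 choice, the normalization, the `created_at` field and both function names are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def identifier_hash(*key_parts: str) -> str:
    """Derive a stable key from the values that define an entity (assumed: SHA-256)."""
    normalized = "|".join(part.strip().lower() for part in key_parts)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(instances: list) -> dict:
    """Group instances by schema:identifier and split each group of duplicates.

    Returns a mapping identifier -> (instance to keep, instances to clean up).
    The oldest instance, by the hypothetical created_at field, is the one kept.
    """
    groups = defaultdict(list)
    for instance in instances:
        groups[instance["schema:identifier"]].append(instance)

    duplicates = {}
    for identifier, group in groups.items():
        if len(group) > 1:
            ordered = sorted(group, key=lambda inst: inst["created_at"])
            duplicates[identifier] = (ordered[0], ordered[1:])
    return duplicates


key = identifier_hash("jane.doe@example.org")
instances = [
    {"@id": "uuid-1", "schema:identifier": key, "created_at": 1},
    {"@id": "uuid-2", "schema:identifier": key, "created_at": 2},
    {"@id": "uuid-3", "schema:identifier": identifier_hash("john@example.org"), "created_at": 3},
]
to_clean = find_duplicates(instances)
```

A cleanup job along these lines would still need to decide what to do with the extra instances (deprecate them, merge them, repoint links to the kept one), which depends on the data.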

wizmer commented 5 years ago

Thanks to both of you for the answers,

@MFSY Yes, that's what I was afraid of. After all, SHACL was developed for the Web. How could you guarantee uniqueness at Web scale?

@olinux Well, I am not using Pyxus but @genric's entity-management library. I am in exactly the case you described, where the indices were not updated yet, so instances were created multiple times. I am currently implementing a client-side lock that blocks until the indices are updated. I think I have no other choice. Have you already implemented such a thing in Pyxus? If yes, I'd be interested in having a look.
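For illustration, a client-side lock of this kind might look like the sketch below. It is an assumption-laden outline, not code from Pyxus or entity-management: `exists` and `create` are hypothetical callables wrapping the search query and the creation call, and the lock only serializes writers inside a single process.

```python
import threading
import time
from typing import Callable

# Serializes check-and-create within this process only.
_create_lock = threading.Lock()


def wait_until_indexed(is_indexed: Callable[[], bool],
                       timeout: float = 10.0,
                       interval: float = 0.2) -> bool:
    """Poll until is_indexed() returns True; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout
    while True:
        if is_indexed():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def create_once(identifier: str,
                exists: Callable[[str], bool],
                create: Callable[[str], None],
                timeout: float = 10.0,
                interval: float = 0.2) -> bool:
    """Create the entity unless it is already indexed.

    Returns True if a new entity was created, False if one already existed.
    The lock is held until the new entity becomes visible in the index, so the
    next caller in this process cannot miss it.
    """
    with _create_lock:
        if exists(identifier):
            return False
        create(identifier)
        if not wait_until_indexed(lambda: exists(identifier), timeout, interval):
            raise TimeoutError(f"{identifier!r} was not indexed within {timeout} s")
        return True
```

Two separate clients can still race each other with this approach, so it reduces the number of duplicates rather than ruling them out.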

jdcourcol commented 5 years ago

@MFSY What strategy should we adopt for the BBP Nexus database? I thought so far that the email was the unique identifier. Is that assumption still valid?

MFSY commented 5 years ago

Hi @jdcourcol,

I propose we take this discussion offline. I'll organize a call for that to revisit how identifiers are handled with the latest Nexus development. But I can say that while emails are used as identifiers on some data, we know this is not a final solution. People can change emails, have multiple emails from different institutions and there is a risk of repurposing already used emails.

As you can see from the question and answers above, it is not trivial to enforce entity uniqueness based on an email value or any other property value with Nexus v0. With Nexus v1, we are in a better position to implement best practices in terms of entity and person identification. For example, we have started to look at ORCID as a person identifier (a person can still have one or many emails). We think it makes sense to use ORCID in a research context.

Of course the devil is in the details and we need to coordinate on this.