Unique public IDs for items

dllahr commented 12 years ago

Current agreement:

Here are some conclusions I propose based on the discussion:

Private CAP systems do not want to reveal how many items they are creating; this rules out having a central system provide IDs for every item.
The requirement that users have a unique ID for items across all CAP instances precludes keeping IDs entirely local and then assigning a synonym when publishing to public BARD.
A public BARD CAP / DW exists somewhere; use this system to coordinate the prefixes.

original statement

According to users, we must have unique, public IDs available for items in the system: assay definitions, experiments, and projects.

This holds regardless of whether an item is created in a local instance or a public instance: two different items must never share an ID.

example use cases:

referring to results in a publication
More general: a user creates an Assay Definition (could be a Project, Experiment, etc.) in a CAP (let's call it CAP A). CAP A returns an ID. The user requires this ID to be unique across all BARD systems (both CAPs and warehouses) in the world.

Here are some possible solutions:

a central, public source provides IDs to itself and to local instances of BARD. problem: requires private instances to hit the central system every time they need a new item, potentially often, reducing privacy
a public ID contains a prefix indicating where it originated. public CAP uses CAP. Local instances request their prefix when installing / setting up, and then use that.
all IDs are local and systems will maintain list of synonyms - problem: conflicts with user requirement of uniqueness
decentralized system to generate IDs and/or prefixes. examples:
- something like bitcoin http://en.wikipedia.org/wiki/Bitcoin
- LSID: http://en.wikipedia.org/wiki/LSID problem: is it maintained?

Pros and cons of Central:

pros: least ambiguity, easy to publish from local to public or local to local
cons: have to maintain a central system to generate IDs. local instances will be letting the outside world know how much they are doing (they are loath to share anything about what they are working on)

Pros and cons of Prefix:

pros: mostly non-ambiguous, relatively easy to publish from local to public or local to local. central server / registry of prefixes is used only rarely. Local instances are not "publishing" their rates of item generation
cons: could be ambiguous IDs if users fail to include prefixes. Still have to maintain a central system to register prefixes
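
As an illustration of the prefix option, here is a minimal Python sketch. It assumes a hypothetical `PrefixRegistry` standing in for the public CAP and a made-up `PREFIX-number` ID format; the class names, the site prefixes, and the format are all illustrative assumptions, not part of BARD.

```python
import itertools


class PrefixRegistry:
    """Hypothetical central registry; contacted once per installation."""

    def __init__(self):
        # The public CAP uses the prefix "CAP", as proposed above.
        self._owners = {"CAP": "public"}

    def register(self, prefix, owner):
        prefix = prefix.upper()
        if prefix in self._owners:
            raise ValueError(f"prefix {prefix!r} is already registered")
        self._owners[prefix] = owner
        return prefix


class LocalIdGenerator:
    """Issues IDs locally; never contacts the registry after setup."""

    def __init__(self, prefix):
        self._prefix = prefix
        self._counter = itertools.count(1)

    def next_id(self):
        return f"{self._prefix}-{next(self._counter)}"


registry = PrefixRegistry()
site_a = LocalIdGenerator(registry.register("SITEA", "hypothetical site A"))
site_b = LocalIdGenerator(registry.register("SITEB", "hypothetical site B"))

ids = [site_a.next_id(), site_a.next_id(), site_b.next_id()]
print(ids)  # ['SITEA-1', 'SITEA-2', 'SITEB-1']
```

The registry only ever sees the one-time `register` call, so it learns nothing about how many items a local instance creates afterwards.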

Pros and cons of synonyms:

pros: no central server required for anything
cons: most ambiguous IDs. individual systems need to maintain synonyms when they import items from other systems. Users do not want to deal with synonyms.

Pros and cons of decentralized

pros: as with a prefix or centralized system, minimal ambiguity in IDs. Easy to publish from one instance to another
cons: experimental / risky - can we duplicate the bitcoin algorithm / process in our domain? How well established is LSID?

jasiedu commented 12 years ago

I would like to understand why users want these Identifiers. I would also like to understand why and how it relates to public/private/local instances of BARD.

dllahr commented 12 years ago

Here are some reasons:

publication reference
communication with each other
searching the system

schatwin commented 12 years ago

One might as well ask why users use the CID and AID numbers from PubChem.

It's a shorter, more accurate name that simplifies communication.

Their usage depends on the scope and context of the communication.

Simon

chungtd commented 12 years ago

As long as we somehow link back to PubChem we need these identifiers

CID = a canonical representation of a chemical structure, not an actual physical sample; as such it cannot be directly matched to any data, biological or chemical analytical (e.g. purity or sample weight). All data in PubChem has a matching SID. You cannot measure the biological property of a CID. A query of a CID should expose all associated SIDs and their associated data. If different SIDs associated with a CID have widely disparate biological activity values for the same AID, this could indicate variation between batches or lots, in storage, or in the physical state of the sample (solution, in a plate, a vial, or from a different vendor). CID assignments take some time and curation to ensure uniqueness when a "new" lot of a substance is registered. PubChem checks whether the new SID maps to an already existing CID before issuing a new one.

SID = a particular instance of a physical sample and thus a "real" physical compound sample. SIDs can be assigned rather rapidly by PubChem curators, since they are unique by definition. Please note that different batches/lots of a substance or "compound" will have different SIDs, but the same batch can also have different SIDs if it has been sent to a different site that independently registers it. For example, a compound is ordered from a commercial vendor (and has an SID). The receiving user can use the vendor's SID, but more typically the compound is re-registered and receives a new SID. There is some sense in this, because the substance is now stored under different conditions and by different users, so it may be handled differently and thus may eventually change.

TC

jasiedu commented 12 years ago

In that case they can use the IDs that the system supplies. The CID, AID, and SID supplied by PubChem are just numbers (my guess is that they are database IDs). What I understand least is why those numbers should be unique on every instance of BARD. I think we are adding complexity that we do not need.

jbittker commented 12 years ago

The idea is that you can have a local instance of BARD, say inside a pharma company, and do a search against both that and the public BARD and have the merged results returned. If the assay IDs are not unique, I'm guessing that search will not be possible.

Josh

jasiedu commented 12 years ago

But if you have private data, why would you publish it to the public BARD? Let's suppose that you do publish that data to the public BARD: why would you want to merge the results with the data in the private BARD (since they are the same)?

This seems to me to be an implementation issue. Perhaps we should first figure out how we are going to implement the public/private/local BARD thing and then come back to this issue?

jbittker commented 12 years ago

They wouldn't publish private data to public BARD, but they'd want their search results to include both public data and their private data; the results are not the same. Say that Novartis is running SuperSecret Kinase against their library and they want to compare those results to the public MLPCN kinase results, to look for specificity, analogs, etc. So they search for results of some molecules against all assays annotated as kinase targets, and the molecular spreadsheet shows both their internal experiment results and the public experiment results. If the assay IDs were the same, the results would not be returned properly.

Josh

jasiedu commented 12 years ago

Josh, I see exactly what you are talking about, but this is a display issue more than anything. The agent that aggregates the data could prefix these "IDs" at run time. We do not need a central ID generator to do this.
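
A rough Python sketch of that run-time prefixing idea follows. The `merge_results` function, the source labels, and the `source:id` format are hypothetical names chosen for illustration only.

```python
def merge_results(results_by_source):
    """Merge per-instance search hits, namespacing each local ID by its source.

    `results_by_source` maps a source label (chosen by the aggregating agent)
    to a list of (local_id, payload) pairs.
    """
    merged = {}
    for source, hits in results_by_source.items():
        for local_id, payload in hits:
            merged[f"{source}:{local_id}"] = payload
    return merged


merged = merge_results({
    "public": [(42, "MLPCN kinase assay")],
    "local": [(42, "internal kinase assay")],  # same number, different assay
})
print(sorted(merged))  # ['local:42', 'public:42']
```

Note that a label assigned at query time keeps colliding numbers apart for display, but by itself it does not tell the agent whether two hits from different sources describe the same item.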

dllahr commented 12 years ago

In that case if you are never publishing you are correct. But some people will publish items from the private system to the public system. Some people will decide that they can publish something after they are "done" with it. Others will do it as part of grant requirements.

jasiedu commented 12 years ago

I think we should work out what it means to "publish" before we start talking about these namespace identifiers. Are we talking about the CAP, the warehouse, or both? Does this relate to the public/private/local BARD conversation that we started in the EWG a couple of months back but that seems to have been pushed to year 2, or is this something entirely new? In any case, can we get some very specific use cases so we can better understand what we are all talking about?

dllahr commented 12 years ago

Good call on the use cases.

Publish has 2 meanings

publication in journal. This is absolutely a requirement
"publish" to the public CAP / warehouse

Regarding warehouse and CAP, the ID must be the same to the user. I guess it doesn't matter what ID is used in any warehouse as long as the user only ever sees the same ID for the same item regardless of whether they are looking at local CAP, public CAP, local web client/thick client, or public web client/thick client.

This is not specifically about local/public instances, but it does affect that. Regardless of that, we need to establish how IDs are handled in CAP and the warehouse, even in just the public systems.

jasiedu commented 12 years ago

So let me see if I can summarize one of the use cases that you mentioned.

Say I create an Assay (could be a Project, Experiment, etc.) in a CAP (let's call it CAP A). CAP A gives me back an ID. I want this ID to be unique across all BARD systems (both CAPs and warehouses) in the world.

Is that a fair summary?

jasiedu commented 12 years ago

We should take a look at LSID's and see if it is something we could use, assuming that the "crude" use case above is what our users want.

dllahr commented 12 years ago

Yes, I would say that is a fair description of the use case. LSID (http://en.wikipedia.org/wiki/LSID) looks like it is a possible solution.

rajarshi commented 12 years ago

Looks like a lot of good discussion has taken place. A few comments:

LSID looks nice, but to what extent is it supported and in use (outside the taxonomy community)? Many of the links in the Wikipedia page are dead. Also, it seems that we would need to deploy an LSID resolver
I see CID & SID staying in use simply because we don't deal with chemical registration
I expect AID's will be deprecated as primary identifiers (though they will of course be available) and replaced by BARD experiment identifiers
CAP ID's are stored in the warehouse - allowing the CAP to refer to data that has been sent to the warehouse, without having to keep track of the corresponding BARD ID's
I agree that from the user's point of view, they should not have to worry about the CAP or warehouse. This suggests that until data has been transferred into the warehouse, the user does not get a final identifier from the CAP. (Similar to what PubChem does now)
I agree with Jacob that a search across public and local repositories can be merged by the agent running the search (say by prepending a prefix etc). The downside is that the agent will not know whether two instances of an entity are identical or not based on an identifier. Extra logic could certainly be applied by the agent to test equality of entities - but it seems wasteful when unique identifiers would do the job
I think it's safe to assume that something will be published from a local instance to the main public instance - we should prepare for that eventuality

jasiedu commented 12 years ago

Dave, do we have a time line on this issue? It seems the part that deals with the CAP ID/Warehouse ID sync should happen pretty soon

schatwin commented 12 years ago

I must admit, I don't see the level of confusion that this solution is designed to avoid.

I think it's reasonable that when an assay (or experiment) is transferred to the Public BARD, it should get a new number in the public system. It may be the local installation will need a mechanism to convert their local, previously private, number to the now public one - or not. The public assay will appear in the local install if and when the local install is refreshed from the public BARD. At that time the local system will have a duplicate pair of assays and we should probably be able to help the user sort this out; though I see a wide variety of requirements about how to sort it out.
If we say that the number has to be preserved when copying from private to public, then we need a central source of IDs. We can do this in two ways: all IDs are served from the center (public), or we have a registry wherein new IDs generated locally are sent to a central registry for cross-mapping at creation. I suspect that neither of these will be very acceptable to commercial corporations who dislike the idea that anything about their research (including what queries they run on a public domain system) gets out into the world.

And frankly this problem is well understood and handled in the chemistry domain for CIDs, so similar approaches for ADIDs should be acceptable and not have a high risk of confusion.

I see the public ADIDs becoming the de facto replacement for AIDs over time, but I don't see a need to enforce high levels of uniqueness. We might be well advised to have a storage place for a local assay identifier in the DB which local users can use as an extension from their own previously used system (even if it was just an Excel spreadsheet!)

jeremyjyang commented 12 years ago

Some points which I hope are helpful:

As TC and Simon emphasize, the precedence of CIDs and SIDs is important to review and consider. Note further that there is a fundamental difference between SIDs, which are information-free identifiers, and CIDs, which have a one-to-one correspondence with (OpenEye) canonical smiles. (You could argue that the number itself is information free, but two different CIDs must have different Kekule structures). What class of ID will Assay Definition ID be? I presume like CID, unique by virtue of a rigorous canonicalization algorithm for assays.
(If I understood the plan...) A canonicalized assay based ADID, like CIDs, is not just a one time implementation task, but an ongoing commitment. What happens when BARD v2.0 designers want to improve on the algorithm? Maybe a finer grained assay classification results in two assay definitions (ADs) where there was one? (This is a problem for CIDs which may keep PubChem tied to one version of OpenEye, I'm not sure.)
If unique ADs are algorithmically determined, private instances should be able to merge their assays and search results automatically.
Like SIDs, PubChem AIDs are information-free. This is at the root of why BARD is so important. So I certainly agree with the attention to this discussion.

dllahr commented 12 years ago

Jacob - I don't have a specific timeline, but as you say it should be soon and it should be in the "beta" release.

Simon - to your 2nd option, instead of having the ID generated centrally we also discussed having the prefix registered centrally. That's done only once when setting up a local BARD instance.

Jeremy - I think algorithmically determined ADID's are a neat idea. We will certainly have to look into that.

jasiedu commented 12 years ago

It seems we need to have a central server somewhere that doles out these IDs (or prefixes). We also need to figure out where this would be hosted. I think, though, that the immediate thing to fix is to sync the IDs between the CAP and the warehouse. I am not fully convinced that a Universal ID is the way to go because I still have not seen a use case where one is required. Perhaps what we need is for clients to associate a URL (DNS name, since these are unique) with every resource that they deposit, and then we should make sure that our search mechanism allows one to also specify a DNS name together with an ID. This way we would not need to build and maintain a separate system for maintaining the uniqueness of the IDs. Just a thought.
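
The DNS-name-plus-ID suggestion can be sketched in Python as follows. The `ResourceRef` type, the `parse_ref` helper, the `host/id` text format, and the example host names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceRef:
    host: str      # DNS name of the depositing instance
    local_id: int  # ID as issued by that instance


def parse_ref(text):
    """Parse '<dns-name>/<id>' into a ResourceRef (the format is an assumption)."""
    host, _, local_id = text.rpartition("/")
    if not host or not local_id.isdecimal():
        raise ValueError(f"expected '<dns-name>/<id>', got {text!r}")
    # DNS names are case-insensitive, so normalise before comparing.
    return ResourceRef(host.lower(), int(local_id))


a = parse_ref("bard.example.org/17")
b = parse_ref("cap.example.com/17")
print(a == b)  # False: same number, different instances
```

Here uniqueness comes from the pair rather than from the number, so each instance can keep issuing plain local IDs.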

schatwin commented 12 years ago

@jeremyjyang But the CID is not an encoding of the SMILES string - it's a simple sequential number issued by PubChem when a structure is registered. It is guaranteed to represent a unique SMILES because PubChem enforces that during the registration process. How they do it (which version of OpenEye, e.g.) is not relevant to that fact of uniqueness.

In the same way BARD will issue a new ADID for each unique assay registered in the CAP and will have a process to determine that the assay is unique. When the rules for uniqueness are changed we need to specify whether the new rules are retrospective and might eliminate some previous ADIDs. This could be nasty and will require a 'redirection' facility so that people using the old deprecated ADID do not get lost. I'd be interested to know what happens with CIDs in this circumstance?
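
One way such a 'redirection' facility might look, as a minimal Python sketch: the `AdidRedirects` class and the ADID numbers are hypothetical, and the sketch assumes a deprecated ADID is replaced by exactly one other ADID.

```python
class AdidRedirects:
    """Maps deprecated ADIDs to the ADID that replaced them."""

    def __init__(self):
        self._replaced_by = {}

    def deprecate(self, old_adid, new_adid):
        if old_adid in self._replaced_by:
            raise ValueError(f"ADID {old_adid} is already deprecated")
        if self.resolve(new_adid) == old_adid:
            raise ValueError("redirect would create a cycle")
        self._replaced_by[old_adid] = new_adid

    def resolve(self, adid):
        """Follow redirects until reaching an ADID that is still current."""
        while adid in self._replaced_by:
            adid = self._replaced_by[adid]
        return adid


redirects = AdidRedirects()
redirects.deprecate(101, 205)  # 101 merged into 205 under new rules
redirects.deprecate(205, 310)  # 205 later merged into 310
print(redirects.resolve(101))  # 310
```

Someone citing the old ADID 101 from a publication would still land on the current record.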

BARD is complicated by the use of 'local' installs and the ramifications of publication of a previously private assay into the public domain. How about a really simple solution: ensure that all assays registered in a private installation get numbers greater than 10,000,000 by setting the number generator to start there?
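
The offset idea is small enough to sketch directly in Python. The constant name and the `is_private` helper are illustrative assumptions; only the 10,000,000 threshold comes from the comment above.

```python
import itertools

PRIVATE_ID_START = 10_000_000  # threshold suggested in the comment above

public_ids = itertools.count(1)
# Private installations start just above the threshold.
private_ids = itertools.count(PRIVATE_ID_START + 1)


def is_private(adid):
    """True if the ADID falls in the range reserved for private installs."""
    return adid > PRIVATE_ID_START


first_public = next(public_ids)
first_private = next(private_ids)
print(first_public, first_private)  # 1 10000001
```

This separates private numbers from public ones, but two private installations using the same starting point would still issue the same numbers as each other.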

jeremyjyang commented 12 years ago

The details of PubChem CIDs may be a useful case study ("precedent" is what I meant to type before not "precedence"). PubChem actually devised a special "Kekule canonical smiles" which was not an official OpenEye algorithm, mindful that aromaticity can be debatable and that aromatic smiles in databases result in import errors with different tools. Evan B. et al. showed ingenuity as there can be multiple valid Kekule smiles for an aromatic compound (the algorithm removes bond order then re-determines so it is canonical). To my knowledge the rules for PubChem structure canonicalization have not changed (I will try to verify that), and that promise, and its consequences, as it may apply to ADIDs, are a big deal, both in practical and conceptual ways.

benralexander commented 12 years ago

One note about possible use cases: providing identifiers to people wishing to store new, private compounds on their private instance of Bard may be a non-issue. I have a hard time believing that any company with enough money to develop an NCE will be willing to risk their investment by loading that information into our system (imagining that somehow their molecular structure might be exposed). Instead they’ll use their own well-protected chemical registration system to keep track of their molecules until they go public. Undoubtedly we WILL need to assign IDs to compounds that are introduced to our public Bard system, but I think that realistically these are the only novel chemical identifiers we’ll need.

dllahr commented 12 years ago

Thanks all for the continued great discussion. I'd like to work towards some concrete items.

Here are the given requirements:

Assay def, experiment, project ID unique across all BARD systems public and private
private systems not required to hit a central server to get ID for every item (privacy)

Here are some conclusions I propose based on the discussion:

The second requirement seems to rule out having a central system provide IDs for every item.
user uniqueness requirement precludes keeping things entirely local and then getting a synonym when publishing to public BARD
There is a public BARD CAP / DW somewhere - use this system to coordinate the prefixes

rajarshi commented 12 years ago

dllahr wrote: "user uniqueness requirement precludes keeping things entirely local and then getting a synonym when publishing to public BARD"

While this seems like the best way given the assumptions above, I fear that we end up with the tangle that is gene names and their synonyms.

dllahr wrote: "There is a public BARD CAP / DW somewhere - use this system to coordinate the prefixes"

I think this is reasonable.
