Closed dllahr closed 12 years ago
I would like to understand why users want these Identifiers. I would also like to understand why and how it relates to public/private/local instances of BARD.
Here are some reasons:
Reply to this email directly or view it on GitHub: https://github.com/broadinstitute/BARD/issues/3#issuecomment-10249698
You might as well ask why users use the CID and AID numbers from PubChem.
It's a shorter, more accurate name that simplifies communication.
Their usage depends on the scope and context of the communication.
Simon
-----Original Message----- From: dllahr notifications@github.com Date: Fri, 09 Nov 2012 18:36:47 To: broadinstitute/BARDBARD@noreply.github.com Reply-To: broadinstitute/BARD reply@reply.github.com Subject: Re: [BARD] Unique public IDs for items (#3)
As long as we somehow link back to PubChem, we need these identifiers.
CID = a canonical representation of a chemical structure, not an actual physical sample. As such, it cannot be directly matched to any data, whether biological or chemical-analytical (e.g. purity or sample weight). All data in PubChem has a matching SID. You cannot measure the biological property of a CID. A query on a CID should expose all associated SIDs and their associated data. If different SIDs associated with a CID have widely disparate biological activity values for the same AID, this could indicate differences in batch, lot, storage, or physical state of the sample (in solution, in a plate, in a vial, or from a different vendor). CID assignments take some time and curation to ensure uniqueness when a "new" lot of a substance is registered: PubChem checks whether the new SID maps to an already existing CID before issuing a new one.
SID = a particular instance of a physical sample, and thus a "real" physical compound sample. SIDs can be assigned rather rapidly by PubChem curators, since they are unique by definition. Please note that different batches/lots of a substance or "compound" will have different SIDs, but the same batch can also have different SIDs if it has been sent to a different site that independently registers it. E.g., a compound is ordered from a commercial vendor (and has an SID). The receiving user can use the vendor's SID, but more typically it is re-registered and receives a new SID. There is some sense in this, because the substance is now stored under different conditions and by different users, so it may be handled differently and thus may eventually change.
TC
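To illustrate the CID/SID relationship described above, here is a minimal sketch in Python. The record layout, the sample numbers, and the `max_ratio` threshold are all assumptions made up for illustration; none of this is real PubChem data or a PubChem API.

```python
from collections import defaultdict

# Hypothetical records: (cid, sid, aid, activity value). Not real PubChem data.
results = [
    (1001, 501, 42, 0.9),
    (1001, 502, 42, 1.1),
    (1001, 503, 42, 25.0),
    (1002, 504, 42, 3.0),
]


def sids_by_cid(rows):
    """Map each CID to the set of SIDs (physical samples) registered under it."""
    mapping = defaultdict(set)
    for cid, sid, _aid, _value in rows:
        mapping[cid].add(sid)
    return dict(mapping)


def disparate_pairs(rows, max_ratio=10.0):
    """Return (cid, aid) pairs whose per-SID values differ by more than max_ratio."""
    grouped = defaultdict(list)
    for cid, _sid, aid, value in rows:
        grouped[(cid, aid)].append(value)
    flagged = []
    for key, values in grouped.items():
        # Only compare strictly positive values so the ratio is well defined.
        if min(values) > 0 and max(values) / min(values) > max_ratio:
            flagged.append(key)
    return sorted(flagged)
```

A flagged pair is only a hint that batch, lot, storage, or vendor differences may be worth a look; the threshold itself is arbitrary.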
In that case they can use the IDs that the system supplies. The CID, AID, and SID supplied by PubChem are just numbers (my guess is that they are database IDs). I guess the thing I do not understand most is why those numbers should be unique on every instance of BARD. I think we are adding a complexity that we do not need.
The idea is that you can have a local instance of BARD, say inside a pharma company, and do a search against both that and the public BARD and have the merged results returned. If the assay IDs are not unique, I'm guessing that search will not be possible.
Josh
But if you have private data, why would you publish it to the public BARD? Let's suppose that you do publish that data to the public BARD: why would you want to merge the results with the data in the private BARD (since they are the same)?
This seems to me to be an implementation issue. Perhaps we should first figure out how we are going to implement the public/private/local BARD thing and then come back to this issue?
They wouldn't publish private data to public BARD, but they'd want their search results to include both public data and their private data; the results are not the same. Say that Novartis is running SuperSecret Kinase against their library and they want to compare those results to the public MLPCN kinase results, to look for specificity, analogs, etc. So they search for results of some molecules against all assays annotated as kinase targets, and the molecular spreadsheet shows both their internal experiment results and the public experiment results. If the assay IDs were the same, the results would not be returned properly.
Josh
Josh, I see exactly what you are talking about, but this is a display issue more than anything. The agent that aggregates the data could prefix these "IDs" at run time. We do not need a central ID generator to do this.
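A minimal sketch of that run-time prefixing, assuming each BARD instance returns rows as plain dictionaries with an `assay_id` field. The instance labels, field names, and values are hypothetical and not taken from BARD itself.

```python
def merge_results(results_by_instance):
    """Merge per-instance result rows, qualifying each assay ID with its source."""
    merged = []
    for instance, rows in results_by_instance.items():
        for row in rows:
            qualified = dict(row)  # copy so the caller's rows are left untouched
            qualified["assay_id"] = f"{instance}:{row['assay_id']}"
            merged.append(qualified)
    return merged


# Both instances happen to use assay ID 17 for different assays.
public_rows = [{"assay_id": 17, "target": "kinase", "ic50_um": 0.8}]
local_rows = [{"assay_id": 17, "target": "kinase", "ic50_um": 2.5}]

merged = merge_results({"public": public_rows, "local": local_rows})
```

The prefix here exists only in the merged view, so it does not help once an item is published from one instance to another and needs a stable ID.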
In that case if you are never publishing you are correct. But some people will publish items from the private system to the public system. Some people will decide that they can publish something after they are "done" with it. Others will do it as part of grant requirements.
617 714 7868 dlahr@broadinstitute.org
I think we should work out what it means to "publish" before we start talking about these namespace identifiers. Are we talking about the CAP, the warehouse, or both? Does this relate to the public/private/local BARD conversation that we started having in the EWG a couple of months back but that seems to have been pushed to year 2, or is this something entirely new? In any case, can we get some very specific use cases so we can better understand what we are all talking about?
Good call on the use cases.
Publish has 2 meanings
Regarding the warehouse and CAP, the ID must be the same to the user. I guess it doesn't matter what ID is used in any warehouse as long as the user only ever sees the same ID for the same item, regardless of whether they are looking at the local CAP, public CAP, local web client/thick client, or public web client/thick client.
This is not specifically about local/public instances, but it does affect them. Regardless, we need to establish how IDs are handled in the CAP and the warehouse, even in just the public systems.
So let me see if I can summarize one of the use cases that you mentioned.
Say I create an Assay (could be a Project, Experiment, etc.) in a CAP (let's call it CAP A). CAP A gives me back an ID. I want this ID to be unique across all BARD systems (both CAPs and warehouses) in the world.
Is that a fair summary?
We should take a look at LSIDs and see if they are something we could use, assuming that the "crude" use case above is what our users want.
Yes, I would say that is a fair description of the use case. LSID (http://en.wikipedia.org/wiki/LSID) looks like it is a possible solution.
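For reference, an LSID is a URN of the form `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing revision. The sketch below builds and parses such strings. The authority `bard.example.org` and the namespace `assay` are placeholders, not identifiers BARD has adopted.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str                  # usually a DNS name owned by the issuer
    namespace: str                  # item type within that authority
    object_id: str                  # ID local to the namespace
    revision: Optional[str] = None  # optional version of the object


def format_lsid(lsid):
    """Render an Lsid as a urn:lsid:... string."""
    parts = ["urn", "lsid", lsid.authority, lsid.namespace, lsid.object_id]
    if lsid.revision is not None:
        parts.append(lsid.revision)
    return ":".join(parts)


def parse_lsid(text):
    """Split a urn:lsid:... string into its components."""
    parts = text.split(":")
    has_prefix = len(parts) >= 2 and [p.lower() for p in parts[:2]] == ["urn", "lsid"]
    if len(parts) not in (5, 6) or not has_prefix:
        raise ValueError(f"not an LSID: {text!r}")
    return Lsid(*parts[2:])
```

Because the authority is normally a DNS name, each local instance could mint IDs under its own authority without consulting a central generator.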
Looks like a lot of good discussion has taken place. A few comments:
Dave, do we have a timeline on this issue? It seems the part that deals with the CAP ID/Warehouse ID sync should happen pretty soon.
I must admit, I don't see the level of confusion that this solution is designed to avoid.
And frankly this problem is well understood and handled in the chemistry domain for CIDs, so similar approaches for ADIDs should be acceptable and not have a high risk of confusion.
I see the public ADIDs becoming the de facto replacement for AIDs over time, but I don't see a need to enforce high levels of uniqueness. We might be well advised to have a storage place in the DB for a local assay identifier, which local users can use as an extension from their own previously used system (even if it was just an Excel spreadsheet!).
Some points which I hope are helpful:
As TC and Simon emphasize, the precedence of CIDs and SIDs is
important to review and consider. Note further that there is a
fundamental difference between SIDs, which are information-free
identifiers, and CIDs, which have a one-to-one correspondence
with (OpenEye) canonical smiles. (You could argue that the
number itself is information free, but two different CIDs must
have different Kekule structures). What class of ID will Assay
Definition ID be? I presume like CID, unique by virtue of a
rigorous canonicalization algorithm for assays.
(If I understood the plan...) A canonicalized assay based
ADID, like CIDs, is not just a one time implementation task, but
an ongoing commitment. What happens when BARD v2.0 designers
want to improve on the algorithm? Maybe a finer grained assay
classification results in two assay definitions (ADs) where
there was one? (This is a problem for CIDs which may keep
PubChem tied to one version of OpenEye, I'm not sure.)
If unique ADs are algorithmically determined, private
instances should be able to merge their assays and search
results automatically.
Like SIDs, PubChem AIDs are information-free. This is at the
root of why BARD is so important. So I certainly agree with the
attention to this discussion.
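One way to picture an algorithmically determined ADID is a hash over a normalized assay description. This is only a sketch under strong assumptions: the fields, the normalization rules, and the `algorithm_version` tag are all hypothetical, and a real canonicalization of assays would be far more involved.

```python
import hashlib
import json


def canonical_form(assay):
    """Normalize an assay description so equivalent inputs serialize identically."""
    normalized = {
        # Lower-case keys and values, and collapse runs of whitespace.
        key.strip().lower(): " ".join(str(value).split()).lower()
        for key, value in assay.items()
    }
    return json.dumps(normalized, sort_keys=True, separators=(",", ":"))


def derived_adid(assay, algorithm_version="v1"):
    """Derive an ID from the canonical form, tagged with the algorithm version."""
    digest = hashlib.sha256(canonical_form(assay).encode("utf-8")).hexdigest()
    return f"{algorithm_version}-{digest[:16]}"
```

Two instances that describe the same assay then arrive at the same ID independently. Tagging the ID with a version makes a change of algorithm visible, which is the ongoing commitment raised above.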
--
Jeremy J Yang | Mgr, Systems & Programming | UNM
Translational Informatics Division | jjyang@salud.unm.edu | http://medicine.unm.edu/informatics/
505.925.7533 | fax:505.925.7625 | mobile:505.350.3256
700 Camino de Salud NE | IDTC 2150 | MSC09 5025 | Albuquerque,
NM 87131 | USA
---"We think in generalities, but we live in detail." - Alfred
North Whitehead
Jacob - I don't have a specific timeline, but as you say it should be soon and it should be in the "beta" release.
Simon - to your 2nd option, instead of having the ID generated centrally we also discussed having the prefix registered centrally. That's done only once, when setting up a local BARD instance.
Jeremy - I think algorithmically determined ADIDs are a neat idea. We will certainly have to look into that.
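The centrally registered prefix could look roughly like the sketch below. The `PrefixRegistry` and `LocalIdGenerator` classes, the `ACME` prefix, and the `PREFIX-number` format are assumptions for illustration only.

```python
class PrefixRegistry:
    """Toy stand-in for a central service that grants one prefix per instance."""

    def __init__(self):
        self._owners = {}

    def register(self, prefix, instance_name):
        # setdefault stores the first owner and returns whoever owns the prefix.
        owner = self._owners.setdefault(prefix, instance_name)
        if owner != instance_name:
            raise ValueError(f"prefix {prefix!r} already registered to {owner!r}")
        return prefix


class LocalIdGenerator:
    """Issues IDs locally after the one-time prefix registration."""

    def __init__(self, prefix, start=1):
        self._prefix = prefix
        self._next = start

    def next_id(self):
        value = f"{self._prefix}-{self._next}"
        self._next += 1
        return value
```

The central service is contacted once at installation time. After that, each instance numbers its own items, and uniqueness follows from the prefix.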
It seems we need to have a central server somewhere that doles out these IDs (or prefixes). We also need to figure out where this would be hosted. I think, though, that the immediate thing to fix is to sync the IDs between the CAP and the warehouse. I am not fully convinced that a universal ID is the way to go, because I still have not seen a use case where... Perhaps what we need is for clients to associate a URL (a DNS name, since these are unique) with every resource that they deposit, and then we should make sure that our search mechanism allows one to also specify a DNS name together with an ID. This way we would not need to build and maintain a different system for maintaining the uniqueness of the IDs. Just a thought.
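A rough sketch of the DNS-name-plus-ID idea, assuming the search index is keyed by the pair. The host names, the `ResourceKey` type, and the `find` helper are hypothetical.

```python
from typing import NamedTuple


class ResourceKey(NamedTuple):
    host: str      # DNS name of the depositing instance
    local_id: int  # ID assigned by that instance


def find(index, local_id, host=None):
    """Return keys matching local_id, optionally restricted to one host."""
    return sorted(
        key for key in index
        if key.local_id == local_id and (host is None or key.host == host)
    )


# Two instances reuse ID 17, but the (host, ID) pairs stay distinct.
index = {
    ResourceKey("bard.example.org", 17): "public kinase assay",
    ResourceKey("bard.internal.example.com", 17): "private kinase assay",
}
```

Searching by ID alone returns every instance's match, and adding the DNS name narrows it to one, so no separate uniqueness service is needed.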
@jeremyjyang But the CID is not an encoding of the SMILES string - it's a simple sequential number issued by PubChem when a structure is registered. It is guaranteed to represent a unique SMILES because PubChem enforces that during the registration process. How they do it (which version of OpenEye, e.g.) is not relevant to that fact of uniqueness.
In the same way BARD will issue a new ADID for each unique assay registered in the CAP and will have a process to determine that the assay is unique. When the rules for uniqueness are changed we need to specify whether the new rules are retrospective and might eliminate some previous ADIDs. This could be nasty and will require a 'redirection' facility so that people using the old deprecated ADID do not get lost. I'd be interested to know what happens with CIDs in this circumstance?
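The 'redirection' facility could be as simple as a table from deprecated ADIDs to their replacements. The sketch below assumes such a table exists as a plain dictionary; the ADID values are invented.

```python
def resolve_adid(adid, redirects):
    """Follow deprecated -> current mappings until a live ADID is reached."""
    seen = set()
    current = adid
    while current in redirects:
        if current in seen:
            # A loop in the table would otherwise never terminate.
            raise ValueError(f"redirect cycle at ADID {current}")
        seen.add(current)
        current = redirects[current]
    return current


# ADID 101 was merged into 205, which was later merged into 310.
redirects = {101: 205, 205: 310}
```

An ADID with no entry in the table resolves to itself, so lookups of current IDs are unaffected.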
BARD is complicated by the use of 'local' installs and the ramifications of publication of a previously private assay into the public domain. How about a really simple solution: ensure that all assays registered in a private installation get numbers greater than 10,000,000 by setting the number generator to start there?
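A sketch of that number-range split, assuming a plain in-memory counter stands in for whatever sequence the database actually uses. The function names are hypothetical.

```python
import itertools

PRIVATE_ID_START = 10_000_000  # private installs issue numbers above this value


def make_id_generator(private):
    """Return a counter starting at 1 (public) or just above the threshold (private)."""
    start = PRIVATE_ID_START + 1 if private else 1
    return itertools.count(start)


def is_private_id(adid):
    """True if the number falls in the range reserved for private installs."""
    return adid > PRIVATE_ID_START
```

This separates private numbers from public ones, but two private installations would still issue the same numbers as each other.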
The details of PubChem CIDs may be a useful case study ("precedent"
is what I meant to type before not "precedence"). PubChem actually
devised a special "Kekule canonical smiles" which was not an
official OpenEye algorithm, mindful that aromaticity can be
debatable and that aromatic smiles in databases result in import
errors with different tools. Evan B. et al. showed ingenuity as
there can be multiple valid Kekule smiles for an aromatic compound
(the algorithm removes bond order then re-determines so it is
canonical). To my knowledge the rules for PubChem structure
canonicalization have not changed (I will try to verify that), and
that promise, and its consequences, as it may apply to ADIDs, are a
big deal, both in practical and conceptual ways.
One note about possible use cases: providing identifiers to people wishing to store new, private compounds on their private instance of BARD may be a non-issue. I have a hard time believing that any company with enough money to develop an NCE will be willing to risk their investment by loading that information into our system (imagining that somehow their molecular structure might be exposed). Instead they'll use their own well-protected chemical registration system to keep track of their molecules until they go public. Undoubtedly we WILL need to assign IDs to compounds that are introduced to our public BARD system, but I think that realistically these are the only novel chemical identifiers we'll need.
Thanks all for the continued great discussion. I'd like to work towards some concrete items.
Here are the given requirements:
Here are some conclusions I propose based on the discussion:
On Wed, Nov 14, 2012 at 8:23 PM, dllahr notifications@github.com wrote:
- user uniqueness requirement precludes keeping things entirely local and then getting a synonym when publishing to public BARD
While this seems like the best way given the assumptions above, I fear that we end up with the tangle that is gene names and their synonyms.
coordinate the prefixes
I think this is reasonable
Rajarshi Guha | http://blog.rguha.net NIH Center for Advancing Translational Science
Current agreement:
Here are some conclusions I propose based on the discussion:
original statement
According to users, we must have unique, public IDs available for items in the system: assay definitions, experiments, and projects.
This is regardless of whether the item is created in a local instance or a public instance - they must not have the same ID.
example use cases:
Here are 5 possible solutions
Pros and cons of Central:
Pros and cons of Prefix:
Pros and cons of synonyms:
Pros and cons of decentralized: