dcppc / data-stewards

Questions and answers about TOPmed, GTEx, and AGR resources.

MODs Identifier dump #2

Open jmcmurry opened 6 years ago

jmcmurry commented 6 years ago

We would like a dump of all of the identifiers from each MOD in this format. Note that this includes any internal IDs, which may or may not be resolvable. This will require going beyond the information provided so far by the Alliance (the identifier documentation as well as the manifest recently uploaded to the Amazon cloud).

sierra-moxon commented 6 years ago

Hi Julie - It would be super helpful to have (even informal) use cases for these kinds of requests? It would help (at least ZFIN) fulfill the request and possibly initiate a discussion about #6 in this repo.
thanks a bunch! Sierra

owhite commented 6 years ago

I'm gonna reboot this conversation if possible.

Sierra - I think the purpose is for us to simply explore the range of IDs that are hosted at the MODs, for the eventual purpose of elements such as external cross references to other resources, pointers to annotation elements to things like GO ids or EC numbers, and to determine if the name-space/usage of those items is consistent across the MODs. The other purpose - I believe - is to look at the internal content - identifiers linked to organisms, strains, variants, genes for a similar purpose - we are wondering about how the practices for IDs are handled across the groups.

Does that help?

jmcherry-zz commented 6 years ago

Doesn’t make sense to provide internal IDs. They are internal for a reason, we don’t share them because we don’t want users to use them.

There was no response on this because the question didn’t make sense to us.

jmcmurry commented 6 years ago

It sounds to me like the concepts of public and resolvable might be being conflated? Just to break it down...

This task can be scoped to just the IDs that appear in the data dumps. Any ID that is so deeply internal that it doesn't make it even as far as the dumps can be safely ignored for the purpose of this task. However, any ID that DOES make it to the dumps should be described in such a way that the ID is not abused by others using the dumps (e.g. mistaken for durable when it is not, mistaken for resolvable when it is not, mistaken for the same ID when it is different, or mistaken for different when it is the same).

| | Publicly resolvable | Not publicly resolvable |
|---|---|---|
| Appears in data dumps | high value | potentially valuable if durable |
| Doesn't appear in data dumps | not a thing | no one cares |

Does this make sense?
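
As a rough illustration of the grid above, the two properties can be treated as booleans that select one of the four cells. This is only a sketch: the `Triage` enum and the `triage` function are names invented for this example and are not part of any MOD or Monarch tooling.

```python
from enum import Enum


class Triage(Enum):
    """The four cells of the grid, using the wording from the comment."""
    HIGH_VALUE = "high value"
    POTENTIALLY_VALUABLE = "potentially valuable if durable"
    NOT_A_THING = "not a thing"
    NO_ONE_CARES = "no one cares"


def triage(in_dumps: bool, publicly_resolvable: bool) -> Triage:
    """Map (appears in dumps, publicly resolvable) onto the matching cell."""
    if in_dumps:
        if publicly_resolvable:
            return Triage.HIGH_VALUE
        return Triage.POTENTIALLY_VALUABLE
    if publicly_resolvable:
        return Triage.NOT_A_THING
    return Triage.NO_ONE_CARES


if __name__ == "__main__":
    # An ID that is in the dumps but cannot be resolved publicly.
    print(triage(in_dumps=True, publicly_resolvable=False).value)
```

Only the top row of the grid (IDs that appear in the dumps) is in scope for the request as described here.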

jmcherry-zz commented 6 years ago

Julie,

Thanks, that clears up the question. It will now be straightforward to provide you our IDs that are on webpages or in dump files. MOD policies say these IDs should all be resolvable. I'll pass on this refined request.

So the URL for the sheet is easier to find, copying it here again:

https://docs.google.com/spreadsheets/d/1orgx-657PUQE0qBpFRPEbsKDDynaLfke-UxcEVR_pxA/edit#gid=0

jmcmurry commented 6 years ago

For some background on the thorny edge case of public but stable, public but not resolvable, etc. and why we (Monarch) care about them, feel free to look here. The issue is very old and unresolved. Some of the comments are now obsolete / overtaken by events, but the principles are still the same.

jmcmurry commented 6 years ago

A while ago, I also wrote up a summary here of what the identifier surrogacy options are for integrators. Happy to have feedback on it.

khowe commented 6 years ago

@jmcmurry Just to clarify: you would like each MOD to produce a file containing all of their (dump-containing) IDs, in the format described by the spreadsheet. Correct?

Would you like us to host these files so that they can be picked up? Or should we deposit them somewhere central?

jmcmurry commented 6 years ago

Yes; please deposit in the cloud somewhere and send the link; thanks :)

JoelRichardson commented 6 years ago

Just to get started, I created a google doc that is a copy of Julie's. https://docs.google.com/spreadsheets/d/1A5-_doKRTdepELYTZoCVUTVgyAOK2enRaBhZxoNks1w/edit?usp=sharing

JoelRichardson commented 6 years ago

P.S. Yes, it's editable

carlkesselman commented 6 years ago

No, please I would very much like to suggest we use a BDBag for this, not a spreadsheet. We can help you with this.

A BDBag will have a proper manifest and checksums, can be retrieved with tooling, and can contain metadata. We have already seen in the initial instance that using a bag was able to increase the FAIRness of the data exchange.
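
To make the manifest-and-checksum idea concrete: BDBag builds on the BagIt packaging format, in which a manifest file lists one checksum per payload file under `data/`. The sketch below uses only the Python standard library to produce manifest lines of that shape. It is not the BDBag tooling itself, and the function names `manifest_line` and `build_manifest` are invented for this example.

```python
import hashlib
from pathlib import Path


def manifest_line(payload_root: Path, file_path: Path) -> str:
    """Return one BagIt-style manifest entry: '<sha256>  data/<relative path>'."""
    digest = hashlib.sha256(file_path.read_bytes()).hexdigest()
    relative = file_path.relative_to(payload_root).as_posix()
    return f"{digest}  data/{relative}"


def build_manifest(payload_root: Path) -> str:
    """Build the text of a sha256 manifest for every file under payload_root."""
    files = sorted(p for p in payload_root.rglob("*") if p.is_file())
    return "".join(manifest_line(payload_root, p) + "\n" for p in files)


if __name__ == "__main__":
    # Print a manifest for the files in the current directory.
    print(build_manifest(Path(".")), end="")
```

A recipient can recompute the same digests after download and compare them against the manifest to confirm that the identifier dump arrived intact.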

Thanks,

Carl


Dr. Carl Kesselman Dean’s Professor, Epstein Department of Industrial and Systems Engineering Fellow, Information Sciences Institute Viterbi School of Engineering

Professor, Preventive Medicine Keck School of Medicine

University of Southern California 4676 Admiralty Way, Suite 1001, Marina del Rey, CA 90292-6695 Phone: +1 (310) 448-9338 Email: carl@isi.edu Web: http://www.isi.edu/~carl

On Apr 19, 2018, at 7:17 AM, Kevin Howe wrote:

> @jmcmurry Just to clarify: you would like each MOD to produce a file containing all of their (dump-containing) IDs, in the format described by the spreadsheet. Correct?
>
> Where would you like us to host these files so that they can be picked up? Or should we deposit them somewhere central?


JoelRichardson commented 6 years ago

Since I don't know anything about BDBags, I can't comment or assist. In any case...

I'm still confused as to the scope of this request, even with the qualifiers "publicly resolvable" and "in a dump file". Which dump file? Any dump file? ANYthing that's resolvable at MGI by any public identifier (MGI: or external). I'm guessing there would be well over 100 lines for MGI. Or am I missing something?

sierra-moxon commented 6 years ago

Hi @jmcmurry - Building on what @JoelRichardson said, do you have an ID-resolving map for all possible ID prefixes anywhere yet? i.e. it would probably be good if we provided cross references to the same resource with the same prefix, e.g. NCBI_Gene vs. Gene. There are many cases covered neither by the file at GO nor by the Alliance one that we've started, if we include ontological cross references (which many of the MODs store and would fall in this generic request without clarification?) thx again, Sierra
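
One way to picture the prefix problem raised here is a synonym table that maps every spelling of a prefix onto a single agreed form. This is a sketch only: the table contents, the choice of `NCBI_Gene` as the canonical spelling, the sample local ID, and the name `normalize_curie` are all assumptions made for illustration, not an existing registry.

```python
# Hypothetical synonym table. Keys are prefixes as they might appear in
# different files; values are the one spelling chosen as canonical here.
PREFIX_SYNONYMS = {
    "NCBI_Gene": "NCBI_Gene",
    "Gene": "NCBI_Gene",
}


def normalize_curie(curie: str) -> str:
    """Rewrite 'prefix:local_id' so that synonymous prefixes collapse to one."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not local_id:
        raise ValueError(f"not a prefixed identifier: {curie!r}")
    canonical = PREFIX_SYNONYMS.get(prefix)
    if canonical is None:
        # Unknown prefixes are reported rather than guessed at.
        raise KeyError(f"unregistered prefix: {prefix!r}")
    return f"{canonical}:{local_id}"


if __name__ == "__main__":
    # Both spellings end up with the same prefix.
    print(normalize_curie("Gene:12345"))
    print(normalize_curie("NCBI_Gene:12345"))
```

Raising on unregistered prefixes, rather than passing them through, surfaces exactly the uncovered cases this comment is concerned about.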

sierra-moxon commented 6 years ago

@jmcmurry - re: the doc with ZIRC as an example - another option for your document might be to use ZFIN (in this example, possibly other MODs as well) as the id resolver for these biological materials. As I understand it, since ZL#'s represent biological material that can be discontinued, they aren't good ids to use in perpetuity. Many resource centers are like this, as you point out. ZFIN, however, stores the representative content of these materials and could act as a "permanent" resource.

carlkesselman commented 6 years ago

Ok, so this is just a single file with the IRIs for the terms? If we want to include that term list with other data, or the term list is in more than one file, then you will want to have them in a well-defined aggregate. If it is just a single file, then what I would suggest is that we identify someplace to store it (AWS S3?), we mint an identifier for it (we can do that), and use that to reference this dump.

Thanks,

Carl



On Apr 19, 2018, at 8:01 AM, Joel Richardson wrote:

> Since I don't know anything about BDBags, I can't comment or assist. In any case...
>
> I'm still confused as to the scope of this request, even with the qualifiers "publicly resolvable" and "in a dump file". Which dump file? Any dump file? ANYthing that's resolvable at MGI by any public identifier (MGI: or external). I'm guessing there would be well over 100 lines for MGI. Or am I missing something?


sierra-moxon commented 6 years ago

Would you like this file regularly, or is this a one-time survey? Do DOIs, ISSNs, and ORCID iDs count? How about ontology xrefs? Ontology ids? Just ids that the MOD mints itself and are distributed? If a MOD mints an id to provide an internal reference to an ontology or xref, do you want those? The Biolink column is optional (would a SO term id work)? Is this all of our ids, or just a representative sample filled into that spreadsheet? Are there any times already scheduled that we could hash this out on a phone call? thanks a bunch, @jmcmurry @carlkesselman

khowe commented 6 years ago

I (at least) would benefit from some additional guidance on the scope. I will use a specific example: in WormBase, gene records have primary ids of the form WBGene\d+ (e.g. WBGene00006763). These appear in our dumps, and are resolvable (kind of; see below). However, there are a bunch of other identifiers associated with a gene: symbol, systematic name, previous names etc. (e.g. "unc-26", "JC8.10", "CELE_JC8.10"). We conceptually treat these as properties of the gene records, rather than identifiers; and in general, they are not resolvable (although they appear in our dumps, and are searchable; and if the search results in one clear unambiguous hit, a redirect to the entity page results).

Is it correct that for the sake of this exercise, all of these should be considered as "identifiers", and included in the file? In that case, how should non-resolvable ids be represented in the file?
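
The distinction drawn above between primary ids and other names can be sketched as a pattern check, using the `WBGene\d+` form and the example values given in this comment. The function name `is_primary_gene_id` is invented for illustration, and the check says nothing about whether a matching id actually exists in WormBase.

```python
import re

# Pattern taken from the comment: the literal "WBGene" followed by digits.
# re.ASCII restricts \d to the digits 0-9.
WBGENE_ID = re.compile(r"WBGene\d+", re.ASCII)


def is_primary_gene_id(value: str) -> bool:
    """True if the whole string has the shape of a WormBase primary gene id."""
    return WBGENE_ID.fullmatch(value) is not None


if __name__ == "__main__":
    for value in ["WBGene00006763", "unc-26", "JC8.10", "CELE_JC8.10"]:
        kind = "primary id" if is_primary_gene_id(value) else "name or symbol"
        print(f"{value}: {kind}")
```

Using `fullmatch` rather than `search` avoids treating a longer string that merely contains an id as an id itself.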

For WormBase, there is a further complication that our primary entity ids are generally resolvable only on a class-by-class basis. This is due to a decision taken early in the life of the project to re-use identifiers between classes (e.g. there are two objects identified as "JC8.10a", one being a Transcript and the other being the CDS of that transcript).

In our interactions with GO, we have addressed this by treating WormBase as a collection of resources, with each class/ data-type having its own prefix (e.g. WB for genes, WB_REF for publications, WBls for life-stage ontology terms). However, prefixes have only been assigned for data types that pop up in our exchanges with GO. Since this exercise requires us to be comprehensive and consider all data types, we will need to generate more prefixes. Would you advise we do this unilaterally? Or is there a third-party central agency that we should work with to do that?
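
The reuse of local ids between classes, and the way a per-class prefix removes the ambiguity, can be sketched as a lookup keyed on the (prefix, local id) pair. The local id "JC8.10a" and its two classes come from this comment. The prefixes `WB_TRANSCRIPT` and `WB_CDS` are hypothetical placeholders, not registered prefixes, and both function names are invented for the example.

```python
# Two distinct objects share the local id "JC8.10a".
# "WB_TRANSCRIPT" and "WB_CDS" are hypothetical prefixes used for illustration.
RECORDS = {
    ("WB_TRANSCRIPT", "JC8.10a"): "Transcript",
    ("WB_CDS", "JC8.10a"): "CDS",
}


def classes_for_local_id(local_id: str) -> list[str]:
    """All classes that use this bare local id (more than one means ambiguous)."""
    return sorted(cls for (_, lid), cls in RECORDS.items() if lid == local_id)


def class_for_curie(curie: str) -> str:
    """Look up the single class identified by a 'prefix:local_id' string."""
    prefix, _, local_id = curie.partition(":")
    return RECORDS[(prefix, local_id)]


if __name__ == "__main__":
    print(classes_for_local_id("JC8.10a"))     # the bare id matches two classes
    print(class_for_curie("WB_CDS:JC8.10a"))   # the prefixed id matches one
```

Whatever prefixes are eventually chosen, the point of the sketch is that the prefix, not the local id alone, carries the class.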

cmungall commented 6 years ago

I would consider symbols and names as disjoint from identifiers, but MMV.

I suggest that, as the Alliance is already using the GO prefix registry, this be extended for other types too. I can work with KC2 to ensure this propagates to identifiers.org / n2t.net (we're already doing this, e.g. ensuring the TAIR records are in sync).

gabinkley commented 6 years ago

Are the examples of SGD identifiers in this spreadsheet what is requested?

[https://drive.google.com/open?id=1o54ZlW0fkIqOP8gnLtbXsEbapkAsiZ_o]

Just want to be sure I am on the right track before generating an enormous file.

jmcmurry commented 6 years ago

Great question. The relationships themselves (annotation x or y on a gene) are not needed at this point, as that would admittedly be both onerous and noisy. Not only that, but it would be the worst way to retrieve this info :)

High priority:

Extremely low priority / ignore:


** For literature, a pair of IDs per article is fine:

- The native node ID, e.g. https://www.yeastgenome.org/reference/S000207820
- An xref'd equivalent, e.g. http://dx.doi.org/10.1126/sciadv.aaq0236 [one of PMID, DOI, PMCID]

If the dump can couple the native ID and the xref equivalent, great, but it is not absolutely essential. Capturing the relationships is not in scope for this activity.

Not every single data steward is going to have a perfectly complete set of literature mappings to PMID and DOI and PMCID, so these may need to be retrieved on the fly as required by use cases.

gabinkley commented 6 years ago

Thanks for the feedback and clarification. @jmcmurry

jdepons commented 6 years ago

I took a shot at documenting all the different identifiers returned in calls to our API or files on the FTP site. It's currently in a Google doc but is there another location I should submit it?

https://docs.google.com/spreadsheets/d/11VIdKEG2JPDNHmdoeK2AZ8KM2Kg5QuKlvzJ84HvhoEE/edit#gid=0

ctb commented 6 years ago

(a reference here is already better than we're doing usually, so +1 for starting with this :)

sierra-moxon commented 6 years ago

@ctb @jmcmurry - does what @jdepons and @gabinkley provided fulfill this request?

gabinkley commented 6 years ago

I've updated my list of example identifiers and URLs. I removed the links to annotations and added links to external resources that are equivalent to SGD's identifiers that @jmcmurry indicated was desired. Please see updated spreadsheet below:

https://docs.google.com/spreadsheets/d/1FtrS-ATOZdvcE3Bjhv8KYakHZexoL9JzalHIe0TQElE/edit?usp=sharing

A final question that has been asked before, but hasn't been answered directly: is the request for a file of all identifiers for any example in the spreadsheet, or is just the list of examples sufficient right now?

sierra-moxon commented 6 years ago

Just an update on our meeting today with this as a topic: we agreed to make spreadsheets (and post them here) for each of our MODs with representative ids that we mint at the MOD and are publicly available. The content of the spreadsheet (past this general idea) is up to the MOD. Some will have cross references, some will not. If you need something further, @jmcmurry, please let us know.

jmcmurry commented 6 years ago

Thanks Sierra, that sounds like a great start. Cross-references are encouraged but optional provided the xrefs can be derived from the raw data in other ways; if that isn't the case, please just let me know and we can revisit later.