biocaddie / prototype_issues

Used to report and track bioCADDIE prototype issues

Identifier contextualization #291

Open jmcmurry opened 7 years ago

jmcmurry commented 7 years ago

Our curation on the PrefixCommons sub has uncovered several types of datasets whose identifiers are not resolvable at their original source. There is nothing DataMed can do about that except perhaps exhort the original sources to make better design decisions.

However, what is tricky is that DataMed is reusing some of the original (local) identifiers but, as far as I can tell, often without any contextualization or any way to find the thing that is identified. For instance, in the record below, how does one access what this record corresponds to? What is 3120, and at what level of granularity? After 30 minutes of searching the native site I still can't work out who issued that ID or where I can find it. If the access point for most GTEx data is actually dbGaP, it would be best to direct users there and to give the dbGaP ID, for example phs000424.v3.p1. If you're going to use an identifier that is not (uniquely) resolvable, it would be good to have at least a link showing where and how to find it.

Below is a markup of the record in bioCADDIE; it also includes some drive-by observations that are unrelated to identifiers per se.

biocaddie_id_ambiguity

Some thoughts on potential approaches to identifier surrogacy are now here: http://bit.ly/identifiersurrogacy
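To make the request above concrete, here is a minimal sketch of what a "contextualized" identifier could carry alongside the bare local ID: who issued it, what level of granularity it refers to, and where or how to find it. The class name `ContextualizedId` and all of its field names are hypothetical, chosen only for illustration; they are not part of DataMed or any existing schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextualizedId:
    """Hypothetical record: a local identifier plus the context needed to use it."""

    local_id: str                      # the identifier exactly as the source issued it
    issuer: str                        # who issued it (e.g. "dbGaP")
    granularity: str                   # what kind of thing it identifies (e.g. "study")
    how_to_find: Optional[str] = None  # a link or instructions, if any are known

    def display(self) -> str:
        """Render the identifier together with its context for a record page."""
        text = f"{self.issuer}:{self.local_id} ({self.granularity})"
        if self.how_to_find:
            return f"{text}; find it via: {self.how_to_find}"
        # Be explicit when the ID cannot be resolved, rather than showing a bare number.
        return f"{text}; no known access point"


# Illustrative values only; the instruction text is an assumption, not a real resolver.
gtex = ContextualizedId(
    local_id="phs000424.v3.p1",
    issuer="dbGaP",
    granularity="study",
    how_to_find="search dbGaP for this accession",
)
print(gtex.display())
```

The point of the sketch is only that a bare value such as 3120 would not be displayable without an issuer and a granularity attached to it.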

jmcmurry commented 7 years ago

Here's another example: biocaddie_id_ambiguity2

proccaserra commented 7 years ago

@jmcmurry I had a look at the DataMed ingester pipeline for Metabolomics Workbench and there are clearly errors (hard-coded values for the publication elements). As for the identifier, the number shown, '58', seems to correspond to the NIH MW Study_ID, but it should read ST000058, not '58'.
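As a rough sketch of the fix on the ingest side, the bare number could be normalized back to the full study ID before it is displayed. This assumes that MW study IDs are always `ST` followed by a zero-padded six-digit number, which is inferred only from the single example ST000058; the function name `normalize_mw_study_id` is hypothetical and is not taken from the actual ingester.

```python
import re


def normalize_mw_study_id(raw) -> str:
    """Return a study ID in the form ST000058 from inputs such as 58, '58' or 'ST000058'.

    Assumption: the ID is 'ST' plus six zero-padded digits (inferred from ST000058).
    """
    text = str(raw).strip().upper()
    # Accept an optional 'ST' prefix followed by one to six digits, and nothing else.
    match = re.fullmatch(r"(?:ST)?(\d{1,6})", text)
    if match is None:
        raise ValueError(f"not a recognizable study ID: {raw!r}")
    return f"ST{int(match.group(1)):06d}"


print(normalize_mw_study_id(58))
print(normalize_mw_study_id("ST000058"))
```

Rejecting anything that does not match, instead of passing it through, keeps a malformed value from being shown as if it were a valid ID.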

ianfore commented 7 years ago

For GTEx it's not clear what's been indexed: there are 1622 "datasets" indexed from GTEx, and they all have the same title and description. First we need to work out what these represent and what we should consider a dataset for GTEx.
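One quick way to survey this kind of problem across a source is to group the indexed records by title and description and count how many share each pair. The sketch below assumes the records are available as dictionaries with `title` and `description` keys; that layout, the function name, and the sample records are all hypothetical and are not the actual DataMed index format.

```python
from collections import Counter


def count_shared_metadata(records):
    """Map each (title, description) pair used by more than one record to its count."""
    counts = Counter((r.get("title"), r.get("description")) for r in records)
    return {pair: n for pair, n in counts.items() if n > 1}


# Made-up sample records for illustration only.
sample = [
    {"id": "a", "title": "Study", "description": "Same text"},
    {"id": "b", "title": "Study", "description": "Same text"},
    {"id": "c", "title": "Other", "description": "Different text"},
]
print(count_shared_metadata(sample))
```

A source where a single pair accounts for every record, as described for GTEx, would show up as one entry whose count equals the number of indexed records.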

jmcmurry commented 7 years ago

Exactly, DataMed or any similar platform really has to think carefully about:

For instance, the publication above for 58 (i.e. ST000058) is not related to that specific record at all, but to the Metabolomics Workbench as a whole. Thus, while the publication is relevant, it would be better placed in the record for the database than in the record for that specific study.