Open jmcmurry opened 7 years ago
Here's another example:
@jmcmurry I had a look at the DataMed ingester pipeline for Metabolomics Workbench and there are clearly errors (hard-coded values for the publication elements). As for the identifier, the number shown, '58', seems to correspond to the NIH MW Study_ID, but it should read ST000058, not '58'.
For GTEx it's not clear what has been indexed: there are 1622 "datasets" indexed from GTEx, and they all have the same title and description. First we need to work out what these represent and what we should consider a dataset for GTEx.
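To make the identifier point concrete, here is a minimal Python sketch of how an ingester could restore the full study ID from the bare number. The function name `to_mw_study_id` is hypothetical, and the "ST plus six zero-padded digits" shape is an assumption inferred only from the ST000058 example in this thread; it is not taken from the DataMed code or from Metabolomics Workbench documentation.

```python
import re


def to_mw_study_id(local_id, width=6):
    """Return an ID shaped like 'ST000058' from 58, '58' or 'ST000058'.

    Assumption: study IDs are 'ST' followed by `width` zero-padded digits,
    as suggested by the single example ST000058.
    """
    text = str(local_id).strip()
    # Accept an optional 'ST' prefix so already-correct IDs pass through.
    match = re.fullmatch(r"(?:ST)?([0-9]+)", text, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"not a recognisable study number: {local_id!r}")
    digits = match.group(1)
    if len(digits) > width:
        raise ValueError(f"too many digits for width {width}: {local_id!r}")
    return "ST" + digits.zfill(width)


print(to_mw_study_id(58))          # bare integer as shown in the record
print(to_mw_study_id("ST000058"))  # already in the full form
```

Raising on anything unexpected, rather than guessing, keeps a malformed source value from being indexed silently as if it were a valid ID.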
Exactly, DataMed or any similar platform really has to think carefully about:
For instance, the publication shown above for 58 (i.e. ST000058) is not related to the specific record at all, but to the Metabolomics Workbench as a whole. Thus, while the publication is relevant, it would be better placed in the record for the database than in the record for the individual study.
Our curation on the PrefixCommons sub has uncovered several types of datasets whose identifiers are not resolvable at their original source. There is nothing DataMed can do about that except perhaps exhort the original sources to make better design decisions.
However, what is tricky is that DataMed is reusing some of the original (local) identifiers, but as far as I can tell, often without any contextualization or any way to find the thing identified. For instance, in the record below, how does one access what this record corresponds to? What is 3120, and at what level of granularity? After 30 minutes of searching the native site I still can't work out who issued that ID or where I can find it. If the access point for most GTEx data is actually dbGaP, it would be best to direct users there and to give the dbGaP ID, for example phs000424.v3.p1. If you're going to use an identifier that is not (uniquely) resolvable, it would be good to have at least a link showing where and how to find it.
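As a rough illustration of what "contextualization" could look like at ingest time, the Python sketch below checks whether an identifier has the versioned shape of the dbGaP example above and splits it into its parts. The names `DbGapAccession` and `parse_dbgap_accession` are hypothetical, and the pattern (phs, six digits, `.v<version>`, `.p<participant set>`) is an assumption based on the single example phs000424.v3.p1; a bare local number such as 3120 fails the check, which is exactly the case that needs an issuing source and a "where to find it" link attached.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed shape, inferred from the example phs000424.v3.p1 in this thread.
_DBGAP_PATTERN = re.compile(r"(phs[0-9]{6})\.v([0-9]+)\.p([0-9]+)")


@dataclass(frozen=True)
class DbGapAccession:
    study: str            # e.g. 'phs000424'
    version: int          # the number after '.v'
    participant_set: int  # the number after '.p'


def parse_dbgap_accession(identifier: str) -> Optional[DbGapAccession]:
    """Return the parsed accession, or None if the shape does not match."""
    match = _DBGAP_PATTERN.fullmatch(identifier.strip())
    if match is None:
        return None
    study, version, participant_set = match.groups()
    return DbGapAccession(study, int(version), int(participant_set))


for candidate in ("phs000424.v3.p1", "3120"):
    parsed = parse_dbgap_accession(candidate)
    if parsed is None:
        # A context-free local ID: flag it so a curator can add the
        # issuing source and an access link before it is displayed.
        print(f"{candidate}: no recognisable namespace, needs context")
    else:
        print(f"{candidate}: {parsed}")
```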
Below is a markup of the record in bioCADDIE; it also includes some drive-by observations unrelated to identifiers per se.
Some thoughts on potential approaches to identifier surrogacy are now here: http://bit.ly/identifiersurrogacy