dcppc / data-stewards

Questions and answers about TOPMed, GTEx, and AGR resources.

Team Oxygen - ETL of GTEx/TOPMed into DATS. #12

Open owhite opened 6 years ago

owhite commented 6 years ago

Initiating a thread for the DATS files generated by Team Oxygen for GTEx and TOPMed data. The files are to be converted to BDBags and hosted at a location that is still to be determined.

aegururaj commented 6 years ago

Link to the DATS model: https://github.com/biocaddie/WG3-MetadataSpecifications

Latest version of DATS: DATS v2.2

Test for DATS compliance using the test script: https://github.com/biocaddie/WG3-MetadataSpecifications/blob/master/tests/test_dats_model.py

aegururaj commented 6 years ago

Two TOPMed datasets, built from the publicly available dbGaP metadata and mapped to DATS v2.2, are provided. Please note that neither the metadata nor the model is complete yet! Limitations: 1) the metadata come from the publicly accessible dbGaP web site; 2) study variables have not yet been mapped to dimensions; 3) data harmonization (mapping to standard vocabulary concepts) is limited.

Our plan includes automating the ingestion process and working on a more complete mapping of the metadata to DATS.

aegururaj commented 6 years ago

Apologies... GitHub won't let me upload JSON in the issue section. Any workarounds? In the interim, here are links to the files: 1) dats_phs001143 2) dats_phs000954

rpwagner commented 6 years ago

@aegururaj, similar to #2, shouldn't this metadata be incorporated into a BDBag and assigned an identifier? @carlkesselman @ianfoster?

aegururaj commented 6 years ago

@rpwagner Sure, once the model is decided on and finalized. The intent here is to be able to just look at some sample metadata, as I understand. @ianfoster @carlkesselman ?

mikedarcy commented 6 years ago

For laughs, I created a couple of minids with the minid CLI for the JSON files listed above. It's really easy to do so, and is perfectly fine for intermediate versions of files that might become obsolete. In fact, that is one of the features of minids, i.e., you've got a way to create a provenance chain so even if someone references an old minid, there should be a redirect reference to a newer version (if it exists).

Here are the exact commands I used. (Note: to make this work I used the "Shareable Link" for each file, not the web page URL of the Google Drive folder.)

minid --register --title "DATS formatted metadata for dbGAP study phs000954.v1.p1" dats_phs000954.json --locations https://drive.google.com/open?id=1RFqR-b8iNRa_V8CuESWA_hw5st4P3fqG
minid --register --title "DATS formatted metadata for dbGAP study phs001143.v1.p1" dats_phs001143.json --locations https://drive.google.com/open?id=13tO-mnLCixyF_EXdNnArlTvj3xTchziZ

The results are minid:b94t3q and minid:b91113, respectively.
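To illustrate the provenance chain idea mentioned above, here is a minimal sketch of how a client might follow "obsoleted by" links until it reaches the current version. It is an assumption-laden toy, not the minid service API: the `records` mapping is a hypothetical in-memory stand-in for the resolver, `resolve_latest` is a made-up helper name, and the identifiers are placeholders rather than real minids.

```python
from typing import Dict, List, Optional


def resolve_latest(minid: str, records: Dict[str, Optional[str]]) -> List[str]:
    """Follow obsoleted-by links from `minid` and return every identifier visited.

    `records` maps an identifier to the identifier that obsoletes it, or to
    None when nothing newer exists. Hypothetical stand-in for a real resolver.
    """
    chain = [minid]
    seen = {minid}
    while True:
        successor = records.get(chain[-1])
        if successor is None:
            # No newer version is recorded, so the last entry is current.
            return chain
        if successor in seen:
            # Guard against a malformed chain that loops back on itself.
            raise ValueError(f"cycle in obsoleted-by chain at {successor}")
        chain.append(successor)
        seen.add(successor)


# Placeholder identifiers, invented purely for illustration.
records = {
    "minid:EXAMPLE-v1": "minid:EXAMPLE-v2",
    "minid:EXAMPLE-v2": "minid:EXAMPLE-v3",
    "minid:EXAMPLE-v3": None,
}

chain = resolve_latest("minid:EXAMPLE-v1", records)
print(" -> ".join(chain))
```

The last element of the returned chain is the version a reader should use; someone holding an old identifier still lands on the newest file.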

I encourage anyone who is interested to give the minid CLI program a try. It's simple to install if you already have Python (and pip) installed on your system. Just follow the guide here. Make sure to perform the initial user registration step, and then I highly recommend adding your user information to ~/.minid/minid-config.cfg so that you do not have to specify the same arguments on the command line each time.
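As a rough illustration of why a config file saves typing, the sketch below reads user defaults from an INI-style file with Python's standard `configparser`. The section name `general`, the key names, and the helper `load_user_defaults` are assumptions made for this example; check the minid guide for the layout the CLI really expects. The sample values are placeholders, and the file is written to a temporary directory rather than to ~/.minid.

```python
import configparser
import tempfile
from pathlib import Path

# ASSUMPTION: illustrative section and key names, not the confirmed minid layout.
SAMPLE = """\
[general]
username: Jane Doe
email: jane@example.org
orcid: 0000-0000-0000-0000
"""


def load_user_defaults(path):
    """Return the [general] section as a dict, or {} if the file or section is missing."""
    parser = configparser.ConfigParser()
    files_read = parser.read(path)  # missing files are skipped silently
    if not files_read or not parser.has_section("general"):
        return {}
    return dict(parser["general"])


with tempfile.TemporaryDirectory() as tmp:
    cfg_path = Path(tmp) / "minid-config.cfg"
    cfg_path.write_text(SAMPLE)
    defaults = load_user_defaults(cfg_path)

print(defaults["email"])
```

A tool that loads defaults this way only needs command-line arguments for values that differ from the stored ones.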

bheavner commented 6 years ago

And for more grins, I've been working on an R minid tool library. The dev version can do minid lookups now, which for the minids that @mikedarcy just made look like this in an R session:

> devtools::load_all()
Loading minidtools
> config <- load_configuration()
> lookup("minid:b94t3q")
MINID:
  identifier = ark:/57799/b94t3q
  short_identifier = minid:b94t3q
  creator = mdarcy
  orcid = 0000-0003-2280-917X
  created = Fri, 20 Apr 2018 22:22:58 GMT
  checksum = fe1d7fc641ae2befae2b7c2a989019553b22e21cdda7b9d6054617921b821613
  checksum_function = SHA256
  status = ACTIVE
  locations = https://drive.google.com/open?id=1RFqR-b8iNRa_V8CuESWA_hw5st4P3fqG
    (use locations(object) for more)
  titles = DATS formatted metadata for dbGAP study phs000954.v1.p1
    (use titles(object) for more)
  obsoleted_by =  
    (use obsoleted_by(object) for more)
  content_key = 
> lookup("minid:b91113")
MINID:
  identifier = ark:/57799/b91113
  short_identifier = minid:b91113
  creator = mdarcy
  orcid = 0000-0003-2280-917X
  created = Fri, 20 Apr 2018 22:24:05 GMT
  checksum = 5a3581ebe1257a85a747d6f6af647e8c38d24867085152ed7a97ed2a45e31d47
  checksum_function = SHA256
  status = ACTIVE
  locations = https://drive.google.com/open?id=13tO-mnLCixyF_EXdNnArlTvj3xTchziZ
    (use locations(object) for more)
  titles = DATS formatted metadata for dbGAP study phs001143.v1.p1
    (use titles(object) for more)
  obsoleted_by =  
    (use obsoleted_by(object) for more)
  content_key = 

(Note that this tool is much less mature than the Python CLI program.)
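Since each record above carries a SHA256 checksum, a downloaded copy can be checked against it. Here is a minimal Python sketch using only the standard library; the helper names `sha256_of_file` and `matches_record` are made up for this example and are not part of either minid tool.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_record(path, recorded_checksum):
    """Compare a local file's digest with the checksum string from a record."""
    return sha256_of_file(path) == recorded_checksum.strip().lower()
```

For example, after downloading dats_phs000954.json, calling `matches_record("dats_phs000954.json", "fe1d7fc641ae2befae2b7c2a989019553b22e21cdda7b9d6054617921b821613")` should return `True` if the local copy is byte-identical to the file that was registered.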