DataLad - Githubissues

mgxd commented 8 years ago

Goal: establish pipelines for representing NDA data as DataLad collections to unify/ease access to it. Also consider support of “publishing” datalad’s collections into NDA

DataLad is supported by NSF ( 1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411)

Participants: Yaroslav Halchenko and Ben Poldrack to be done: analysis of storage and access to NDA data and meta-data, establishing pipelines for continuous monitoring and representation of data in NDA as git-annex/datalad handles, collating those into datalad collections. Collaborate with “export to RDF” team, to make use of that information within DataLad’s handles

“to be done” can probably be achieved within 3 day sprint

established pipelines will be maintained later in the pool of the established DataLad’s pipelines

obenshaindw commented 8 years ago

The NDA team are proposing some participants for each issue along with a list of available resources (internal and external), and possible limitations / needs.

NDA Point Person(s)

Dan Hall @danhall100

Available Resources

Prototypical life cycle of a hypothetical DataLad data distribution
NDA is currently in the requirements gathering stage for developing additional web services (authenticated and unauthenticated), which might facilitate this sort of effort (e.g. meta data by subject/age).
NDA is looking to employ incommon.org authentication for data access.
Authenticated Web Service for S3 Tokens

Comments / Limitations / Needs

Need to know what metadata DataLad crawler would need, and how it would access it.
Could users create their own DataLad repo from an NDA package, like we do for miNDAR, which we could provide the machinery for?
How do pull/push requests work in a scheme where data files live in cloud object storage?
The NIH is moving away from download copy of imaging/omics. Does DataLad have plans to support an access/expunge model?

yarikoptic commented 8 years ago

what is the optimal way for me at Dartmouth to seek access to NDA for the purpose of this sprint/project?
do S3 buckets used by NDA have versioning enabled or where could I read more on the structure of NDA in particular concerning files/datasets versioning?

danhall100 commented 8 years ago

I sent you an email on the Dartmouth Signing Officials that can approve access. Unfortunately, we'll need the agreement signed by them for you to access research level data. I opened up Issue #11 providing links to our paper and the form. If there's interest, we'd be happy to have a quick call to review how to get access and how the NDA is organized.

obenshaindw commented 8 years ago

do S3 buckets used by NDA have versioning enabled or where could I read more on the structure of NDA in particular concerning files/datasets versioning?

@yarikoptic in reply to your question, we do not currently have versioning enabled on our buckets. We have discussed doing this in the past, but we have never acted on it. Regarding the structure of NDA files/datasets, would it be helpful to provide a high level model? Would this project's Wiki be the appropriate place for such a thing?

yarikoptic commented 8 years ago

versioning -- sad. How do you version your data then (unique filenames?) if at all? if you need an example of a bucket with versioning enabled (not from the beginning but from some later point in time), look at the openfmri one
an overview of the structure would be great! especially in how it relates to layout on S3 buckets, especially if versioning is somehow available. I can digest anything (pdf, html, wiki) to start with, and later we will talk about optimal way to fetch current state of datasets/files via some API or web scraping

satra commented 8 years ago

@yarikoptic and @obenshaindw this project could be coupled with efforts in #7 as well.

danhall100 commented 8 years ago

We don't have a requirement for versioning. Perhaps one will be provided, but for us, we're only interested in raw objects off the machine and analyzed results. Anything in between is up to labs/computational scientists. Look at https://ndar.nih.gov/ndar_data_dictionary.html?short_name=image03 to understand how we associate objects with data. For clinical and imaging data, should we get a new version, we simply archive off the old and release the object. The exception is when an old object/record is used in a study, then we'll keep both but aggregation only accesses the most recent. So that everyone has an understanding of how data is maintained in the archive, we'll provide a couple of webinars prior to June 13 so that everyone has an understanding on how the system is architected. Stay tuned for that. Also, I'm working an SOP at the NIH to provide informatics access for a month.

satra commented 8 years ago

@danhall100 - informatics access would be nice for people who don't have current access.

as part of the webinar, it would be nice to have a sense of how data are updated (added/changed/fixed) in the db, and how are people notified of such updates.

yarikoptic commented 8 years ago

something to think about only ... raw objects off the machine and analyzed results: results (i.e. derived data) can and do change, happen analysis was adjusted or fixed (let me know if you need some prominent examples ;) ). If someone carries out an analysis based on the derived data, change of the derived data would render those results "unreproducible" if original (derived) data is no longer accessible. Depending on how "archive off" is achieved, and either that archive is accessible anyhow, it might still be possible to get a hold of that old data if necessary to reproduce initial results or replicate the study. And now I could hint that "raw objects" are just as raw as proprietary acquisition platform provides (and you know -- those come out buggy at times as well). I guess those might as well be NIfTI files converted from DICOMs... if you ask -- what could possibly go wrong requiring new versions of those "raw" files -- have a look e.g. at https://openfmri.org/dataset-orientation-issues/ and recent publication http://www.ncbi.nlm.nih.gov/pubmed/26945974 outlining the challenges with such conversions. So even if not 'required' it would be really nice if the system has "supported" versioning of the files at least at the storage level and access to any version of a given file. With versioned S3 buckets it becomes somewhat more straightforward, even if the "system" on your end would still be ignorant of the versioning information -- I could then get specific versions of a given file. See e.g.:

$> datalad ls -a s3://openfmri/tarballs/ds001_release_history.txt
...
tarballs/ds001_release_history.txt 2016-02-18T20:36:45.000Z ver:iFSz.z5.pM0qah9JEr7nNzTFL0RURJDQ  ... http://openfmri.s3.amazonaws.com/tarballs/ds001_release_history.txt?versionId=iFSz.z5.pM0qah9JEr7nNzTFL0RURJDQ [OK]
tarballs/ds001_release_history.txt 2014-05-31T02:15:32.000Z ver:null                              ... http://openfmri.s3.amazonaws.com/tarballs/ds001_release_history.txt?versionId=null [OK]

which shows two versions of the ds001_release_history.txt file where initial one (versionId=null) was created before versioning was enabled for the bucket.

yarikoptic commented 8 years ago

@danhall100 I wondered how I could now quickly browse some dataset to e.g. figure out s3:// urls. I have introduced basic support of authenticating via https://github.com/NDAR/nda_aws_token_generator into DataLad (https://github.com/datalad/datalad/pull/527) so theoretically should be able to fetch load from NDA's buckets

danhall100 commented 8 years ago

We gave a webinar on Wednesday that we'll repeat on Monday. In the meantime, the cloud tutorials should provide good background. In summary: 1) Create a package for one of the imaging collections say : https://ndar.nih.gov/edit_collection.html?id=2026 2) Choose Download 3) Create a Package for download - or create a database snapshot - and you'll get the file references to S3 within our image03 table, MD5 values and listing of the files we have in S3.
4) Review these tutorials for specifics. https://ndar.nih.gov/cloud_overview.html

yarikoptic commented 8 years ago

1) Create a package for one of the imaging collections say : https://ndar.nih.gov/edit_collection.html?id=2026 2) Choose Download after clicking Download it just leads to login window with entry for my credentials (although I am already logged in) or "Request Account" (I thought I already have one) button: http://www.onerussian.com/tmp/gkrellShoot_06-10-16_090940.png and if I click "login", brings me back to original page. Browser Firefox 45.1.0.
Rinse/repeated already about 5 times. Please advise.

danhall100 commented 8 years ago

We have a permissions issue here which we'll resolve. Instead. Select the download button from either https://ndar.nih.gov/data_from_labs.html or https://ndar.nih.gov/ndar_data_dictionary.html?type=Imaging&source=All&category=All

yarikoptic commented 8 years ago

As a summary:

Staging changes (need unittesting etc) for DataLad are in https://github.com/datalad/datalad/pull/527 . They provide

Access to NDAR buckets with S3 credentials being automagically generated for users based on their user/password credentials
Basic pipeline for construction of DataLad datasets (git-annexes) for NDA collections Sample DataLad datasets were created (nda* ones under http://datasets.datalad.org/test) and basic functionality was demonstrated

With a little effort I can generate a complete DataLad collection for NDA, but would need extended duration of access to the NDA and streamlined way to generate a complete collection of tables for collections to be considered. We have attempted to create miNDAR package for all subjects having imaging data, but tech difficulties precluded it.

BANDA-connect / NDA-sprint

DataLad #3

NDA Point Person(s)

Available Resources

Comments / Limitations / Needs