mgxd opened this issue 8 years ago
The NDA team is proposing participants for each issue, along with a list of available resources (internal and external) and possible limitations/needs.
Dan Hall @danhall100
I sent you an email on the Dartmouth Signing Officials who can approve access. Unfortunately, we'll need the agreement signed by them for you to access research-level data. I opened Issue #11 providing links to our paper and the form. If there's interest, we'd be happy to have a quick call to review how to get access and how the NDA is organized.
Do the S3 buckets used by NDA have versioning enabled? And where could I read more about the structure of NDA, in particular concerning file/dataset versioning?
@yarikoptic in reply to your question, we do not currently have versioning enabled on our buckets. We have discussed doing this in the past, but we have never acted on it. Regarding the structure of NDA files/datasets, would it be helpful to provide a high-level model? Would this project's Wiki be the appropriate place for such a thing?
@yarikoptic and @obenshaindw: this project could be coupled with efforts in #7 (openfmri) as well.
We don't have a requirement for versioning. Perhaps one will be provided, but for us, we're only interested in raw objects off the machine and in analyzed results; anything in between is up to labs/computational scientists. Look at https://ndar.nih.gov/ndar_data_dictionary.html?short_name=image03 to understand how we associate objects with data. For clinical and imaging data, should we get a new version, we simply archive off the old and release the new object. The exception is when an old object/record is used in a study: then we keep both, but aggregation only accesses the most recent. So that everyone has an understanding of how data is maintained in the archive and how the system is architected, we'll provide a couple of webinars prior to June 13. Stay tuned for that. Also, I'm working on an SOP at the NIH to provide informatics access for a month.
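The "we keep both, but aggregation only accesses the most recent" rule above can be sketched as a small selection over submitted records. This is an illustrative model only: the field names (`record_id`, `submitted`, `path`) and the data are made up, not NDA's actual image03 schema.

```python
from datetime import date

# Hypothetical submissions: two versions of one record, one of another.
# (Illustrative data, not NDA's schema.)
records = [
    {"record_id": "IMG001", "submitted": date(2015, 3, 1),
     "path": "s3://bucket/v1/img001.nii.gz"},
    {"record_id": "IMG001", "submitted": date(2016, 1, 5),
     "path": "s3://bucket/v2/img001.nii.gz"},
    {"record_id": "IMG002", "submitted": date(2015, 6, 9),
     "path": "s3://bucket/v1/img002.nii.gz"},
]

def latest_per_record(records):
    """Mimic 'aggregation only accesses the most recent': keep only the
    newest submission for each record_id; older ones remain archived."""
    latest = {}
    for rec in records:
        rid = rec["record_id"]
        if rid not in latest or rec["submitted"] > latest[rid]["submitted"]:
            latest[rid] = rec
    return latest

current = latest_per_record(records)
```

Under this model, `current` holds one entry per record id, always the newest submission; the archived versions would only matter for a study that explicitly pinned the old object.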
@danhall100 - informatics access would be nice for people who don't have current access.
As part of the webinar, it would be nice to get a sense of how data are updated (added/changed/fixed) in the DB, and how people are notified of such updates.
Something to think about, regarding "raw objects off the machine and analyzed results":
Results (i.e. derived data) can and do change, e.g. when an analysis was adjusted or fixed (let me know if you need some prominent examples ;)). If someone carries out an analysis based on the derived data, a change of the derived data would render those results "unreproducible" if the original (derived) data is no longer accessible. Depending on how the "archive off" is achieved, and on whether that archive is accessible at all, it might still be possible to get hold of the old data if necessary to reproduce initial results or replicate the study. And I could hint that "raw objects" are only as raw as the proprietary acquisition platform provides (and you know -- those come out buggy at times as well). I guess those might as well be NIfTI files converted from DICOMs... If you ask what could possibly go wrong to require new versions of those "raw" files, have a look e.g. at https://openfmri.org/dataset-orientation-issues/ and the recent publication http://www.ncbi.nlm.nih.gov/pubmed/26945974 outlining the challenges with such conversions.
So even if not "required", it would be really nice if the system supported versioning of the files, at least at the storage level, with access to any version of a given file. With versioned S3 buckets this becomes somewhat straightforward: even if the "system" on your end remained ignorant of the versioning information, I could then fetch specific versions of a given file. See e.g.:
$> datalad ls -a s3://openfmri/tarballs/ds001_release_history.txt
...
tarballs/ds001_release_history.txt 2016-02-18T20:36:45.000Z ver:iFSz.z5.pM0qah9JEr7nNzTFL0RURJDQ ... http://openfmri.s3.amazonaws.com/tarballs/ds001_release_history.txt?versionId=iFSz.z5.pM0qah9JEr7nNzTFL0RURJDQ [OK]
tarballs/ds001_release_history.txt 2014-05-31T02:15:32.000Z ver:null ... http://openfmri.s3.amazonaws.com/tarballs/ds001_release_history.txt?versionId=null [OK]
which shows two versions of the ds001_release_history.txt file, where the initial one (versionId=null) was created before versioning was enabled for the bucket.
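The per-version URLs in the listing above follow a fixed pattern: the object's plain HTTP address plus a `versionId` query parameter. A minimal sketch of constructing such URLs (the bucket/key/version values are taken from the listing above; the helper function itself is mine, not a DataLad API):

```python
from urllib.parse import quote, urlencode

def versioned_s3_url(bucket, key, version_id=None):
    """Build an HTTP URL for one specific version of an S3 object,
    matching the ?versionId=... form shown in the datalad ls output."""
    url = "http://{}.s3.amazonaws.com/{}".format(bucket, quote(key))
    if version_id is not None:
        url += "?" + urlencode({"versionId": version_id})
    return url

url = versioned_s3_url(
    "openfmri",
    "tarballs/ds001_release_history.txt",
    "iFSz.z5.pM0qah9JEr7nNzTFL0RURJDQ",
)
```

With no `version_id`, the same helper yields the unversioned URL, which S3 resolves to the latest version of the object.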
@danhall100 I wondered how I could now quickly browse some dataset to e.g. figure out s3:// URLs. I have introduced basic support for authenticating via https://github.com/NDAR/nda_aws_token_generator into DataLad (https://github.com/datalad/datalad/pull/527), so theoretically I should be able to fetch data from NDA's buckets.
We gave a webinar on Wednesday that we'll repeat on Monday. In the meantime, the cloud tutorials should provide good background. In summary:
1) Create a package for one of the imaging collections, say: https://ndar.nih.gov/edit_collection.html?id=2026
2) Choose Download
3) Create a Package for download - or create a database snapshot - and you'll get the file references to S3 within our image03 table, MD5 values and listing of the files we have in S3.
4) Review these tutorials for specifics. https://ndar.nih.gov/cloud_overview.html
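Since the download package comes with MD5 values alongside the S3 file references (step 3), a fetched copy can be checked against them locally. A minimal sketch, assuming the listing has been reduced to a mapping of relative path to expected MD5 (the real values would come from the image03 table or package manifest; the function names are mine):

```python
import hashlib
import os

def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a local file, reading in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_downloads(manifest, root="."):
    """manifest: {relative_path: expected_md5}. Returns {path: bool}
    marking which downloaded files match their expected checksum."""
    return {rel: md5sum(os.path.join(root, rel)) == expected
            for rel, expected in manifest.items()}
```

Chunked reading keeps memory flat even for multi-gigabyte imaging files, which matters for whole-collection verification.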
Following those steps (create a package for one of the imaging collections, say https://ndar.nih.gov/edit_collection.html?id=2026, then choose Download): after clicking Download, it just leads to a login window with an entry for my credentials (although I am already logged in) or a "Request Account" button (I thought I already had an account): http://www.onerussian.com/tmp/gkrellShoot_06-10-16_090940.png. If I click "login", it brings me back to the original page. Browser: Firefox 45.1.0.
I have rinsed/repeated about 5 times already. Please advise.
We have a permissions issue here which we'll resolve. Instead, select the download button from either https://ndar.nih.gov/data_from_labs.html or https://ndar.nih.gov/ndar_data_dictionary.html?type=Imaging&source=All&category=All
As a summary:
Staging changes (still needing unit tests, etc.) for DataLad are in https://github.com/datalad/datalad/pull/527. They provide basic support for authenticating to NDA's S3 buckets via nda_aws_token_generator.
With a little effort I can generate a complete DataLad collection for NDA, but I would need an extended duration of access to the NDA and a streamlined way to generate a complete collection of tables for the collections to be considered. We have attempted to create a miNDAR package for all subjects having imaging data, but technical difficulties precluded it.
Goal: establish pipelines for representing NDA data as DataLad collections, to unify/ease access to it. Also consider supporting "publishing" of DataLad collections into NDA.
DataLad is supported by NSF (1429999) and the German Federal Ministry of Education and Research (BMBF 01GQ1411).
Participants: Yaroslav Halchenko and Ben Poldrack.
To be done: analysis of storage and access to NDA data and metadata; establishing pipelines for continuous monitoring and representation of NDA data as git-annex/DataLad handles; collating those into DataLad collections; collaborating with the "export to RDF" team to make use of that information within DataLad's handles.
The "to be done" items can probably be achieved within a 3-day sprint.
The established pipelines will later be maintained within the pool of DataLad's established pipelines.