Use case: external scientist without data access needs results

mih commented 5 years ago

This is a common problem for any data analysis involving personal information. Approach:

Build a dataset that implements the same structure (organization, and filenames if possible) but does not contain the actual problematic data. The files may still be tracked but not available through the annex, or the dataset may have no relationship to the actual data at all, i.e. mock data or simulated data.
Provide dataset publicly to aid development of analysis implementations
Clearly describe how this mock dataset differs from the inaccessible real dataset
External users are instructed to create a new dataset (to hold their code) that has the mock dataset as a subdataset
External users submit their dataset. The mock subdataset is replaced with the real dataset (the actual version is tracked), the code is reviewed and then executed, and the results are captured in the submitted dataset.
Results are pushed back to the external users (or deposited in an accessible place for them to pull) -- the local data remains local and unavailable
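
The first step above (a mock that mirrors the real layout without carrying any real content) could be sketched roughly as follows. This is only an illustration using the Python standard library, not DataLad's own tooling: the function name `build_mock_tree`, the `MOCK` placeholder content, and the example paths are all hypothetical, and skipping `.git` is an assumption about what should not be mirrored.

```python
import tempfile
from pathlib import Path


def build_mock_tree(real_root: Path, mock_root: Path,
                    placeholder: bytes = b"MOCK\n") -> list[str]:
    """Mirror directory layout and filenames of real_root under mock_root.

    File contents are never read or copied: every mock file holds only
    the placeholder bytes. Returns the relative paths of the mock files.
    """
    created = []
    for path in sorted(real_root.rglob("*")):
        relative = path.relative_to(real_root)
        # Assumption: version-control internals are not part of the mock.
        if ".git" in relative.parts:
            continue
        target = mock_root / relative
        if path.is_dir():
            target.mkdir(parents=True, exist_ok=True)
        else:
            # Also reached for dangling symlinks, i.e. annexed files
            # whose content is not locally available.
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(placeholder)
            created.append(relative.as_posix())
    return created


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical "real" dataset with one sensitive file.
        real = Path(tmp) / "real"
        (real / "sub-01" / "anat").mkdir(parents=True)
        (real / "sub-01" / "anat" / "sub-01_T1w.nii.gz").write_bytes(b"sensitive")

        mock = Path(tmp) / "mock"
        for name in build_mock_tree(real, mock):
            print(name)
```

The resulting directory would still need to be turned into a dataset and published, and a realistic mock would probably hold simulated data in the correct file format rather than a fixed placeholder, so that external analysis code can actually run against it.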

adswa commented 4 years ago

note to self from mih's talk today: "bring the computation to the data"

adswa commented 4 years ago

I already started a draft of this a while ago in #235. Let's actually do this with the studyforrest data at the start of 2020 :)
