ME-ICA / multi-echo-data-analysis

Still a work in progress.
https://me-ica.github.io/multi-echo-data-analysis/
GNU Lesser General Public License v2.1

DataLad for multi-echo data access #13

Open jsheunis opened 2 years ago

jsheunis commented 2 years ago

What do you think about using DataLad to streamline data access for publicly available ME datasets? It looks like all of the datasets used in the book that don't require a data use agreement are on OpenNeuro, i.e. they are already DataLad datasets. It will be easy to include those as subdatasets into a multi-echo "super dataset" that people can clone and then download individual subdatasets or files selectively.

Of course, we don't have to make DataLad a requirement for people working with the book's tutorials, so this could also just be an alternative for those who have datalad installed.

Additionally, if some tutorials can be run on Binder, we have this ready-made config for running datalad on binder: https://github.com/datalad/datalad-binder

tsalo commented 2 years ago

The problem with the existing OpenNeuro datasets is that most don't have the echo-wise preprocessed data we need for our examples. We thought of just fMRIPrepping the open datasets ourselves and uploading the derivatives to OpenNeuro in separate "datasets" linking to the original ones, but OpenNeuro doesn't currently support uploading derivatives-only datasets (see https://github.com/OpenNeuroOrg/openneuro/issues/2436), so I don't know if we can directly use OpenNeuro for most of our planned examples. Currently, we're looking at uploading fMRIPrep derivatives to the OSF and using a fetcher to grab them from there. Is there a storage alternative that would be more compatible with DataLad?

tsalo commented 2 years ago

Chris actually mentioned G-Node in that issue, which I had forgotten. Would that be a good alternative?

I think we looked at it but decided against it for tedana's datasets module (see https://github.com/ME-ICA/tedana/issues/684) because it would require a new dependency and no one was familiar with it.

jsheunis commented 2 years ago

Yup, GIN is a good option for free public hosting of data (up to a number of terabytes per account/repo, IIRC). And it works well with standard DataLad functionality. See here for a walkthrough of how to publish/connect a DataLad dataset to GIN: https://handbook.datalad.org/en/latest/basics/101-139-gin.html
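
For reference, a minimal sketch of what that publishing step could look like with a recent DataLad (assuming a GIN account with an access token or SSH key already configured; the repository name below is just a placeholder, and the handbook link above is the authoritative walkthrough):

# run inside an existing datalad dataset; repo name is a placeholder
datalad create-sibling-gin my-derivatives-dataset -s gin   # create the GIN repo and register it as a sibling named "gin"
datalad push --to gin                                      # push git history plus annexed file content to GIN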

DataLad also has an extension for integrating with OSF, http://docs.datalad.org/projects/osf/en/latest/, so that's also a possibility.
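
As a rough sketch of that route (the datalad-osf extension adds an osf:// URL scheme; the project ID below is a placeholder):

pip install datalad-osf            # provides the osf:// URL scheme and the OSF special remote
datalad clone osf://<project-id>   # clone an OSF project as a regular datalad dataset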

I guess it depends on which dependencies are fine to include (if any at all) for which packages (tedana as a whole, vs. only the jupyter book). Looking at https://github.com/ME-ICA/tedana/issues/684, DataLad can do all of that quite well, although I can understand the hesitation to include new dependencies (for DataLad: mainly datalad, git, and git-annex) versus building a lightweight module that does something specific with well-defined boundary conditions.

Either way, if DataLad is an alternative for getting data used in the book, I can see the superdataset having a structure like this:

public-multi-echo-data
├── raw
│   ├── ds1
│   ├── ds2
│   ...
│   └── dsN
├── derivatives
│   ├── ds1_deriv
│   ├── ds2_deriv
│   ...
│   └── dsN_deriv
├── ...
└── README

where all raw or derivative datasets would essentially be git submodules that link to the respective datasets, which are in turn hosted either on OpenNeuro (i.e. the raw datasets) or, for example, on GIN (i.e. the derivative datasets). Having all of these structured as a hierarchy of nested datalad datasets makes it very easy for datalad to give users access to any specific (sub)datasets and/or files.
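
For illustration, assembling such a superdataset could look roughly like this (dataset names and URLs below are placeholders; datalad clone -d . registers each clone as a subdataset of the superdataset and saves that link):

datalad create public-multi-echo-data
cd public-multi-echo-data
# register a raw OpenNeuro dataset as a subdataset (placeholder URL)
datalad clone -d . https://github.com/OpenNeuroDatasets/dsXXXXXX raw/ds1
# register a derivatives dataset hosted e.g. on GIN (placeholder URL)
datalad clone -d . https://gin.g-node.org/<org>/ds1_deriv derivatives/ds1_deriv

Cloning the superdataset later only pulls these lightweight references; subdatasets and file content are fetched on demand with datalad get.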

jsheunis commented 2 years ago

Here's v1 of the super-dataset, currently containing only raw subdatasets that are hosted on OpenNeuro: https://github.com/jsheunis/multi-echo-super

jsheunis commented 2 years ago

The multi-echo-super dataset now has all open multi-echo datasets from OpenNeuro included (as far as I'm aware), as well as the fMRIPrep-processed data of the Multi-echo Cambridge dataset that's on OSF (see this comment).

@notZaki, did you use the OSF API to get file paths and URLs in order to build the manifest.json file? If so, do you still have a script lying around? The manifest file was very useful for creating a datalad dataset linking to the file storage on OSF. I want to do the same for the masking test dataset on OSF, which doesn't currently have a manifest.

notZaki commented 2 years ago

@jsheunis Here's a link to the manifest file for the masking test dataset: manifest.json (might not last forever)

I made this Julia package to generate the JSON file. There is an example in the README showing how to produce such files. Alternatively, the osfclient package for Python might also be able to do something similar, but I haven't used it.
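
If it helps, a minimal osfclient sketch would presumably look something like this (untested on my end; the project ID and file path are placeholders):

pip install osfclient
osf -p <project-id> ls                        # list files in the project's storage
osf -p <project-id> fetch path/to/file.nii.gz # download a single file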

jsheunis commented 2 years ago

Oh, that's perfect, thanks @notZaki !

jsheunis commented 2 years ago

And thanks for the pointers to your julia package and osfclient 👍

notZaki commented 2 years ago

@emdupre has also made csv files for fetching data, but I don't remember how that was done.

emdupre commented 2 years ago

I had just grabbed them with Python requests; here's a short gist demonstrating the idea.

That really works best for flat directory structures, but for more nested ones you'll have to add another loop! At some point I tried osfclient, but that might have been between OSF API versions, so IIRC it wasn't yet updated. I haven't tried more recently, though!
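
For anyone without the gist handy, the idea is roughly the following (a sketch against the OSF v2 API as I understand it; the project ID is a placeholder, and nested folders would need that extra loop, following each folder's related files link):

# list the root of a project's osfstorage and print name, kind, and download URL (empty for folders)
curl -s "https://api.osf.io/v2/nodes/<project-id>/files/osfstorage/" \
  | jq -r '.data[] | [.attributes.name, .attributes.kind, (.links.download // "")] | @tsv'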

jsheunis commented 2 years ago

Thanks! I'll update here in case I try the recent osfclient.

tsalo commented 1 year ago

Is there a good way to use the datalad Python tool or repo2data to grab only a single folder from a G-Node GIN or datalad dataset? I think installing the whole dataset would take too long in some cases (e.g., with the Cambridge and Le Petit Prince fMRIPrep derivatives).

jsheunis commented 1 year ago

@tsalo Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)? And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

With datalad you can clone (a.k.a. install) the whole dataset easily, e.g.:

$ datalad clone https://github.com/jsheunis/multi-echo-cambridge-fmriprep.git

This clones the dataset's git repo and some datalad config files, but no file content. It takes a few seconds. And then you can get (and drop) specific file content on demand, e.g. all files within a directory at a specified relative path:

$ cd multi-echo-cambridge-fmriprep
$ datalad get sub-20847/figures/*

get(ok): sub-20847/figures/sub-20847_task-rest_desc-rois_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-carpetplot_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-summary_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_space-MNI152NLin2009cAsym_T1w.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-summary_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-confoundcorr_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-conform_T1w.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-validation_bold.html (file) [from web...]
get(ok): sub-20847/figures/sub-20847_task-rest_desc-compcorvar_bold.svg (file) [from web...]
get(ok): sub-20847/figures/sub-20847_desc-about_T1w.html (file) [from web...]
  [2 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (ok: 12)

tsalo commented 1 year ago

Sorry for the confusion.

Just to be sure we're talking about the same things, with "grab only a single folder" do you refer to retrieving actual file content, or just getting the file tree (from git)?

I'm referring to just getting the file tree.

And with "installing a whole dataset" do you mean install in the datalad sense (where the git repo is cloned, but file content is not (yet) retrieved), or do you mean retrieving all data locally?

I'm referring to installing in the datalad sense.

With datalad you can clone (a.k.a. install) the whole dataset easily

My concern is that cloning the Le Petit Prince fMRIPrep derivatives with datalad clone https://gin.g-node.org/ME-ICA/ds003643-fmriprep-derivatives took several hours on my laptop, so I'm worried that running that on each build of the Jupyter Book would be an issue. I was hoping there might be a way to limit it to just a single subject's data.

Maybe more is indexed with git (vs. git-annex) on G-Node GIN by default, but it seemed like most non-NIfTI files were downloaded in the clone step.

jsheunis commented 1 year ago

Thanks for clarifying, and for the link to the repo. It looks like the dataset has too many files in git vs. git-annex. If you used datalad to create the dataset, you can control this via configurations: https://handbook.datalad.org/en/latest/basics/101-122-config.html

A way you can amend the dataset such that files are moved from git to git-annex (and removed from the git history) is described here: http://handbook.datalad.org/en/latest/beyond_basics/101-162-springcleaning.html#getting-contents-out-of-git.
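
To illustrate the configuration side (my understanding of git-annex's largefiles matching; please double-check against the handbook), the rule in .gitattributes is what decides whether newly saved files go to git-annex or to git, e.g.:

# make everything go to git-annex rather than git from now on
echo '* annex.largefiles=anything' >> .gitattributes
datalad save -m "annex all new or changed files"
# (the cfg_text2git run-procedure instead keeps small text files in git:
#  * annex.largefiles=((mimeencoding=binary)and(largerthan=0kb)) )

Note that this only affects files added or modified after the change; moving files that are already in the git history over to git-annex needs the spring-cleaning steps linked above.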

This handbook chapter also describes other ways to keep dataset size small, e.g. using subdatasets per subject: http://handbook.datalad.org/en/latest/beyond_basics/101-161-biganalyses.html#calculate-in-greater-numbers

tsalo commented 1 year ago

Ohhhh thanks! I'll try modifying the dataset. That will make using it way easier!

Do you have a recommendation for downloading the data for this book? Should we use datalad to clone the dataset and install one subject's data in a separate script (e.g., the download_data chapter), or can we use repo2data for this?

jsheunis commented 1 year ago

Do you mean when downloading data for the book during the building process? I would say datalad is a good option, yes, if we have all datasets available as datalad datasets (that was what I intended when creating this issue), and if the infrastructure we're running the build process or the notebooks on has the requirements for datalad installed. I see there's a GitHub Actions workflow using Ubuntu to build the book, so it will be easy to add steps for installing git-annex and datalad.
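
Concretely, on the Ubuntu runner that would just be something like (a sketch; exact package sources/versions may differ):

sudo apt-get update && sudo apt-get install -y git-annex   # git-annex from the Ubuntu repositories
pip install datalad                                        # datalad itself; git is already available on the runners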

It looks like all the publicly available datasets listed in the book are already included in the multi-echo-super dataset here: https://github.com/jsheunis/multi-echo-super/tree/main/raw, and the derivatives are added as they are made available, so I think datalad should work.

The way to access individual subjects' files of specific datasets would then be:

datalad clone https://github.com/jsheunis/multi-echo-super # clones the superdataset, which is aware of its linked subdatasets, but these aren't cloned yet

# let's say we're interested in EuskalIBUR
cd multi-echo-super
datalad get --no-data raw/EuskalIBUR # this clones the subdataset at the provided path relative to the superdataset, but doesn't retrieve data content

# let's say we're interested in all data of "sub-001/ses-01"
cd raw/EuskalIBUR
datalad get sub-001/ses-01/*

# or if we want a very specific file
datalad get sub-001/ses-01/anat/sub-001_ses-01_T2w.nii.gz