SIESTA-eu / wp15

work package 15, use case 2

use case 2.3 - multimodal dataset #4

Open robertoostenveld opened 2 months ago

robertoostenveld commented 2 months ago

The National Institute of Mental Health (NIMH) Intramural Healthy Volunteer Dataset

The dataset includes a large amount of tabular data characterising the human participants, as well as structural MRI and MEG data.

Results to be produced are of the tabular type, computing xxx on the tabular data, and xxx from the brain imaging data. XXX

Relies on MATLAB/FieldTrip. The dataset includes 157 participants (357 GB).

robertoostenveld commented 1 month ago

Following discussion with @schoffelen and @marcelzwiers we decided to change this use case to another multimodal dataset also available from openneuro: https://doi.org/10.18112/openneuro.ds000117.v1.0.6

This is a better-known dataset for which it will take us less time to implement an interesting pipeline, as it is used in the SPM manual and in quite a few of the publications in this special issue.

robertoostenveld commented 1 month ago

@schoffelen given the failure so far to make a MATLAB based pipeline in #5, would it make sense to combine this dataset with the pipeline at https://github.com/robertoostenveld/Wakeman-and-Henson-2015?

My pipeline on this data (*) predates the BIDS version of the dataset, and the subject-specific detail files are no longer strictly needed, as BIDS enforces consistency. In any case the pipeline code will need some updating.

*) I made this for the 2016 BIOMAG, at which we also started discussing BIDS-MEG

schoffelen commented 1 month ago

Yes, this, or we could consider the code that we wrote for practicalEEG in 2019: https://www.fieldtriptoolbox.org/workshop/paris2019/

schoffelen commented 1 month ago

Note for reference here: datalad does not work on a CentOS node because the git version is too old. Development of the pipeline at the DCCN should therefore be done on a slurm node.

Also note (JM will update this): the instructions in the README.md should also specify the installation of git-annex:

datalad-installer git-annex -m datalad/git-annex:release --install-dir venv

followed by moving the git-annex-specific files from venv/usr/lib and venv/usr/bin into venv/lib and venv/bin (or otherwise exporting the PATH so that git-annex can be found later on).
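The moving step could be sketched as a small shell helper; this is only a sketch, and the venv/usr layout it assumes is based on the description above rather than on the documented datalad-installer behaviour:

```shell
# move_annex_tree: relocate files that datalad-installer placed under
# "$1/usr/bin" and "$1/usr/lib" into "$1/bin" and "$1/lib".
# The directory layout is an assumption based on the notes above.
move_annex_tree() {
  venv="$1"
  for d in bin lib; do
    if [ -d "$venv/usr/$d" ]; then
      mkdir -p "$venv/$d"
      # move each entry individually; skip silently if the directory is empty
      for f in "$venv/usr/$d"/*; do
        if [ -e "$f" ]; then mv "$f" "$venv/$d/"; fi
      done
    fi
  done
}
```

After `datalad-installer git-annex -m datalad/git-annex:release --install-dir venv`, one would call `move_annex_tree venv`; the alternative mentioned above is to leave the files in place and do `export PATH="$PWD/venv/usr/bin:$PATH"` instead.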

robertoostenveld commented 1 month ago

In the READMEs we should document the (minimum) version of system-wide dependencies like git, as was also detected by @marcelzwiers in #11 for node.js.
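A minimal version check along these lines could accompany the README or a setup script; the helper name is made up here, and the 2.19 threshold in the usage example is purely illustrative, not the actual documented minimum for datalad:

```shell
# check_min_version: succeed (exit 0) when version $1 is at least $2.
# Compares dotted version strings using GNU sort's version sort (-V).
check_min_version() {
  have="$1"; need="$2"
  # if the required minimum sorts first (or equal), the installed version is new enough
  [ "$(printf '%s\n%s\n' "$need" "$have" | sort -V | head -n1)" = "$need" ]
}
```

Usage would look something like `check_min_version "$(git --version | awk '{print $3}')" 2.19 || echo "git too old"`, with the real minimum version filled in once it is documented.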

I had already used datalad on my MacBook in the past, which is probably why the dependencies were already satisfied for me.

robertoostenveld commented 1 month ago

But some dependencies are not trivial to detect, especially when it "just works" for the first person to implement (part of) the pipeline.

robertoostenveld commented 1 month ago

The paris2019 pipelines are already documented in quite some detail, and of course implemented for the BIDS version. Using them would probably also make this the heaviest use case in terms of compute requirements (it already is in terms of data size). Is such a big computational load desirable? It would make running the pipeline repeatedly by different people (which I expect will be required) less attractive.

Looking at the sequence, I think that the frequency analysis part (day one, afternoon) can be skipped, as the group source statistics do not depend on it. I am not sure whether there is a nice group-level end-point of the analysis; I recall that the group statistics were a bit disappointing and that we did not further optimize the processing over participants. Perhaps we need to consider changing the last step (group source statistics).

But returning to the computational load: should we not first get something lighter in place and start discussing with SIESTA partners based on that, rather than making it too ambitious?

robertoostenveld commented 1 month ago

@schoffelen could you give it a try with the FT version in https://github.com/fieldtrip/fieldtrip/pull/2416 ?

schoffelen commented 1 month ago

Yes, will do. An initial try seems to be running through fine. I'll stop for now and deploy it once I leave the office (and, just for the fun of it, do all subjects rather than 10:16). For now it is too annoying that each time a figure is created my other MATLAB session is interrupted (and the figure pops up).

robertoostenveld commented 1 month ago

Oh, silly mistake with the 10:16 subset 🤦. Please fix it, and other stuff you might find.

The figures are indeed annoying. For me the whole computation for all 16 participants took some 4 hours, but that was with only a single run out of the six. You may also want to update this line

https://github.com/SIESTA-eu/wp15/blob/7df31b7793f89a115ce5515239265af1183ace58/usecase-2.3/analyze_all_subjects.m#L7

to have it execute all runs. Do we have some quota on the HPC cluster that we could use to execute this? I have an old project (3011231.02) that is still active and has 650 GB of storage that we could use as scratch space. I will add you to it as a collaborator.

schoffelen commented 1 month ago

Marcel and I have some storage space on our staff scientists' project number; I think the amount of space there is also on the order of 600 GB.