SIESTA-eu / wp15

work package 15, use case 2

use case 2.3 - multimodal dataset #4

Open robertoostenveld opened 7 months ago

robertoostenveld commented 7 months ago

The National Institute of Mental Health (NIMH) Intramural Healthy Volunteer Dataset

The data set includes a large amount of tabular data characterising human participants, structural MRI, and MEG data.

The results to be produced are of the tabular type: computing xxx on the tabular data, and xxx from the brain imaging data. XXX

Relies on MATLAB/FieldTrip; the dataset includes 157 participants (357 GB).

robertoostenveld commented 6 months ago

Following discussion with @schoffelen and @marcelzwiers we decided to change this use case to another multimodal dataset also available from openneuro: https://doi.org/10.18112/openneuro.ds000117.v1.0.6

This is a better-known dataset for which it will take us less time to implement an interesting pipeline, as it is used in the SPM manual and in quite a few of the publications in this special issue.

robertoostenveld commented 6 months ago

@schoffelen given the failure so far to make a MATLAB based pipeline in #5, would it make sense to combine this dataset with the pipeline at https://github.com/robertoostenveld/Wakeman-and-Henson-2015?

My pipeline on this data (*) predates the BIDS version of the dataset and the subject specific detail files are not strictly needed any more, as BIDS enforces consistency. In any case it will need some updating of the pipeline code.

*) I made this for the 2016 BIOMAG, at which we also started discussing BIDS-MEG

schoffelen commented 6 months ago

Yes, this, or we could consider the code that we wrote for practicalEEG in 2019: https://www.fieldtriptoolbox.org/workshop/paris2019/

schoffelen commented 6 months ago

Note for reference here: datalad does not work on a CentOS node, because the git version there is too old. Development of the pipeline at the DCCN should therefore be done on a SLURM node.

Also note (JM will update this): the instructions in the README.md should also specify the installation of git-annex:

datalad-installer git-annex -m datalad/git-annex:release --install-dir venv

Subsequently, the git-annex-specific files need to be moved from venv/usr/lib and venv/usr/bin into venv/lib and venv/bin (or otherwise export the PATH so that git-annex can be found later on).
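
The move/PATH juggling described above could be sketched as follows; the venv location is an assumption and must match the --install-dir that was passed to datalad-installer:

```shell
# Sketch of the post-install step described above; the venv location is an
# assumption and must match the --install-dir passed to datalad-installer.
VENV="$PWD/venv"
# Either move the git-annex files into the standard venv layout:
#   mv "$VENV/usr/lib/"* "$VENV/lib/"
#   mv "$VENV/usr/bin/"* "$VENV/bin/"
# ... or simply prepend the as-installed location to the PATH:
export PATH="$VENV/usr/bin:$PATH"
echo "$PATH" | cut -d: -f1
```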

robertoostenveld commented 6 months ago

In the READMEs we should document the (minimum) version of system-wide dependencies like git, as was also detected by @marcelzwiers in #11 for node.js.
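
A documented minimum version can also be checked mechanically, e.g. in a setup script. A sketch, where "2.25" is a placeholder and not the actual minimum that datalad requires:

```shell
# Sketch of a minimum-version check for a system-wide dependency; "2.25" is a
# placeholder, not the actual minimum version that datalad requires.
version_ge() {
  # true when $1 >= $2 under version-number ordering (GNU/BSD sort -V)
  [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}
version_ge "2.39.1" "2.25" && echo "git new enough"
version_ge "1.8.3"  "2.25" || echo "git too old"
```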

I already had used datalad in the past on my MacBook, which is probably why the dependencies were already satisfied for me.

robertoostenveld commented 6 months ago

But some dependencies are not trivial to detect, especially when it "just works" for the first person to implement (part of) the pipeline.

robertoostenveld commented 6 months ago

The paris2019 pipelines are already documented in quite some detail - and of course implemented for the BIDS version. Using them would probably also make the use case the heaviest regarding compute requirements (it is already in terms of data size). Is it desirable to have such a big computational load? It will make running it repeatedly by different people (which I would expect to be required) not so attractive.

Looking at the sequence, I think that the frequency analysis part (day one, afternoon) can be skipped, as the group source stats does not depend on it. I am not sure whether there is a nice group-level end-point of the analysis. I recall that the group statistics were a bit disappointing and that we did not further optimize the processing over participants. Perhaps we may need to consider changing the last step (group source stats).

But returning to the computational load: should we not first get something lighter in place and start discussing with SIESTA partners based on that, rather than making it too ambitious?

robertoostenveld commented 6 months ago

@schoffelen could you give it a try with the FT version in https://github.com/fieldtrip/fieldtrip/pull/2416 ?

schoffelen commented 6 months ago

Yes, will do. An initial try seems to be running through fine. I'll stop for now and deploy it once I leave the office (and just for the fun of it do all subjects, rather than 10:16). For now it is too annoying that each time a figure is created my other MATLAB session is interrupted (and the figure pops up).

robertoostenveld commented 6 months ago

Oh, silly mistake with the 10:16 subset 🤦. Please fix it, and other stuff you might find.

The figures are indeed annoying. For me the whole computation for all 16 participants was done in some 4 hours or so, but that was with only a single run from the 6. You may also want to update this line

https://github.com/SIESTA-eu/wp15/blob/7df31b7793f89a115ce5515239265af1183ace58/usecase-2.3/analyze_all_subjects.m#L7

to have it execute all runs. Do we have some quota on the HPC cluster that we could use to execute this? I have an old project (3011231.02) that is still active and that has 650GB of storage that we could use as scratch space. I will add you to it as collaborator.

schoffelen commented 6 months ago

Marcel and I have some storage space on our staff scientists' project number, I think that the amount of space there is also on the order of 600 GB or so.

robertoostenveld commented 4 months ago

Both use case 2.3 and 2.4 are based on MATLAB, and both compute an ERP or ERF. To increase the diversity of the use cases, we could modify 2.3 so that it uses MNE-Python to compute the ERFs instead of FieldTrip.

What do you think @schoffelen ? This would mainly require your input. The pipeline could be made even simpler than the current one, for example only compute the N170 difference ERF over the first run (from 6) and then average over participants. No artifact rejection or such, just read+segment+average+grandaverage+difference.

schoffelen commented 4 months ago

My suggestion would be to first simplify the pipeline in MATLAB to prototype it. Given the challenges in getting 2.4 robustly up and running, it may be good to have a FieldTrip-only MATLAB use case without the challenging eeglab/limo dependencies.

I think I could pull off a simple MNE-based pipeline as sketched above.

robertoostenveld commented 3 months ago

@schoffelen, please note that I did another cleanup of the code. Besides making use of participants (allowing for re-pseudonymization and leave-one-out resampling), an important change is that the analysis is now done on the MaxFiltered files.

The bias in the group-level statistics due to the different number of trials in the combined planar averages still needs to be addressed.

robertoostenveld commented 3 months ago

The bias has also been addressed in the recent commit.

The next step is to containerize this pipeline in an apptainer. The https://hub.docker.com/r/mathworks/matlab docker images might provide a good starting point. The license for those is implemented by setting an environment variable that points to the network license manager that is to be contacted.
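
A minimal Apptainer definition along these lines could look like the sketch below; the MATLAB release tag (r2023b) is a placeholder, and the license server is supplied at run time via MLM_LICENSE_FILE rather than baked into the image:

```shell
# Sketch of a minimal Apptainer definition based on the MathWorks docker image;
# the MATLAB release tag (r2023b) is a placeholder, and the license server is
# passed in at run time with --env MLM_LICENSE_FILE=port@host.
cat > pipeline.def <<'EOF'
Bootstrap: docker
From: mathworks/matlab:r2023b

%environment
    # MLM_LICENSE_FILE (port@host) is supplied at run time with --env,
    # pointing MATLAB at the network license manager to be contacted
EOF
head -n 2 pipeline.def
```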

robertoostenveld commented 3 months ago

I have constructed a pipeline.def for the apptainer (including MATLAB) and tested it on a local linux computer and added it to the wp15 repo.

@marcelzwiers, could you try to build the actual apptainer (test 1) and then try to run it on the shared data in our project directory (test 2)?

robertoostenveld commented 3 months ago

@marcelzwiers I managed to execute the container on a SLURM compute node, but there are errors inside the container in the path, the data handling, and the analysis script. I have to fix those first, so don't bother testing yet.

robertoostenveld commented 2 months ago

@marcelzwiers I have updated the container and have been able to execute it on a SLURM compute node with 16 GB of RAM. Can you try whether you can also execute the participant- and group-level analysis according to the documentation?
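
For reference, a SLURM submission for the participant-level run could be sketched as a job script like the one below; the walltime, paths, and license server are placeholders to adapt locally:

```shell
# Sketch of a SLURM job script for the participant-level run, using the 16 GB
# that sufficed here; walltime, paths, and the license server are placeholders.
cat > run_participant.sh <<'EOF'
#!/bin/bash
#SBATCH --mem=16G
#SBATCH --time=04:00:00
apptainer run --no-home --env MLM_LICENSE_FILE=port@host \
    pipeline.sif ../usecase-2.3/input/ output participant
EOF
# submit with: sbatch run_participant.sh
```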

marcelzwiers commented 2 months ago

The first thing I get is:

$ apptainer run --no-home --env MLM_LICENSE_FILE=port@host pipeline.sif ../usecase-2.3/input/ output participant
mkdir: cannot create directory '/home/mrphys/marzwi/.MathWorks': Read-only file system

marcelzwiers commented 2 months ago

Then it continues and gives:

FieldTrip is developed by members and collaborators of the Donders Institute for Brain,
Cognition and Behaviour at Radboud University, Nijmegen, the Netherlands.

Please cite the FieldTrip reference paper when you have used FieldTrip in your study.
Robert Oostenveld, Pascal Fries, Eric Maris, and Jan-Mathijs Schoffelen. FieldTrip: Open
Source Software for Advanced Analysis of MEG, EEG, and Invasive Electrophysiological Data.
Computational Intelligence and Neuroscience, vol. 2011, Article ID 156869, 9 pages, 2011.
doi:10.1155/2011/156869.
-------------------------------------------------------------------------------------------
Warning: enabling online tracking of FieldTrip usage, see
http://www.fieldtriptoolbox.org/faq/tracking 
 In '/work/fieldtrip/utilities/ft_trackusage.m' at line 97
 In '/work/fieldtrip/ft_defaults.m' at line 409

Error using mkdir
Read-only file system

Error in analyze_participant (line 13)
mkdir(outputprefix);

robertoostenveld commented 2 months ago

What have you specified as the output directory and is it properly mounted in the container with the --bind option? Can you do apptainer shell --bind xxx:xxx pipeline.sif and check that the input and output directories are available inside the container?

robertoostenveld commented 2 months ago

The first thing I get is:

$ apptainer run --no-home --env MLM_LICENSE_FILE=port@host pipeline.sif ../usecase-2.3/input/ output participant
mkdir: cannot create directory '/home/mrphys/marzwi/.MathWorks': Read-only file system

I don't think this is an error, but just a warning that it cannot save MATLAB defaults to the home directory inside the container. I don't see a reason to assume that it affects pipeline execution.

marcelzwiers commented 2 months ago

What have you specified as the output directory and is it properly mounted in the container with the --bind option? Can you do apptainer shell --bind xxx:xxx pipeline.sif and check that the input and output directories are available inside the container?

The data is certainly there, as you can see if I list the input directory:

$ apptainer exec --no-home pipeline.sif ls -l ../usecase-2.3/input
total 164
-rw-r--r--+  1 nobody nogroup 1901 Jun 10 09:50 CHANGES
-rw-r--r--+  1 nobody nogroup 6060 Jun 10 09:50 README
-rw-r--r--+  1 nobody nogroup 1372 Jun 10 09:50 acq-mprage_T1w.json
-rw-r--r--+  1 nobody nogroup 1102 Jun 10 09:50 dataset_description.json
drwxr-xr-x+  4 nobody nogroup 4096 Jun 10 09:50 derivatives
-rw-r--r--+  1 nobody nogroup  333 Jun 10 09:50 participants.tsv
-rw-r--r--+  1 nobody nogroup   82 Jun 10 09:50 run-1_echo-1_FLASH.json
-rw-r--r--+  1 nobody nogroup   82 Jun 10 09:50 run-1_echo-2_FLASH.json
-rw-r--r--+  1 nobody nogroup   82 Jun 10 09:50 run-1_echo-3_FLASH.json
[etc]
marcelzwiers commented 2 months ago

I can also use mkdir normally:

dccn-c062/marzwi$ apptainer exec --no-home pipeline.sif mkdir test
dccn-c062/marzwi$ ls -l
total 2363892
drwxr-xr-x  2 marzwi       4096 Aug 28 11:20 test/
-rwxr-xr-x  1 marzwi 2411110400 Aug 27 14:25 pipeline.sif*

robertoostenveld commented 2 months ago

The way you specify the apptainer commands, you are relying on the local directory being mounted and writable from within the container. I think you should specify the full paths more explicitly and use --bind to mount them. Something like

apptainer run --bind /project/30xxxx/siesta/usecase-2.3/input:/work/input --bind /project/30xxxx/siesta/usecase-2.3/output:/work/output pipeline.sif /work/input /work/output participant

marcelzwiers commented 2 months ago

It was working when I did a direct bind of the input folder (as you suggested), but that behavior is just odd and was only working due to a bug in the input parsing (now fixed).

marcelzwiers commented 2 months ago

So on mentat001s this now all works normally (i.e. with the default DCCN binds):

$ apptainer run --no-home --env MLM_LICENSE_FILE=port@host pipeline.sif ../usecase-2.3/input/ output participant
$ apptainer run --no-home --env MLM_LICENSE_FILE=port@host pipeline.sif ../usecase-2.3/input/ output group

robertoostenveld commented 2 months ago

The consideration for specifying the input and output directories as environment variables to MATLAB was that this would allow for spaces, dashes/minuses, or other non-ideal symbols in the path name, and that I prefer the top-level analysis to be implemented as a script, not a function (for debugging reasons). When I am back, we should look into why my first implementation failed for you but worked for me. But for now I am happy that you can execute the pipeline without knowing in detail what it is doing.

marcelzwiers commented 2 months ago

The old code wasn't handling spaces either, but I have now fixed that. Using input arguments is best practice, and now running the code inside or outside the container is exactly the same. Debugging can of course easily be done using dbstop or by temporarily removing the first line.