Washington-University / HCPpipelines

Processing pipelines for the HCP
https://www.humanconnectome.org/
Other
542 stars 272 forks source link

the sICA reclean pipeline #279

Closed cyang31 closed 11 months ago

cyang31 commented 11 months ago

Background

The sICA+FIX used for the official release of HCP-YA, HCP-D and HCP-A has imperfect component predictions. FIX also does not scale efficiently when applied to larger datasets due to the limitations imposed by the SVM classifiers. A reclean pipeline aims to correct these predictions based on several lightweight base learners trained on large-scale sICA components.

Main workflow ICAFIX/ApplyAutoRecleanPipeline.sh

Pretrained models

The pretrained models are saved under folder ICAFIX/rclean_models in the format of joblib from python.

Evaluation

The tests are run for a single-run reclean and a multi-run reclean, and they both finished smoothly.

SR:

MR:

Next step

  1. The singularity container is too large to share on GitHub (11GB), so it needs to be shared using a better way. We can paste a OneDrive or Box link to the README, and maybe provide with a recipe. Need to discuss with @coalsont and @glasserm.
  2. After this PR is merged, @demsarjure needs to integrate it into QuNex.
cyang31 commented 11 months ago

Thanks Alex. MapVolumeToCIFTI is a global script not a pipeline specific script. Also, shouldn't this be a per-subject pipeline rather than a a group level pipeline? That is call once per subject rather than with a subject list? That way it can be parallelized across subjects.

Okay, I will move MapVolumeToCIFTI to the global/scripts.

This pipeline can also be run per-subject if a user just feeds one subject to the argument "--subject-list". Shall we force this pipeline to take one subject at a time instead?

glasserm commented 11 months ago

I guess the comparable pipelines are 1 subject at a time. Open to others thoughts on this though.

cyang31 commented 11 months ago

The reclean pipeline is now changed to per-subject pipeline.

cyang31 commented 11 months ago

I am reposting my message from Slack here:

I skimmed over the code and I see that you are using a Singularity container for python3. Inside the QuNex container we already have python3 so running a singularity container within a singularity container is not necessary and a bit funky. Could you change the code so if --python-singularity is set to "NONE" or something similar commonly used in HCP Pipelines, then python3 is ran directly, without singularity exec.

Hi Jure, I wonder if such a functionality "--python-singularity=NONE" should be exclusively implemented inside the Qunex. I can provide you with a list of Python modules that are needed to be installed. Python environment setup for HCP pipeline users hasn't been explored before and I assume many existing users might be hesitant to undertake the route of conda/python virtualenv, that's why singularity container is used here since it is much easier to use directly.

Or, we can do something like this (pseudo-code) for the HCP pipeline under the non-qunex version,

if PythonSingularity is None:
  raise ValueError("please run the reclean pipeline within a singularity container")
else:
  singularity exec PythonSingularity python3 RecleanClassifierInference.py

And for the qunex version:

if PythonSingularity is None:
  conda activate sICAReclean # switch to a conda environment
  python3 RecleanClassifierInference.py
else:
  raise ValueError("please don't specify a singularity container for the reclean pipeline within qunex")
demsarjure commented 11 months ago

How about a simpler solution:

if [ $PythonSingularity == "NONE" ]; then
    pythonCode=(
        "python3 $HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py"
        "--input_csv=$RecleanFeaturePath"
        "--input_fix_prob_csv=$FixProbPath"
        "--fix_prob_threshold=$FixProbThresh"
        "--trained_folder=$ModelFolder"
        "--model=$ModelToUse"
        "--output_folder=$PredictionResult"
        "--voting_threshold=$VoteThresh"
        "--reclassify_as_signal_file=$ReclassifyAsSignalTxt"
        "--reclassify_as_noise_file=$ReclassifyAsNoiseTxt"
    )
else
    pythonCode=(
        "singularity exec --bind /media $PythonSingularity python3 $HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py"
        "--input_csv=$RecleanFeaturePath"
        "--input_fix_prob_csv=$FixProbPath"
        "--fix_prob_threshold=$FixProbThresh"
        "--trained_folder=$ModelFolder"
        "--model=$ModelToUse"
        "--output_folder=$PredictionResult"
        "--voting_threshold=$VoteThresh"
        "--reclassify_as_signal_file=$ReclassifyAsSignalTxt"
        "--reclassify_as_noise_file=$ReclassifyAsNoiseTxt"
    )
fi

And I take care of the python env setup within QuNex? There is already a conda environment called QuNex within the container and I can easily add the needed packages (if they are not already in there that is).

cyang31 commented 11 months ago

How about a simpler solution:

if [ $PythonSingularity == "NONE" ]; then
    pythonCode=(
        "python3 $HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py"
        "--input_csv=$RecleanFeaturePath"
        "--input_fix_prob_csv=$FixProbPath"
        "--fix_prob_threshold=$FixProbThresh"
        "--trained_folder=$ModelFolder"
        "--model=$ModelToUse"
        "--output_folder=$PredictionResult"
        "--voting_threshold=$VoteThresh"
        "--reclassify_as_signal_file=$ReclassifyAsSignalTxt"
        "--reclassify_as_noise_file=$ReclassifyAsNoiseTxt"
    )
else
    pythonCode=(
        "singularity exec --bind /media $PythonSingularity python3 $HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py"
        "--input_csv=$RecleanFeaturePath"
        "--input_fix_prob_csv=$FixProbPath"
        "--fix_prob_threshold=$FixProbThresh"
        "--trained_folder=$ModelFolder"
        "--model=$ModelToUse"
        "--output_folder=$PredictionResult"
        "--voting_threshold=$VoteThresh"
        "--reclassify_as_signal_file=$ReclassifyAsSignalTxt"
        "--reclassify_as_noise_file=$ReclassifyAsNoiseTxt"
    )
fi

And I take care of the python env setup within QuNex? There is already a conda environment called QuNex within the container and I can easily add the needed packages (if they are not already in there that is).

The reason why I didn't recommend this simple approach is that " $PythonSingularity == "NONE" " will throw an error for the non-qunex hcp pipeline, if the user's local original python environment doesn't match with what is inside the singularity container. In this way, we should give them an instruction to install our recommended python environment anyway, which I am not sure if this is what @glasserm and @coalsont would prefer.

demsarjure commented 11 months ago

Yes, but you can make $PythonSingularity not equal to NONE as a default so only users that know what they are doing will turn this off. I would prefer not having another python3 environment inside the container. But if that is the route you really want to take, I can work with that as well.

cyang31 commented 11 months ago

Yes, but you can make $PythonSingularity not equal to NONE as a default so only users that know what they are doing will turn this off. I would prefer not having another python3 environment inside the container. But if that is the route you really want to take, I can work with that as well.

Yes, this makes more sense. May I ask what are the scripts that are run via the conda QuNex? And where can I find the current installed modules for conda QuNex?

demsarjure commented 11 months ago

You can enter the QuNex container (singularity shell <container_path>) and source /opt/qunex/env/qunex_environment.sh this will load the qunex conda environment. You can then pip list. Here is the printout (a lot of packages are part of "core" python, we do not install all of this):

alabaster                     0.7.12
astcheck                      0.3.0
astsearch                     0.2.0
attrs                         23.1.0
Babel                         2.11.0
backcall                      0.2.0
beautifulsoup4                4.12.2
bleach                        6.0.0
brotlipy                      0.7.0
certifi                       2022.12.7
cffi                          1.15.1
charset-normalizer            2.0.4
colorama                      0.4.6
comm                          0.1.4
cryptography                  39.0.1
cycler                        0.11.0
Cython                        0.29.33
debugpy                       1.6.7.post1
decorator                     5.1.1
defusedxml                    0.7.1
dill                          0.3.7
docutils                      0.18.1
entrypoints                   0.4
fastjsonschema                2.18.0
fonttools                     4.25.0
fslpy                         3.13.3
gradunwarp                    1.2.1+hcp
h5py                          3.8.0
hcpasl                        0.1.4.post1
idna                          3.4
imageio                       2.31.2
imagesize                     1.4.1
importlib-metadata            4.11.3
importlib-resources           5.12.0
ipykernel                     6.16.2
ipython                       7.34.0
ipywidgets                    8.1.0
jedi                          0.19.0
Jinja2                        3.1.2
joblib                        1.3.2
jsonschema                    4.17.3
jupyter_client                7.4.9
jupyter_core                  4.12.0
jupyterlab-pygments           0.2.2
jupyterlab-widgets            3.0.8
kiwisolver                    1.4.4
lxml                          4.9.3
MarkupSafe                    2.1.1
matplotlib                    3.5.3
matplotlib-inline             0.1.6
mistune                       3.0.1
mkl-fft                       1.3.1
mkl-random                    1.2.2
mkl-service                   2.4.0
munkres                       1.1.4
nbclient                      0.7.4
nbconvert                     7.6.0
nbformat                      5.8.0
nbparameterise                0.5
nest-asyncio                  1.5.7
nibabel                       3.2.2
nilearn                       0.10.1
nose                          1.3.7
numpy                         1.21.6
packaging                     21.3
pandas                        1.3.5
pandocfilters                 1.5.0
parso                         0.8.3
pexpect                       4.8.0
pickleshare                   0.7.5
Pillow                        9.4.0
pip                           22.3.1
pkgutil_resolve_name          1.3.10
platformdirs                  3.10.0
ply                           3.11
pooch                         1.7.0
prompt-toolkit                3.0.39
psutil                        5.9.5
ptyprocess                    0.7.0
pycparser                     2.21
pydicom                       2.4.2
pyfab                         0.3.5
Pygments                      2.11.2
pyOpenSSL                     23.0.0
pyparsing                     3.0.9
PyQt5-sip                     12.11.0
pyrsistent                    0.19.3
PySocks                       1.7.1
python-dateutil               2.8.2
pytz                          2022.7
pyvista                       0.38.6
pyzmq                         25.1.1
regtricks                     0.3.6
requests                      2.28.1
rpy2                          2.8.6
scikit-learn                  1.0.2
scipy                         1.7.3
scooby                        0.7.2
setuptools                    65.6.3
sip                           6.6.2
six                           1.16.0
snowballstemmer               2.2.0
soupsieve                     2.4.1
Sphinx                        5.0.2
sphinxcontrib-applehelp       1.0.2
sphinxcontrib-devhelp         1.0.2
sphinxcontrib-htmlhelp        2.0.0
sphinxcontrib-jsmath          1.0.1
sphinxcontrib-qthelp          1.0.3
sphinxcontrib-serializinghtml 1.1.5
threadpoolctl                 3.1.0
tinycss2                      1.2.1
toblerone                     0.8.2
toml                          0.10.2
tornado                       6.2
tqdm                          4.66.1
traitlets                     5.9.0
trimesh                       3.23.5
typing_extensions             4.7.1
urllib3                       1.26.8
vtk                           9.2.6
wcwidth                       0.2.6
webencodings                  0.5.1
wheel                         0.38.4
widgetsnbextension            4.0.8
zipp                          3.11.0
cyang31 commented 11 months ago

@demsarjure Jure, which way do you prefer to install the python modules? I can provide you with a list outputted by pip list in the container or do you want me to share the singularity with you and just let you play with it?

demsarjure commented 11 months ago

I think I have all the information I need, so you do not need to do anything I believe :). Based on the Python scripts in this pipeline I feel like the only thing we need is XGBoost?

cyang31 commented 11 months ago

I think I have all the information I need, so you do not need to do anything I believe :). Based on the Python scripts in this pipeline I feel like the only thing we need is XGBoost?

That could be true, but you may also want to check if the models in joblib format (from this PR) can be properly loaded after you installed the XGBoost module in your container. Otherwise, you may have to adjust the package version to what I used to generate these model weights:

To check if the models can be loaded properly, here's a start for the code:

import joblib

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier 
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

model=joblib.load('ICAFIX/rclean_models/Xgboost.joblib')
mharms commented 11 months ago

Regarding all this PythonSingularity stuff: I understand why @cyang31 used a container while developing this, but do we want/need to support that going forward in the HCPpipelines itself? Personally, it would seem easier to me to maintain this Reclean pipeline going forward if we eliminate this "dev" container entirely, and just tell users to run it in the context of qunex.

glasserm commented 11 months ago

No we are not requiring users to run QuNex to run the HCP Pipelines.

mharms commented 11 months ago

Sure, but this pipeline is fundamentally different, and would require them to obtain and use a different container instead. Given that, why not just make qunex the sole supported container?

glasserm commented 11 months ago

Because QuNex elected to completely rewrite the usage for the HCP Pipelines and has its own group of developers and interests. Running an container as a part of HCP Pipelines is not at all a big deal. This will likely be required for the temporal ICA classifier as well.

glasserm commented 11 months ago

Alex if there are actually only a small number of python dependencies, please make a non-container mode for this as well.

mharms commented 11 months ago

Running it might not be big deal per se, but I think that over time maintaining it might be an issue. And then there's the issue of how to distribute it, which would be less of an issue if it wasn't so large. Does it really need to be 11 GB? (Can't it just be a lightweight container with the minimal necessary set of python modules?)

cyang31 commented 11 months ago

Matt, reclean pipeline needs only a small number of python dependencies but temporal ICA classifier needs lots more. I see Mike's point. I think it should be doable to match the python environment of qunex singularity container with the singularity container I have for reclean or temporal ICA. In this case, we only need one container to run all the python related pipelines.

mharms commented 11 months ago

And can the size of that be much smaller than 11 GB?

glasserm commented 11 months ago

Yes we should have at most only one container, but instructions for a non-container version and an option for that would also be appreciated.

cyang31 commented 11 months ago

And can the size of that be much smaller than 11 GB?

Note that the singularity is used on my end for BOTH sICA reclean/retraining and tICA classification/classifier retraining. After checking the singularity I developed, the 11GB size can be broken down into these main pieces:

glasserm commented 11 months ago

Here are the limits on file sizes: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage

Not sure which apply to us, but I don't think we are using the Free version as I get a bill every so often.

cyang31 commented 11 months ago

That's why I think a WUSTL Box or OneDrive links to share the singularity might be a better choice.

mharms commented 11 months ago

Are there lightweight versions of any of the things that you listed that would be smaller in size, but satisfy what is needed? e.g., could we get by with a sub-set of the nvidia stuff? Does the default PyTorch come with stuff like pre-defined models that we don't need?

cyang31 commented 11 months ago

The CUDA stuff might be super hard to shrink its size. What I think might be a workaround strategy is that: we don't need to provide the GPU interface, as well as the ability for the users to retrain the classifiers under GPU mode. In this case, we can eliminate all the CUDA stuff and install a CPU-only torch instead of a CPU+GPU torch, which should decrease the total size a lot by removing the 2.6GB CUDA and part of the 3.5GB torch. But meanwhile, the users can only use the model weights to inference on their ends, not retraining. That's a tradeoff which needs discussion if we want to take this solution.

mharms commented 11 months ago

Is there a retraining option within the Reclean pipeline itself?

cyang31 commented 11 months ago

Is there a retraining option within the Reclean pipeline itself?

No, I haven't integrated it into the pipeline itself. The current training is performed locally on brainmapper.

demsarjure commented 11 months ago

Where is pytorch even used? In the Python script, you are using sklearn's MLP?

cyang31 commented 11 months ago

Where is pytorch even used? In the Python script, you are using sklearn's MLP?

Hi Jure, because the tICA classifier will use features from deep learning models. I think it would be better to use a single container for both the reclean and tICA pipelines.

Yes, sklearn is the package mainly used for reclean pipeline.

demsarjure commented 11 months ago

Makes sense, thanks.

mharms commented 11 months ago

Is there a retraining option within the Reclean pipeline itself?

No, I haven't integrated it into the pipeline itself. The current training is performed locally on brainmapper.

Well, if the pipeline itself isn't going to support retraining, then I see no need for the publicly-released container to support that capability.

coalsont commented 11 months ago

I didn't expect my cleanup of a master merge to close this pull request, sorry. It can be recreated based on the same branch names, so I'm not sure exactly why github doesn't want to reopen it.

coalsont commented 11 months ago

Ah, I see, I pushed our master over @cyang31, because I forgot to use : notation on the push. I'll sort this out with him.

cyang31 commented 11 months ago

Thanks @coalsont. Everything is back to normal now.

coalsont commented 11 months ago

I don't really like the idea of putting a container image into the repository, period. Maybe a container build script, but not a container image, especially if the container is easy to build (and building won't break in the future by old versions becoming unavailable or updated dependencies being incompatible - we try to minimize the data files we put into the pipeline, and to some extent the amount of updating of compiled matlab binaries, for similar size reasons - remember, the default clone retrieves every commit ever made, which means every pushed version of every large file). The other pipelines simply expect/check that the expected tools are available in the current environment, and so the environment setup (including the container launch, if applicable) needs to be done in advance of starting the pipeline. We already have a dependency on the R-based version of fix, which is probably a less prevalent language than python.

mharms commented 11 months ago

I agree, although I don't know if anyone is (still) advocating putting the container into the repo itself. I'm arguing that we make it is as small as possible, but regardless of its size, if we are going to make a pre-built container image available that we host that somewhere other than in the HCPpipelines repo.

demsarjure commented 11 months ago

Are there any updates regarding this? What is an approximate timeline?

I am waiting for the outlined issues above to be resolved (mainly the possibility of running the pipeline without singularity, with PythonSingularity == "NONE"). I plan to integrate into QuNex once the last details are resolved and this is merged into master.

cyang31 commented 11 months ago

@demsarjure Hi Jure, I think @coalsont and @glasserm are caught up on something else and might be available to review this PR next week. At the meantime, I will create a new commit soon to include 1. instruction to install the necessary packages in conda, 2. the functionality of using PythonSingularity == "NONE". I assume, hopefully, the end of next week will be a reasonable time to get this PR merged.

Have you tried to load pretrained models using the qunex container?

glasserm commented 11 months ago

Jure I don't think there are likely to be major changes re QuNex. We need a mode that is not singularity inside of singularity. We need the code to build the singularity container in the repository. We should provide two options: Run in OS (what QuNex will also use) and a singularity container, which we could host somewhere other than GitHub. We should configure brainmappers to work with Run in OS.

demsarjure commented 11 months ago

OK, I will start integrating into QuNex and testing on the cyang31 branch. So once you merge into master, we are more or less good to go.

demsarjure commented 11 months ago

@cyang31 Yes, I was able to setup the environment. Three of the models in ICAFIX/rclean_models are loaded without any warnings. While loading XgboostEnsemble.joblib says ModuleNotFoundError: No module named 'xgboost_ensemble_classifier'. What is the story here, this is not a module that ships with the xgboost package?

cyang31 commented 11 months ago

Hi Jure, @demsarjure there is a script in the folder that defines this module.

demsarjure commented 11 months ago

Ah, I found it. All models now load properly. Thanks.

demsarjure commented 11 months ago

The folder name with models is /ICAFIX/rclean_models, is this intentional or is rclean a typo and it should be reclean?

cyang31 commented 11 months ago

The folder name with models is /ICAFIX/rclean_models, is this intentional or is rclean a typo and it should be reclean?

Hi Jure, this is intentional. The reason is that in the next release of Young Adult and Lifespan datasets, we add substring "rclean" to fMRI dense timeseries if it has been through the "sICA reclean" process, and substring "tclean" if it has been through the "temporal ICA cleanup" process. So, I keep the consistency here for the model folder.

For instance,

cyang31 commented 11 months ago

A latest update about the reclean pipeline:

glasserm commented 11 months ago

Can we use 1 as the native environment and 2 as the singularity call?

Thanks,

Matt.

From: Chunhui Yang @.> Reply-To: Washington-University/HCPpipelines @.> Date: Monday, December 18, 2023 at 1:32 PM To: Washington-University/HCPpipelines @.> Cc: "Glasser, Matt" @.>, Mention @.***> Subject: Re: [Washington-University/HCPpipelines] the sICA reclean pipeline (PR #279)

A latest update about the reclean pipeline:

— Reply to this email directly, view it on GitHubhttps://github.com/Washington-University/HCPpipelines/pull/279#issuecomment-1861421892, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AALSNTHFU6WKFX4HAPKPHP3YKCK3ZAVCNFSM6AAAAABADXIMC6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRRGQZDCOBZGI. You are receiving this because you were mentioned.Message ID: @.***>


The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

cyang31 commented 11 months ago

Can we use 1 as the native environment and 2 as the singularity call? Thanks, Matt. From: Chunhui Yang @.> Reply-To: Washington-University/HCPpipelines @.> Date: Monday, December 18, 2023 at 1:32 PM To: Washington-University/HCPpipelines @.> Cc: "Glasser, Matt" @.>, Mention @.> Subject: Re: [Washington-University/HCPpipelines] the sICA reclean pipeline (PR #279) A latest update about the reclean pipeline: Volume+MSMSulc+MSMAll will all be cleaned a "rclean" substring will be added to the recleaned files now it supports 1) a singularity call of python command, 2) a python command from native environment (installed via conda as instructed in ICAFIX/README.md) an example script to demonstrate the usage Examples/Scripts/ApplyAutoRecleanPipelineBatch.sh — Reply to this email directly, view it on GitHub<#279 (comment)>, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AALSNTHFU6WKFX4HAPKPHP3YKCK3ZAVCNFSM6AAAAABADXIMC6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTQNRRGQZDCOBZGI. You are receiving this because you were mentioned.Message ID: @.> ____ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail.

Hi Matt, the choice of the call type is not ordered by 1 or 2. It is controlled by the argument --python-singularity. If this argument is not specified then it will choose to run native environment with whatever python interpreter specified by Tim's new suggestion --python-interpreter or just default native environment if it is not specified either.

coalsont commented 11 months ago

If we internally support singularity, we might add support for something else in the future, making this option no longer binary. So, it should probably be named --python-mode, with initial possibilities of native or singularity. Using numbers to represent them may not be the best choice, as strings would be more obvious.