Thanks Alex. MapVolumeToCIFTI is a global script, not a pipeline-specific script. Also, shouldn't this be a per-subject pipeline rather than a group-level pipeline? That is, call it once per subject rather than with a subject list? That way it can be parallelized across subjects.
Okay, I will move MapVolumeToCIFTI to the global/scripts.
This pipeline can also be run per-subject if a user just feeds one subject to the argument "--subject-list". Shall we force this pipeline to take one subject at a time instead?
I guess the comparable pipelines are 1 subject at a time. Open to others thoughts on this though.
The reclean pipeline is now changed to a per-subject pipeline.
I am reposting my message from Slack here:
I skimmed over the code and I see that you are using a Singularity container for python3. Inside the QuNex container we already have python3, so running a singularity container within a singularity container is not necessary and a bit funky. Could you change the code so that if --python-singularity is set to "NONE" (or something similar commonly used in HCP Pipelines), python3 is run directly, without singularity exec?
Hi Jure, I wonder if such functionality ("--python-singularity=NONE") should be implemented exclusively inside QuNex. I can provide you with a list of Python modules that need to be installed. Python environment setup for HCP Pipelines users hasn't been explored before, and I assume many existing users might be hesitant to go the conda/python virtualenv route; that's why a singularity container is used here, since it is much easier to use directly.
Or, we can do something like this (pseudo-code) for the HCP pipeline under the non-qunex version,
if PythonSingularity is None:
    raise ValueError("please run the reclean pipeline within a singularity container")
else:
    singularity exec PythonSingularity python3 RecleanClassifierInference.py
And for the qunex version:
if PythonSingularity is None:
    conda activate sICAReclean  # switch to a conda environment
    python3 RecleanClassifierInference.py
else:
    raise ValueError("please don't specify a singularity container for the reclean pipeline within qunex")
How about a simpler solution:
if [ $PythonSingularity == "NONE" ]; then
    pythonCode=(
        "python3 $HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py"
        "--input_csv=$RecleanFeaturePath"
        "--input_fix_prob_csv=$FixProbPath"
        "--fix_prob_threshold=$FixProbThresh"
        "--trained_folder=$ModelFolder"
        "--model=$ModelToUse"
        "--output_folder=$PredictionResult"
        "--voting_threshold=$VoteThresh"
        "--reclassify_as_signal_file=$ReclassifyAsSignalTxt"
        "--reclassify_as_noise_file=$ReclassifyAsNoiseTxt"
    )
else
    pythonCode=(
        "singularity exec --bind /media $PythonSingularity python3 $HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py"
        "--input_csv=$RecleanFeaturePath"
        "--input_fix_prob_csv=$FixProbPath"
        "--fix_prob_threshold=$FixProbThresh"
        "--trained_folder=$ModelFolder"
        "--model=$ModelToUse"
        "--output_folder=$PredictionResult"
        "--voting_threshold=$VoteThresh"
        "--reclassify_as_signal_file=$ReclassifyAsSignalTxt"
        "--reclassify_as_noise_file=$ReclassifyAsNoiseTxt"
    )
fi
And I take care of the python env setup within QuNex? There is already a conda environment called QuNex within the container and I can easily add the needed packages (if they are not already in there that is).
The reason I didn't recommend this simpler approach is that "$PythonSingularity == "NONE"" will fail for the non-QuNex HCP Pipelines if the user's local python environment doesn't match what is inside the singularity container. In that case, we would need to give them instructions to install our recommended python environment anyway, and I am not sure whether that is what @glasserm and @coalsont would prefer.
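(As a rough sketch, such instructions might amount to something like the following; the package names are only my guess based on what the reclean scripts import, and the definitive list and versions would still need to be documented:)
# rough sketch only -- exact packages and versions to be confirmed
conda create -n sICAReclean python=3.9 -y
conda activate sICAReclean
pip install numpy pandas scikit-learn xgboost joblib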
Yes, but you can make $PythonSingularity not equal to NONE as a default, so only users that know what they are doing will turn this off. I would prefer not having another python3 environment inside the container. But if that is the route you really want to take, I can work with that as well.
Yes, this makes more sense. May I ask which scripts are run via the QuNex conda environment? And where can I find the currently installed modules for it?
You can enter the QuNex container (singularity shell <container_path>) and source /opt/qunex/env/qunex_environment.sh; this will load the qunex conda environment. You can then run pip list. Here is the printout (a lot of packages are part of "core" python, we do not install all of this):
alabaster 0.7.12
astcheck 0.3.0
astsearch 0.2.0
attrs 23.1.0
Babel 2.11.0
backcall 0.2.0
beautifulsoup4 4.12.2
bleach 6.0.0
brotlipy 0.7.0
certifi 2022.12.7
cffi 1.15.1
charset-normalizer 2.0.4
colorama 0.4.6
comm 0.1.4
cryptography 39.0.1
cycler 0.11.0
Cython 0.29.33
debugpy 1.6.7.post1
decorator 5.1.1
defusedxml 0.7.1
dill 0.3.7
docutils 0.18.1
entrypoints 0.4
fastjsonschema 2.18.0
fonttools 4.25.0
fslpy 3.13.3
gradunwarp 1.2.1+hcp
h5py 3.8.0
hcpasl 0.1.4.post1
idna 3.4
imageio 2.31.2
imagesize 1.4.1
importlib-metadata 4.11.3
importlib-resources 5.12.0
ipykernel 6.16.2
ipython 7.34.0
ipywidgets 8.1.0
jedi 0.19.0
Jinja2 3.1.2
joblib 1.3.2
jsonschema 4.17.3
jupyter_client 7.4.9
jupyter_core 4.12.0
jupyterlab-pygments 0.2.2
jupyterlab-widgets 3.0.8
kiwisolver 1.4.4
lxml 4.9.3
MarkupSafe 2.1.1
matplotlib 3.5.3
matplotlib-inline 0.1.6
mistune 3.0.1
mkl-fft 1.3.1
mkl-random 1.2.2
mkl-service 2.4.0
munkres 1.1.4
nbclient 0.7.4
nbconvert 7.6.0
nbformat 5.8.0
nbparameterise 0.5
nest-asyncio 1.5.7
nibabel 3.2.2
nilearn 0.10.1
nose 1.3.7
numpy 1.21.6
packaging 21.3
pandas 1.3.5
pandocfilters 1.5.0
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 9.4.0
pip 22.3.1
pkgutil_resolve_name 1.3.10
platformdirs 3.10.0
ply 3.11
pooch 1.7.0
prompt-toolkit 3.0.39
psutil 5.9.5
ptyprocess 0.7.0
pycparser 2.21
pydicom 2.4.2
pyfab 0.3.5
Pygments 2.11.2
pyOpenSSL 23.0.0
pyparsing 3.0.9
PyQt5-sip 12.11.0
pyrsistent 0.19.3
PySocks 1.7.1
python-dateutil 2.8.2
pytz 2022.7
pyvista 0.38.6
pyzmq 25.1.1
regtricks 0.3.6
requests 2.28.1
rpy2 2.8.6
scikit-learn 1.0.2
scipy 1.7.3
scooby 0.7.2
setuptools 65.6.3
sip 6.6.2
six 1.16.0
snowballstemmer 2.2.0
soupsieve 2.4.1
Sphinx 5.0.2
sphinxcontrib-applehelp 1.0.2
sphinxcontrib-devhelp 1.0.2
sphinxcontrib-htmlhelp 2.0.0
sphinxcontrib-jsmath 1.0.1
sphinxcontrib-qthelp 1.0.3
sphinxcontrib-serializinghtml 1.1.5
threadpoolctl 3.1.0
tinycss2 1.2.1
toblerone 0.8.2
toml 0.10.2
tornado 6.2
tqdm 4.66.1
traitlets 5.9.0
trimesh 3.23.5
typing_extensions 4.7.1
urllib3 1.26.8
vtk 9.2.6
wcwidth 0.2.6
webencodings 0.5.1
wheel 0.38.4
widgetsnbextension 4.0.8
zipp 3.11.0
@demsarjure Jure, which way do you prefer to install the python modules? I can provide you with a list output by pip list in the container, or do you want me to share the singularity with you and just let you play with it?
I think I have all the information I need, so you do not need to do anything I believe :). Based on the Python scripts in this pipeline I feel like the only thing we need is XGBoost?
That could be true, but you may also want to check whether the models in joblib format (from this PR) can be properly loaded after you install the XGBoost module in your container. Otherwise, you may have to adjust the package versions to match what I used to generate these model weights:
To check if the models can be loaded properly, here's a start for the code:
# These imports make sure the packages whose classes are referenced by the
# pickled models are installed and importable before unpickling.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier

# Load one of the pretrained reclean models shipped in this PR
model = joblib.load('ICAFIX/rclean_models/Xgboost.joblib')
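If loading fails or prints unpickling warnings, a quick way to check the installed versions against the training environment (a minimal check; compare against the versions used to train the models) is:
# print the versions of the packages involved in unpickling the models
import sklearn, xgboost, joblib, numpy
for name, mod in [("scikit-learn", sklearn), ("xgboost", xgboost), ("joblib", joblib), ("numpy", numpy)]:
    print(name, mod.__version__)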
Regarding all this PythonSingularity stuff: I understand why @cyang31 used a container while developing this, but do we want/need to support that going forward in HCPpipelines itself? Personally, it would seem easier to me to maintain this Reclean pipeline going forward if we eliminate this "dev" container entirely, and just tell users to run it in the context of qunex.
No we are not requiring users to run QuNex to run the HCP Pipelines.
Sure, but this pipeline is fundamentally different, and would require them to obtain and use a different container instead. Given that, why not just make qunex the sole supported container?
Because QuNex elected to completely rewrite the usage for the HCP Pipelines and has its own group of developers and interests. Running a container as a part of HCP Pipelines is not at all a big deal. This will likely be required for the temporal ICA classifier as well.
Alex if there are actually only a small number of python dependencies, please make a non-container mode for this as well.
Running it might not be a big deal per se, but I think that over time maintaining it might be an issue. And then there's the issue of how to distribute it, which would be less of an issue if it weren't so large. Does it really need to be 11 GB? (Can't it just be a lightweight container with the minimal necessary set of python modules?)
Matt, the reclean pipeline needs only a small number of python dependencies, but the temporal ICA classifier needs a lot more. I see Mike's point. I think it should be doable to match the python environment of the qunex singularity container with the singularity container I have for reclean and temporal ICA. In that case, we would only need one container to run all the python-related pipelines.
And can the size of that be much smaller than 11 GB?
Yes we should have at most only one container, but instructions for a non-container version and an option for that would also be appreciated.
Note that the singularity container is used on my end for BOTH sICA reclean/retraining and tICA classification/classifier retraining. After checking the container I developed, the 11GB size can be broken down into these main pieces:
Here are the limits on file sizes: https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-git-large-file-storage
Not sure which apply to us, but I don't think we are using the Free version as I get a bill every so often.
That's why I think a WUSTL Box or OneDrive link to share the singularity container might be a better choice.
Are there lightweight versions of any of the things that you listed that would be smaller in size, but satisfy what is needed? e.g., could we get by with a subset of the nvidia stuff? Does the default PyTorch come with stuff like pre-defined models that we don't need?
The CUDA stuff might be very hard to shrink. A workaround strategy would be: we don't provide the GPU interface, or the ability for users to retrain the classifiers in GPU mode. In that case, we can eliminate all the CUDA stuff and install a CPU-only torch instead of a CPU+GPU torch, which should decrease the total size a lot by removing the 2.6GB of CUDA and part of the 3.5GB torch. But then users can only run inference with the model weights on their end, not retraining. That's a tradeoff which needs discussion if we want to take this solution.
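For reference, a CPU-only PyTorch build can typically be installed from the official CPU wheel index, which avoids pulling in the bundled CUDA libraries:
# example: install a CPU-only torch wheel (no CUDA runtime bundled)
pip install torch --index-url https://download.pytorch.org/whl/cpu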
Is there a retraining option within the Reclean pipeline itself?
No, I haven't integrated it into the pipeline itself. The current training is performed locally on brainmapper.
Where is pytorch even used? In the Python script, you are using sklearn's MLP?
Hi Jure, pytorch is there because the tICA classifier will use features from deep learning models. I think it would be better to use a single container for both the reclean and tICA pipelines.
Yes, sklearn is the main package used for the reclean pipeline.
Makes sense, thanks.
Well, if the pipeline itself isn't going to support retraining, then I see no need for the publicly-released container to support that capability.
I didn't expect my cleanup of a master merge to close this pull request, sorry. It can be recreated based on the same branch names, so I'm not sure exactly why github doesn't want to reopen it.
Ah, I see, I pushed our master over @cyang31, because I forgot to use : notation on the push. I'll sort this out with him.
Thanks @coalsont. Everything is back to normal now.
I don't really like the idea of putting a container image into the repository, period. Maybe a container build script, but not a container image, especially if the container is easy to build (and building won't break in the future by old versions becoming unavailable or updated dependencies being incompatible - we try to minimize the data files we put into the pipeline, and to some extent the amount of updating of compiled matlab binaries, for similar size reasons - remember, the default clone retrieves every commit ever made, which means every pushed version of every large file). The other pipelines simply expect/check that the expected tools are available in the current environment, and so the environment setup (including the container launch, if applicable) needs to be done in advance of starting the pipeline. We already have a dependency on the R-based version of fix, which is probably a less prevalent language than python.
I agree, although I don't know if anyone is (still) advocating putting the container into the repo itself. I'm arguing that we make it as small as possible, but regardless of its size, if we are going to make a pre-built container image available, we should host it somewhere other than in the HCPpipelines repo.
Are there any updates regarding this? What is an approximate timeline?
I am waiting for the outlined issues above to be resolved (mainly the possibility of running the pipeline without singularity, with PythonSingularity == "NONE"). I plan to integrate into QuNex once the last details are resolved and this is merged into master.
@demsarjure Hi Jure, I think @coalsont and @glasserm are caught up with something else and might be available to review this PR next week. In the meantime, I will create a new commit soon to include 1. instructions to install the necessary packages in conda, and 2. the functionality of using PythonSingularity == "NONE". I assume, hopefully, the end of next week will be a reasonable time to get this PR merged.
Have you tried to load pretrained models using the qunex container?
Jure I don't think there are likely to be major changes re QuNex. We need a mode that is not singularity inside of singularity. We need the code to build the singularity container in the repository. We should provide two options: Run in OS (what QuNex will also use) and a singularity container, which we could host somewhere other than GitHub. We should configure brainmappers to work with Run in OS.
OK, I will start integrating into QuNex and testing on the cyang31 branch. So once you merge into master, we are more or less good to go.
@cyang31 Yes, I was able to set up the environment. Three of the models in ICAFIX/rclean_models load without any warnings, but loading XgboostEnsemble.joblib fails with ModuleNotFoundError: No module named 'xgboost_ensemble_classifier'. What is the story here? This is not a module that ships with the xgboost package?
Hi Jure, @demsarjure there is a script in the folder that defines this module.
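(For anyone hitting the same error: a minimal sketch of loading that model, assuming the directory containing xgboost_ensemble_classifier.py is put on the import path first; the path below is only a placeholder:)
# the custom class must be importable before unpickling, so add the directory
# that contains xgboost_ensemble_classifier.py to sys.path (placeholder path)
import sys
sys.path.insert(0, "/path/to/HCPpipelines/ICAFIX/rclean_models")

import joblib
model = joblib.load("/path/to/HCPpipelines/ICAFIX/rclean_models/XgboostEnsemble.joblib")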
Ah, I found it. All models now load properly. Thanks.
The folder name with models is /ICAFIX/rclean_models, is this intentional or is rclean a typo that should be reclean?
Hi Jure, this is intentional. The reason is that in the next release of the Young Adult and Lifespan datasets, we add the substring "rclean" to fMRI dense timeseries that have been through the "sICA reclean" process, and the substring "tclean" to those that have been through the "temporal ICA cleanup" process. So I keep the naming consistent here for the model folder.
For instance,
The latest update about the reclean pipeline:
- Volume + MSMSulc + MSMAll will all be cleaned
- a "rclean" substring will be added to the recleaned files
- it now supports 1) a singularity call of the python command, and 2) a python command from the native environment (installed via conda as instructed in ICAFIX/README.md)
- an example script demonstrates the usage: Examples/Scripts/ApplyAutoRecleanPipelineBatch.sh
Can we use 1 as the native environment and 2 as the singularity call? Thanks, Matt.
Hi Matt, the choice of the call type is not selected by 1 or 2; it is controlled by the argument --python-singularity. If this argument is not specified, the pipeline runs in the native environment with whatever python interpreter is specified by Tim's newly suggested --python-interpreter, or with the default native python if that is not specified either.
If we internally support singularity, we might add support for something else in the future, making this option no longer binary. So it should probably be named --python-mode, with initial possibilities of native or singularity. Using numbers to represent them may not be the best choice, as strings would be more obvious.
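A minimal sketch of what that dispatch could look like (the option and variable names here are illustrative, not the final interface):
# illustrative only: dispatch on a hypothetical --python-mode value
case "$PythonMode" in
    native)
        pythonCmd=(python3)
        ;;
    singularity)
        pythonCmd=(singularity exec --bind /media "$PythonSingularity" python3)
        ;;
    *)
        echo "ERROR: unrecognized --python-mode: $PythonMode" >&2
        exit 1
        ;;
esac
"${pythonCmd[@]}" "$HCPPIPEDIR/ICAFIX/scripts/RecleanClassifierInference.py" # ...plus the reclean arguments shown earlier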
Background
The sICA+FIX used for the official release of HCP-YA, HCP-D and HCP-A has imperfect component predictions. FIX also does not scale efficiently when applied to larger datasets due to the limitations imposed by the SVM classifiers. A reclean pipeline aims to correct these predictions based on several lightweight base learners trained on large-scale sICA components.
Main workflow
- ICAFIX/ApplyAutoRecleanPipeline.sh
  - ICAFIX/ApplyAutoReclean.sh computes additional sICA features using ICAFIX/scripts/computeRecleanFeatures.m, runs inference using the pretrained models via ICAFIX/scripts/RecleanClassifierInference.py, and generates the reclassification files for signal and artifacts;
  - ICAFIX/ApplyHandReClassifications.sh applies the new reclassification files from the better models;
  - ICAFIX/ReApplyFixPipeline.sh runs the cleanup stage.
Pretrained models
The pretrained models are saved under the folder ICAFIX/rclean_models, in python joblib format.
Evaluation
Tests were run for a single-run reclean and a multi-run reclean, and both finished smoothly.
SR (single-run):
MR (multi-run):
Next step