
Blip ML/TDA package for LArTPCs.

Blip


Blip is a collection of machine learning tools for reconstructing, classifying and analyzing low energy (< MeV) interactions in liquid argon time projection chambers (LArTPCs). These interactions leave small point-like signals (commonly referred to as "blips", hence the name). Blip is a Python package which can be installed locally, or on the Wilson cluster, by following the directions below (eventually Blip will be available on the Wilson cluster without the need to install).

Table of Contents

  1. Getting the Repository
  2. Quick Start
  3. Building Blip From Source
  4. Usage
  5. Configuration Files
  6. Running Modules
  7. Event Display
  8. Versioning
  9. Contact (Authors)
  10. Citation
  11. License
  12. Support

Getting the Repository

In the terminal, one can clone this repository by typing the command:

git clone https://personal_username@github.com/Neutron-Calibration-in-DUNE/Blip.git

This uses the HTTPS protocol. For environments (e.g. computing clusters) where one has to use the SSH protocol:

git clone git@github.com:Neutron-Calibration-in-DUNE/Blip.git

Anyone in the "Neutron-Calibration-in-DUNE" organization should be able to develop (push changes to the remote repository).

Please contact Nicholas Carrara or David Rivera about becoming involved in development before merging with the master branch.

Quick Start

There are several run-time parameters that Blip configures at the start:

Parameter        Usage
/local_scratch   directory for data created at run time (log files, checkpoints, model parameters, plots, etc.)
/local_data      directory containing the input data for the module
/local_blip      directory for custom Blip code and config files
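As an illustration only (this is not Blip's actual startup code, and the helper name and default paths are hypothetical), resolving and creating these three locations might look like:

```python
from pathlib import Path

# Hypothetical helper mirroring the three run-time locations Blip expects;
# the base path here is a placeholder, not a Blip default.
def resolve_runtime_dirs(base: str = "/tmp/blip_example") -> dict:
    dirs = {
        "local_scratch": Path(base) / "scratch",  # logs, checkpoints, plots
        "local_data": Path(base) / "data",        # input data for the module
        "local_blip": Path(base) / "code",        # custom code and configs
    }
    for path in dirs.values():
        path.mkdir(parents=True, exist_ok=True)
    return dirs

dirs = resolve_runtime_dirs()
print(sorted(dirs))  # ['local_blip', 'local_data', 'local_scratch']
```

In the container workflows below, these paths are instead provided by mounting host directories at /local_scratch, /local_data and /local_blip.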

Local Docker Install

The easiest way to run Blip is to grab the Docker container. First, install Docker and start it with the following commands:

sudo apt-get update
sudo apt-get install docker.io
sudo systemctl start docker
sudo systemctl enable docker

Then, we can grab the blip container with the following:

docker pull infophysics/blip:latest

To run the image with blip_display and GPUs, a few command-line options must be set:

docker run -it --gpus all -p 5006:5006 infophysics/blip:latest

where --gpus all tells Docker to forward GPU access to the container and -p 5006:5006 maps port 5006 in the container to port 5006 on the host.

To access the container with ssh support from the local host, do the following:

docker run -it --rm -e "USER_ID=$(id -u)" -e GROUP_ID="$(id -g)" \
      -v "$HOME/.ssh:/home/builder/.ssh:rw" \
      -v "$SSH_AUTH_SOCK:/ssh.socket" -e "SSH_AUTH_SOCK=/ssh.socket" \
      --gpus all -p 5006:5006 infophysics/blip:latest

Wilson Cluster Docker Install

The Wilson Cluster (WC) at Fermilab uses the apptainer module for downloading and running containers; instructions for using the module are in the WC documentation. A script for installing Blip using apptainer can be found in the accompanying BlipModels repository. Following the instructions from the WC site, one can set up and download Blip with the following commands:

module load apptainer
export APPTAINER_CACHEDIR=/wclustre/my_project_dir/apptainer/.apptainer/cache
apptainer build /wclustre/my_project_dir/blip.sif docker://infophysics/blip:latest

The container can then be spun up in an interactive node by issuing the command:

apptainer shell --nv /wclustre/my_project_dir/blip.sif

Wilson Cluster Blip jobs

Perlmutter Docker Install

The Perlmutter system at NERSC uses shifter for downloading and running containers; instructions for using shifter are in the NERSC documentation. A script for installing Blip using shifter can be found in the accompanying BlipModels repository. Following the instructions from the Perlmutter site, one can set up and download Blip with the following command:

shifterimg -v pull docker:infophysics/blip:latest

The container can then be spun up in an interactive node by issuing the command:

shifter --image=docker:infophysics/blip:latest bash

Perlmutter Blip Jobs

To run a job using Blip, one simply needs to specify the job parameters in a bash script like the following:

#!/bin/bash
#SBATCH -A dune                 # account to use for the job, '--account', '-A'
#SBATCH -J example              # job name, '--job-name', '-J'
#SBATCH -C gpu                  # type of job (constraint can be 'cpu' or 'gpu'), '--constraint', '-C'
#SBATCH -q shared               # Jobs requiring 1 or 2 gpus should use the shared setting, all others use 'regular'
#SBATCH -t 1:00:00              # amount of time requested for the job, '--time', '-t'
#SBATCH -N 1                    # number of nodes, '--nodes', '-N'
#SBATCH -n 1                    # number of tasks, '--ntasks', '-n'
#SBATCH -c 32                   # number of cores per task, '--cpus-per-task', '-c'
#SBATCH --gpus-per-task=1       # number of gpus to be used per task
#SBATCH --gpus-per-node=1       # number of gpus per node.
#SBATCH --gpu-bind=none         # comment this out if you don't want all gpus visible to each task

# Blip settings
#SBATCH --image=docker:infophysics/blip:latest  
#SBATCH --volume="/pscratch/sd/<first_initial>/<user>:/local_scratch;/global/cfs/cdirs/dune/users/<user>/<custom_blip_code>:/local_blip;/global/cfs/cdirs/dune/users/<user>/<local_data>:/local_data"

shifter arrakis /local_blip/my_config.yaml

The local_scratch, local_blip and local_data volume paths must be written out explicitly in the #SBATCH directives; environment variables are not expanded there, so using them may break the job. The config file is given after the program command, which in this case is arrakis. For development it is recommended to use the -q setting shared rather than regular: regular reserves GPUs/nodes for a single user, which is more costly to experiment budgets and should only be used for final optimizations of models.
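The --volume string is a semicolon-separated list of host:container pairs. As an illustration (the helper function and host paths here are hypothetical placeholders), the string can be composed like this:

```python
# Hypothetical helper that composes a shifter --volume string from
# (host_path, container_path) pairs; the paths below are placeholders.
def volume_flag(mounts):
    return ";".join(f"{host}:{container}" for host, container in mounts)

mounts = [
    ("/pscratch/sd/u/user", "/local_scratch"),
    ("/global/cfs/cdirs/dune/users/user/blip_code", "/local_blip"),
    ("/global/cfs/cdirs/dune/users/user/data", "/local_data"),
]
print(volume_flag(mounts))
```

The resulting string is what must be pasted, fully expanded, into the #SBATCH --volume directive.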

Building Blip From Source

Environment YAML

The easiest way to install is to create a conda environment dedicated to the API using the packages defined in environment_blip.yml:

conda env create -f environment_blip.yml
conda activate blip

You can optionally add the flag -n <name> to specify a name for the environment.

MinkowskiEngine

Due to the large datasets generated from LArTPC data, parts of Blip make use of SparseTensors to be more memory efficient and to speed up overall performance. SparseTensors are handled through the MinkowskiEngine package, which interfaces with PyTorch. With the libopenblas dependency installed, MinkowskiEngine can be installed as follows:

sudo apt-get install libopenblas-dev
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas"

You may need to switch to an older version of GCC for the CUDA compilation to succeed. To do this, install the older version:

sudo apt -y install gcc-11 g++-11
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-11 11
sudo update-alternatives --install /usr/bin/g++ g++ /usr/bin/g++-11 11

You'll then need to select the alternative version:

$ sudo update-alternatives --config gcc

There are 2 choices for the alternative gcc (providing /usr/bin/gcc).

  Selection    Path             Priority   Status
------------------------------------------------------------
* 0            /usr/bin/gcc-12   12        auto mode
  1            /usr/bin/gcc-11   11        manual mode
  2            /usr/bin/gcc-12   12        manual mode

Press <enter> to keep the current choice[*], or type selection number: 1

BLIP

From the main folder of Blip you can run:

pip install .

which should install the API for you.

Wilson Cluster

To install BLIP on the Wilson cluster at FNAL, we first need to set up our conda environment. Due to the limited size of the home directory, we want to tell anaconda to download packages and install Blip in a different directory. Once logged in to the Wilson cluster, load gnu8, openblas, cuda and condaforge:

module load gnu8/8.3.0
module load openblas/0.3.7
module load cuda11/11.8.0
module load condaforge/py39

The default anaconda location for packages and environments is usually the home directory, which has limited space. To check which directories are set, run

conda config --show

which should give an output like the following:

[<user_name>@wc:~:]$ conda config --show
...
envs_dirs:
  - /nashome/<first_letter>/<user_name>/.conda/envs
...
pkgs_dirs:
  - <old_package_directory>
...

Then, tell anaconda to use different directories for packages and environments:

conda config --remove pkgs_dirs <old_package_directory>
conda config --remove envs_dirs <old_env_directory>
conda config --add pkgs_dirs <package_directory>
conda config --add envs_dirs <env_directory>

I've used /wclustre/dune/ as the package/env directory. Then, install blip using the YAML file:

conda env create --prefix <blip_install_directory> -f environment_blip.yml

I've also used /wclustre/dune/blip as the install directory. Once installed, blip can be activated with

conda activate <blip_install_directory>

In order to install MinkowskiEngine with CUDA, we need to set an environment variable that specifies the list of GPU compute architectures supported by the current version of CUDA. On a generic linux system, this can be achieved with a small script:

CUDA_VERSION=$(/usr/local/cuda/bin/nvcc --version | sed -n 's/^.*release \([0-9]\+\.[0-9]\+\).*$/\1/p')
if [[ ${CUDA_VERSION} == 9.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;7.0+PTX"
elif [[ ${CUDA_VERSION} == 9.2* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0+PTX"
elif [[ ${CUDA_VERSION} == 10.* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5+PTX"
elif [[ ${CUDA_VERSION} == 11.0* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0+PTX"
elif [[ ${CUDA_VERSION} == 11.* ]]; then
    export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"
else
    echo "unsupported cuda version."
    exit 1
fi
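For reference, the same version-to-architecture selection performed by the shell script above can be sketched in Python (the function name is illustrative, not part of Blip):

```python
# Python sketch of the CUDA-version -> architecture-list mapping used by
# the shell script above; prefixes are checked in the same order.
def torch_cuda_arch_list(cuda_version: str) -> str:
    table = [
        ("9.0", "3.5;5.0;6.0;7.0+PTX"),
        ("9.2", "3.5;5.0;6.0;6.1;7.0+PTX"),
        ("10.", "3.5;5.0;6.0;6.1;7.0;7.5+PTX"),
        ("11.0", "3.5;5.0;6.0;6.1;7.0;7.5;8.0+PTX"),
        ("11.", "3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"),
    ]
    for prefix, archs in table:
        if cuda_version.startswith(prefix):
            return archs
    raise ValueError(f"unsupported cuda version: {cuda_version}")

print(torch_cuda_arch_list("11.8"))  # 3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX
```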

For our purposes, however, we are using cuda 11.8.0, so we can simply run:

export TORCH_CUDA_ARCH_LIST="3.5;5.0;6.0;6.1;7.0;7.5;8.0;8.6+PTX"

Then, install MinkowskiEngine

conda install openblas
pip install -U git+https://github.com/NVIDIA/MinkowskiEngine -v --no-deps --install-option="--blas_include_dirs=${CONDA_PREFIX}/include" --install-option="--blas=openblas" --install-option="--force_cuda"

Usage

Blip can be used in three different ways:

Modules

There are several programs that run different tasks, such as training a neural network, running a TDA or clustering algorithm, or performing an analysis. Each task is specified by a module_type and a corresponding module_mode. For example, to train a neural network one would set the following in the configuration file:

# example module section
module:
  module_name:  'training_test'
  module_type:  'ml'            # ml, clustering, tda, analysis, ...
  module_mode:  'training'      # training, inference, parameter_scan, ...
  gpu:          True
  gpu_device:   0
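For illustration, this module section can be parsed with PyYAML and dispatched on; the dispatch logic below is a hypothetical sketch, not Blip's actual loader:

```python
import yaml  # PyYAML

# The same module section shown above, embedded as a string for the sketch.
config_text = """
module:
  module_name: 'training_test'
  module_type: 'ml'
  module_mode: 'training'
  gpu: True
  gpu_device: 0
"""

config = yaml.safe_load(config_text)
module = config["module"]

# Hypothetical dispatch on the module type and mode.
if module["module_type"] == "ml" and module["module_mode"] == "training":
    device = f"GPU {module['gpu_device']}" if module["gpu"] else "CPU"
    print(f"would launch training run '{module['module_name']}' on {device}")
```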

Event Display

Using Custom Code with Blip

Many of the classes in Blip are built from an abstract class with the prefix 'Generic'. Any user can inherit from these classes, making sure to override the required functions. These custom classes can then be loaded into Blip at runtime by specifying the python files in the appropriate config section.
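As a sketch of the pattern (the class and method names here are hypothetical stand-ins, not Blip's actual API), a custom class might look like:

```python
from abc import ABC, abstractmethod

# Hypothetical stand-in for one of Blip's 'Generic' abstract base classes;
# the real class names and required overrides live in the Blip package.
class GenericModel(ABC):
    def __init__(self, name: str):
        self.name = name

    @abstractmethod
    def forward(self, x):
        """Required override: process one input."""

class MyCustomModel(GenericModel):
    """User-defined subclass that Blip could load from a python file at runtime."""
    def forward(self, x):
        return 2 * x  # placeholder logic

model = MyCustomModel("my_custom_model")
print(model.forward(3))  # 6
```

The python file defining such a subclass is then pointed to from the relevant config section.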

Configuration Files

Running Modules

Machine Learning

Clustering and Manifold Learning

Topological Data Analysis

Analysis

Using the Event Display

Versioning

For the versions available, see the tags on this repository.

Contact (Authors)

If you have questions, please contact Nicholas Carrara, nmcarrara@ucdavis.edu.

See also the list of contributors who participate in this project.

See AUTHORS.md for information on the developers.

Support

Citation

When you use Blip, please say so in your slides or publications (for publications, see the Zenodo link above). This helps us obtain funding to support the project.