ML4GW / aframe

Detecting binary black hole mergers in LIGO with neural networks

Small-scale Test #485

Open wbenoit26 opened 5 months ago

wbenoit26 commented 5 months ago

Installation

  1. Clone the repository: git clone git@github.com:ML4GW/aframe.git
  2. Install a local version of miniconda. If you already have one, you can skip this step; note, however, that conda versions newer than 4.x will cause some minor bugs with our pipelining tool, pinto. (This has been solved in Aframev2, but we won't get into that here.) If you don't have a local miniconda, download the installer with wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh, then run bash Miniconda3-py39_4.12.0-Linux-x86_64.sh and follow the prompts.
  3. If not already in the base environment, run conda activate base
  4. Install poetry with python -m pip install "poetry>1.2.0,<1.3.0". Again, later versions of poetry will break pinto, though they should work fine for everything else.
  5. Install pinto with python -m pip install git+https://github.com/ML4GW/pinto@main
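
For reference, here are the installation steps above strung together as one shell sketch (assuming a Linux x86_64 machine and no existing local miniconda; adjust paths and versions to taste):

    # Clone the repository
    git clone git@github.com:ML4GW/aframe.git

    # Install a local miniconda (skip if you already have one)
    wget https://repo.anaconda.com/miniconda/Miniconda3-py39_4.12.0-Linux-x86_64.sh
    bash Miniconda3-py39_4.12.0-Linux-x86_64.sh

    # From the base environment, install the pinned poetry version and pinto
    conda activate base
    python -m pip install "poetry>1.2.0,<1.3.0"
    python -m pip install git+https://github.com/ML4GW/pinto@main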

Setting up to run

  1. cd into aframe/projects/sandbox. There, create a file called .env with the following content:
    BASE_DIR=/home/<username>/aframe/my-first-run
    DATA_DIR=/home/<username>/aframe/data
    HDF5_USE_FILE_LOCKING=FALSE
    LIGO_USERNAME=<username>
    LIGO_GROUP="ligo.dev.o4.cbc.explore.test"
  2. To speed up training and testing, we'll make a few changes to the config file. Open the pyproject.toml file in the sandbox directory and make the following changes:
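
One of the config changes that comes up in the comments below is reducing the number of training signals to 10000; in the file as shipped, the key is num_signals. A quick way to find the relevant setting before editing it (a sketch, making no other assumptions about the file's layout):

    cd aframe/projects/sandbox   # if you aren't already there
    # Locate the signal-count key discussed in the comments (num_signals)
    grep -n "num_signals" pyproject.toml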

Running the pipeline

  1. Ensure that you are on a node with enterprise-level GPUs, such as the DGX nodes at LLO and LHO. If the GPUs are too occupied, you may not be able to use them; you can check the utilization with nvidia-smi. If only some of the GPUs are occupied, you can specify which GPUs you want to use with export CUDA_VISIBLE_DEVICES=<gpu numbers>, e.g., export CUDA_VISIBLE_DEVICES=0,1,3,4,7.
  2. If you have the appropriate conda and poetry versions, you can just run pinto run from the sandbox directory. This will build all the necessary environments and run all of the tasks in a row. If you don't have the correct versions, pinto will be interrupted each time it creates an environment or completes a task; in that case, once a task is finished, you can comment it out of the pipeline (lines 3-9 of pyproject.toml) and run pinto run again. Because creating the environments can take a while on top of running the pipeline itself, it's recommended to run this inside tmux so the run isn't killed if your connection drops. Without environment creation, this took about 40 minutes for me to run on a DGX node.
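
Putting the commands from this step together (the GPU indices and tmux session name are only examples, and this assumes you cloned aframe into your home directory):

    # Check which GPUs are free before claiming any
    nvidia-smi

    # Start a tmux session so the run survives a dropped connection
    tmux new -s aframe

    # Inside the tmux session: restrict the run to unoccupied GPUs,
    # then launch the full pipeline from the sandbox directory
    export CUDA_VISIBLE_DEVICES=0,1
    cd ~/aframe/projects/sandbox
    pinto run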

Examining outputs

  1. All the data used for training and testing the model will be found in /home/<username>/aframe/data. In data/train, there will be a background directory containing the segments of strain data used for training, and a signals.h5 file containing the waveform polarizations, as well as their corresponding parameters. In data/test, there will be a background directory containing the segments of strain data used for testing, as well as two files, rejected-parameters.h5 and waveforms.h5. The former contains parameters of waveforms that were rejected for having SNR < 4. The latter contains the testing waveforms along with their parameters.
  2. The trained model and the results of inference will be found in /home/<username>/aframe/my-first-run. In my-first-run/training, the history.h5 file contains the training loss and validation score of the model for each epoch, and weights.pt contains the weights of the trained model. In my-first-run/infer, background.h5 and foreground.h5 contain the detection statistics corresponding to background events and injected signals, respectively. Also in my-first-run/infer, the sensitive-volume.html file is a plot of the model's sensitive volume in different mass bins, corresponding to the data in sensitive-volume.h5. If all goes well, the plot should look like this. You can ignore the lines for the other pipelines on the plot.
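
If the HDF5 command-line tools are available on the node (h5ls usually ships alongside the HDF5 libraries), you can get a quick look at the files described above without writing any code; substitute your username, and expect the dataset names to be whatever the pipeline wrote:

    # Training data: waveform polarizations and their parameters
    h5ls -r /home/<username>/aframe/data/train/signals.h5

    # Testing data: injected waveforms and the SNR < 4 rejections
    h5ls -r /home/<username>/aframe/data/test/waveforms.h5
    h5ls -r /home/<username>/aframe/data/test/rejected-parameters.h5

    # Run outputs: training history and inference detection statistics
    h5ls -r /home/<username>/aframe/my-first-run/training/history.h5
    h5ls -r /home/<username>/aframe/my-first-run/infer/background.h5
    h5ls -r /home/<username>/aframe/my-first-run/infer/foreground.h5
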
VasSkliris commented 5 months ago

Comments about the instructions: I think it would be useful to say which conda environment the installation should happen in (and also how to check whether miniconda is already there).

The num_waveforms = 10000 setting was already there, but under the name num_signals instead. Is that an issue?

wbenoit26 commented 5 months ago

Each project (datagen, train, export, infer, plots) will have its own environment created. For datagen, a conda environment called aframe-datagen will be created, and poetry will install its packages into that environment. For the others, a poetry virtual environment will be created. These environments will be activated automatically when you do pinto run, so you can stay in the base environment the whole time.
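
If you want to confirm what was created, the conda side is easy to check (aframe-datagen is the name mentioned above; where the poetry virtual environments end up depends on your poetry configuration):

    # aframe-datagen should show up here once the datagen environment exists
    conda env list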

Good point, that's a typo, I should have written num_signals. I'll edit the instructions.

wbenoit26 commented 5 months ago

Oh, and if you already have a local version of conda installed, running which python will most likely point to that directory. If it doesn't point to a local directory, you probably don't have your own version of conda.

VasSkliris commented 5 months ago

Is miniconda needed, then, if someone already has a conda environment? I think that was my first point of confusion.

wbenoit26 commented 5 months ago

Now that you ask, I suppose that you probably don't. I'm just used to having miniconda, so I included it in the instructions. I can't say I've ever tried without it, but I think it would work.