The `catalyst-bubble-detection` project aims to use machine learning (ML) to detect bubbles in high-throughput microscope images of catalyst surfaces. This repo contains the full ML pipeline for training, testing, evaluating, and optimizing bubble-detection models. The goal of this work is to screen catalysts and select the optimal one for a given gaseous reaction (e.g. the oxygen evolution and oxygen reduction reactions).
pip install -r requirements.txt
Installing the requirements pulls in PyTorch 1.8.2 and torchvision 0.9.2 (the LTS release) along with the other required packages.
This package is designed to be used with labeled data from V7 Labs' Darwin platform, which saves annotations in the V7 JSON format. Data should be organized into `images` and `json` folders as follows:
root/
├─ images/
│ ├─ image1.jpg
│ ├─ image2.jpg
├─ json/
│ ├─ image1.json
│ ├─ image2.json
Names for images and their associated labels must be the same, with the exception of the file extension.
If using a 3D semantic segmentation model (i.e. `unet` or `fcdensenet`), the images in a volume are expected to be in order when sorted lexicographically. Separate volumes should be placed in separate subfolders. Images should be in an `Im` subdirectory and labels in an `L` subdirectory, and label files should have an extra `_L` appended to their file name. For instance:
root/
├─ vol1/
│ ├─ Im/
│ │ ├─ zslice01.png
│ │ ├─ zslice02.png
│ │ ├─ zslice03.png
│ ├─ L/
│ │ ├─ zslice01_L.png
│ │ ├─ zslice02_L.png
│ │ ├─ zslice03_L.png
├─ vol2/
│ ├─ Im/
│ │ ├─ zslice01.png
│ │ ├─ zslice02.png
│ │ ├─ zslice03.png
│ ├─ L/
│ │ ├─ zslice01_L.png
│ │ ├─ zslice02_L.png
│ │ ├─ zslice03_L.png
python main.py --mode gen_targets train --config configs/test_config.json
Configurations change the behavior and settings of the application and can be set in two ways: via command-line flags or via configuration files.
Generally, configurations should contain some key parameters:
- `root`: The root directory containing the data to be used.
- `name`: The name of the model (the model file will be saved automatically to `saved/<model_name>.pth`).
- `gpu`: The GPU to use when running the model. Use -1 for CPU-only training.
- `model`: The type of model to use. See Available Models for a list of options.

Additionally, each run should have one or more modes given as a command-line argument, as in the sketch below. The most commonly used modes are `train`, `apply`, `evaluate`, and `optimize`. See Detailed Usage for more information on available modes.
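As a minimal illustration, a run setting these key parameters directly on the command line might look like the following (the model name `demo_frcnn` is just a placeholder):

python main.py --root data/bubbles --name demo_frcnn --gpu 0 --model faster_rcnn --mode train evaluate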
--root ROOT Root directory for the dataset. Default: data/bubbles
--lr [LR ...] Learning rate for the optimizer. Default: 0.0001
--batch_size [BATCH_SIZE ...]
Batch size. Default: 1
--epoch [EPOCH ...] Number of epochs to train for. Default: 10
--gpu GPU Device ordinal for CUDA GPU. Set to -1 to run on CPU only. Default: -1
--amp Use mixed precision (reduces VRAM requirements). Default: False
--opt [OPT ...] Optimizer to train with. Either 'adam' or 'sgd'. Default: adam
--momentum [MOMENTUM ...]
Momentum for SGD. Default: 0.9
--test_dir TEST_DIR Directory to test model on. Default: test
--save SAVE Directory to write saved model weights to. Subdirectories will be created for HPO runs. Default: saved
--mode MODE [MODE ...]
Mode to run. Available modes include 'train', 'apply', 'evaluate', 'optimize', 'gen_labels', 'augment',
'gen_targets', 'image2npy', 'metrics', and 'stitch'
--mp Whether or not to use multiprocessing. Only use if you are training with multiple GPUs or multiple nodes. Default:
False
--nodes NODES Number of nodes for multiprocessing. Only necessary if mp is set.
--nr NR Index of the current node's first rank. Only necessary if mp is set.
--transforms [TRANSFORMS ...]
Which augmentations to use. Choose some combination of 'vertical_flip', 'horizontal_flip', and 'rotation'. Default:
['rotation', 'horizontal_flip', 'vertical_flip']
--model [MODEL ...] Which model architecture to use. Available models include 'mask_rcnn', 'faster_rcnn', 'retina_net', and
'simclr_faster_rcnn'. Default: faster_rcnn
--config CONFIG Location of a full configuration file. This contains options for a given run.
--json_dir JSON_DIR Directory containing several JSON configuration files.
--num_images NUM_IMAGES
Number of images to train on
--checkpoint CHECKPOINT
Filepath to a set of saved weights (pth file) to load
--no-prompt Skip prompt for checkpoint loading. Default: False
--num_samples NUM_SAMPLES
Number of trials to run for HPO (only relevant if mode is 'optimize')
--resume_hpo Whether or not to resume an HPO trial
--jobs JOBS Number of cores to use for parsl
--data_workers DATA_WORKERS
Number of processes to run for dataloading
--name NAME Name of the model
--graph Whether or not to save a plot of the results
--run_dir RUN_DIR Directory to use for parsl
--loss [LOSS ...] Loss function to use.
--version VERSION Version of the network (for semantic compatibility)
--patience PATIENCE Number of epochs to wait without improvement before stopping early
--image_size IMAGE_SIZE IMAGE_SIZE
Size of images processed as x y (necessary for evaluate re-stitching)
--num_patches NUM_PATCHES [NUM_PATCHES ...]
Number of patches each slice is split into. Set to 0 for automatic choice based on patch and image size.
--overlap_size [OVERLAP_SIZE ...]
Width/height of the overlap between each patch
--slices [SLICES ...]
Number of slices to take at a time
--no-overlay On evaluate, create full stitched images rather than overlays (if applicable)
--collect Whether or not to manually garbage collect each input
--patching_mode [PATCHING_MODE ...]
Mode with which to patch images. Either "grid" or "random"
--clear_predictions Clear prediction patches during evaluate
--tar Tar output when done
--mask MASK Mask to apply when getting metrics
--train_set TRAIN_SET
Dataset to use only as train
--test_set TEST_SET Dataset to use only as test
--val_set VAL_SET Dataset to use only as validation
--blocks BLOCKS Number of encoding blocks to use in each architecture
--patch_size [PATCH_SIZE ...]
Size of patches to use for target generation. Set to -1 to disable patching
--data_split SPLIT SPLIT SPLIT
3 floats for test validation train split of data in image2npy (in order test, validation, train)
--gamma [GAMMA ...] Factor by which to decrease the LR every 3 epochs
--imagenet_stats Use ImageNet stats instead of dataset stats for normalization
--stats_file STATS_FILE
JSON file to read stats from
--video VIDEO Video file to perform inference on
--video_dir VIDEO_DIR
Directory containing videos. Recursively searches for videos
--simclr_checkpoint SIMCLR_CHECKPOINT
Weights for SimCLR pre-trained ResNet
--augment_out AUGMENT_OUT
Output directory for augment mode
Any flags can also be included in a configuration file. The configuration file is a JSON file with key-value pairs for each flag and its associated value. This can then be used with the `--config` flag at runtime. For instance, an example `maskrcnn_test.json` config may contain the following:
{
"root": "data/pristine-full",
"model": "mask_rcnn",
"aug": true,
"lr": 1e-04,
"opt": "adam",
"transforms": ["horizontal_flip", "gray_balance_adjust", "blur", "sharpness", "vertical_flip", "rotation"],
"epochs": 15,
"gpu": 0,
"amp": true,
"batch_size": 8,
"patch_size": 500,
"prompt": false,
"jobs": 8,
"name": "maskrcnn_test"
}
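This config could then be used to train a model with (assuming the file is saved under `configs/`):

python main.py --config configs/maskrcnn_test.json --mode train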
The general workflow starts with preprocessing by generating target data, then training the model, and finally evaluating or applying (i.e. running inference with) the model.
Generally, most configuration parameters for all modes needed for a specific run are collected in one configuration file. We provide a sample configuration file in `configs/sample.json` that demonstrates many of the different configuration options. A run using this configuration would then look like:
python main.py --config configs/sample.json --mode <mode or list of modes>
For improved results, we recommend using hyperparameter optimization (HPO) to find the optimal parameters for a given run. A sample configuration file demonstrating the different options for HPO is provided in `configs/sample_HPO.json`, which can then be run using:
python main.py --config configs/sample_HPO.json --mode optimize
The `gen_targets` mode is used to generate target data for model training: images and .npy files containing the corresponding bubble masks and boxes.
The `splits` configuration option, a list of 3 floats, determines the [test, validation, train] split of the data. It defaults to [0.1, 0.2, 0.7], and the images in each split are chosen randomly.
To designate specific images for use in the train, test, and validation sets, place these images in separate folders (each containing `images` and `json` folders) and use the `train_set`, `test_set`, and `val_set` configuration options, as sketched below.
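For example, fixed per-set directories can be given in a config file like so (the paths are illustrative):

{
"train_set": "data/bubbles-train",
"test_set": "data/bubbles-test",
"val_set": "data/bubbles-val"
}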
`gen_targets` uses multiprocessing for accelerated runs. The number of parallel workers can be specified using the `jobs` configuration option.
Images can also be patched during this step using the `patch_size` configuration option. This is very useful for large images that may not easily fit in available RAM or GPU VRAM. Omitting the patch size, or setting `patch_size` to -1, leaves images unpatched.
`gen_targets` will often prompt the user, for instance when previously generated data would be overwritten. These prompts can be bypassed with the `--no-prompt` command-line argument or by setting the `prompt` configuration option to `false`.
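A typical non-interactive target-generation run might then look like:

python main.py --config configs/sample.json --mode gen_targets --no-prompt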
For semantic segmentation, generating target data is done using `image2npy` rather than `gen_targets` (see Miscellaneous Mode Options).
The `train` mode is used to train models. Necessary training parameters can be set through the `epoch`, `opt` (optimizer), `lr` (learning rate), and `batch_size` configuration options.
Other options for training parameters include `gamma`, `mp`, `amp`, `momentum`, `patience`, and `collect`.
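For instance, a training run that overrides a few of these parameters on the command line might look like this (the values are illustrative):

python main.py --config configs/sample.json --mode train --lr 0.0005 --batch_size 4 --epoch 20 --opt sgd --momentum 0.9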
Various train-time augmentations can be added to increase the variety of the input data. These are given as a list in the `transforms` configuration option and include:
- `horizontal_flip`
- `vertical_flip`
- `rotation`
- `gray_balance_adjust` (changes brightness, contrast, and saturation)
- `blur` (applies a Gaussian blur)
- `sharpness`
- `saltandpepper` (adds salt-and-pepper noise)
- `log_gamma` (log-gamma contrast adjustment)
- `clahe` (Contrast Limited Adaptive Histogram Equalization)
- `contrast_stretch`
Training can also start from a previous checkpoint file: the `checkpoint` configuration option enables the use of pretrained weights. An example of using a checkpoint is shown in `configs/sample_HPO.json`.
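For instance, resuming from the weights saved by the earlier `maskrcnn_test` run could be configured as follows (the path follows the `saved/<model_name>.pth` convention described above; setting `prompt` to `false` skips the checkpoint-loading prompt):

{
"checkpoint": "saved/maskrcnn_test.pth",
"prompt": false
}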
The `evaluate` mode runs the model on the test set of images and outputs the resulting loss, Intersection over Union (IoU), and mean Average Precision (mAP) scores. `evaluate` is effectively a call to `apply` followed by a call to `metrics`.
The `apply` mode runs the model on a set of unlabeled images. This mode can also be used to apply the model to a video file, where the model is run on each frame separately. Use the `video` configuration option to run on one video, or `video_dir` to run on a directory of videos. When using a video dataset, results are saved to the same directory that contains the video file(s), in a subdirectory named after the video file in question. Within this subdirectory is a subdirectory with the model name, which contains the labeled frames, CSV files with the resulting bubbles, and the reconstructed video with the bubbles labeled.
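For example, to run a trained model on a single video or a folder of videos (the paths are illustrative):

python main.py --config configs/sample.json --mode apply --video experiments/run01.mp4
python main.py --config configs/sample.json --mode apply --video_dir experiments/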
To only obtain metrics for already generated outputs, use the `metrics` mode.
Hyperparameter optimization (HPO) enables the utility to find the combination of parameters, such as learning rate, number of epochs, and batch size, that provides the best model performance. HPO is run using the `optimize` mode and requires minimum and maximum values to search between for some parameters, and a list of options to choose from for others.
When creating an HPO configuration, the following arguments should each have two values, for the minimum and maximum respectively:
- `lr`
- `epoch`
- `momentum`
- `patch_size`
- `batch_size`
- `gamma`
- `loss`
The `model` and `opt` arguments should be lists of different options for models and optimizers, respectively. Augmentations will be sampled from all augmentations listed in the `transforms` option, as in the sketch below.
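A sketch of the relevant portion of an HPO config, with [min, max] pairs for ranged parameters and lists for categorical ones (all values are illustrative; see `configs/sample_HPO.json` for a complete example):

{
"lr": [1e-05, 1e-03],
"epoch": [5, 30],
"batch_size": [1, 8],
"model": ["faster_rcnn", "mask_rcnn"],
"opt": ["adam", "sgd"],
"transforms": ["horizontal_flip", "vertical_flip", "rotation"]
}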
The best option is selected based on the best mAP score. The script generates a subdirectory within the `saved` directory containing the best model, stored as `best.pth`, and the associated configuration file, saved as `best.json`.
Various other modes are available as well:
- `gen_labels`: Generate labels for use in semantic segmentation from V7 Labs instance JSON files.
- `augment`: Apply augmentations to images and export them to the directory specified in the `augment_out` configuration option.
- `image2npy`: The analog of `gen_targets` for semantic segmentation. The first two dimensions of each volume are set through `patch_size`, and the third dimension is set through `slices`.
- `stitch`: Used primarily in semantic segmentation; stitches patches together to create a full image.

The available object detection models are:
- `mask_rcnn`
- `faster_rcnn`
- `retina_net`
- `simclr_faster_rcnn`
`mask_rcnn` and `faster_rcnn` both also support version 1 and version 2 weights. Use the `model_version` argument to switch between them. Version 1 is used by default for backwards compatibility.
The available semantic segmentation models are:
- `unet`
- `fcdensenet`
- `unet2d`

`unet` and `fcdensenet` are 3D implementations.