IBM / mi-prometheus

Enabling reproducible Machine Learning research
http://mi-prometheus.rtfd.io/
Apache License 2.0

Machine Intelligence: Prometheus

Bringing (Py)Torch To Mankind


Description

MI-Prometheus (Machine Intelligence - Prometheus) is an open-source framework that aims to accelerate Machine Learning research by fostering the rapid development of diverse neural network-based models and facilitating their comparison. At its core, to accelerate the computations themselves, MI-Prometheus relies on PyTorch and makes extensive use of its mechanisms for distributing computations across CPUs and GPUs.

In MI-Prometheus, the training & testing mechanisms are no longer pinned to a specific model or problem, and built-in mechanisms for easy configuration management & statistics collection facilitate running experiments combining different models with problems.

A project of the Machine Intelligence team, IBM Research, Almaden.

Installation

PyTorch is the main library used by MI-Prometheus for tensor computations. Please refer to the official PyTorch installation guide to install it. We currently do not officially support PyTorch >= v0.4.1 (especially the v1.0 preview), but intend to in the near future.
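
The version constraint above can be checked with a few lines of Python before installing MI-Prometheus. This is a minimal sketch and not part of MI-Prometheus: the helper name `is_unsupported_pytorch` is hypothetical, and it works on a plain version string (for example the value of `torch.__version__`) so that it runs even when PyTorch is not installed.

```python
import re

# First PyTorch release that is not officially supported, per the note above.
FIRST_UNSUPPORTED = (0, 4, 1)


def is_unsupported_pytorch(version: str) -> bool:
    """Return True if `version` is 0.4.1 or newer.

    Only the leading numeric "major.minor[.patch]" part is compared, so
    suffixes such as "a0" or ".post2" are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    major, minor, patch = (int(part or 0) for part in match.groups())
    return (major, minor, patch) >= FIRST_UNSUPPORTED


if __name__ == "__main__":
    for candidate in ("0.4.0", "0.4.1", "1.0.0a0"):
        status = "not officially supported" if is_unsupported_pytorch(candidate) else "ok"
        print(f"PyTorch {candidate}: {status}")
```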

The recommended install procedure below assumes the creation of a new Anaconda environment.

  1. Install PyTorch 0.4.0.

    With CUDA support:

    conda install pytorch=0.4.0 cuda90 -c pytorch # For CUDA 9

    Or CPU only:

    conda install pytorch-cpu=0.4.0 cpuonly -c pytorch
  2. Install PyYAML from Anaconda.
    conda install pyyaml
  3. Install MI-Prometheus
    python setup.py install

    Or, if you are a developer, run the following command instead:

    python setup.py develop

    This will enable you to change the code of the existing problems/models/workers and run them by calling the mip-* commands. More on this subject can be found in the setuptools documentation.

We mainly develop on Ubuntu 16.04, but MI-Prometheus should work on macOS (10.14) as well.

We will upload MI-Prometheus to PyPI in the near future.

The dependencies of MI-Prometheus are:

Core ideas

Core features

Workers

The workers are the main way you will use MI-Prometheus. They are parameterizable, object-oriented scripts, each of which executes a specific task related to the supervised training or testing of a Model on a Problem, following a Configuration.
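
To illustrate the idea of a worker that is not tied to a particular Model or Problem, here is a minimal, framework-free sketch. All names in it (`Problem`, `Model`, `DoublingProblem`, `LinearModel`, `run_training`, and the `config` keys) are hypothetical and do not reflect the actual MI-Prometheus classes; a toy one-parameter model stands in for a PyTorch network.

```python
from typing import Dict, Iterator, Protocol, Tuple


class Problem(Protocol):
    """Anything that can generate (input, target) samples."""

    def samples(self, count: int) -> Iterator[Tuple[float, float]]: ...


class Model(Protocol):
    """Anything that can predict and be updated from an error signal."""

    def predict(self, x: float) -> float: ...

    def update(self, x: float, error: float, lr: float) -> None: ...


class DoublingProblem:
    """Toy problem: the target is always twice the input."""

    def samples(self, count: int) -> Iterator[Tuple[float, float]]:
        for i in range(count):
            x = float(i % 5 + 1)
            yield x, 2.0 * x


class LinearModel:
    """Toy model: y = w * x, trained with plain gradient descent."""

    def __init__(self) -> None:
        self.w = 0.0

    def predict(self, x: float) -> float:
        return self.w * x

    def update(self, x: float, error: float, lr: float) -> None:
        # Gradient of 0.5 * error**2 with respect to w is error * x.
        self.w -= lr * error * x


def run_training(model: Model, problem: Problem, config: Dict[str, float]) -> float:
    """Generic loop: works with any Model/Problem pair, driven by `config`."""
    last_loss = 0.0
    for x, target in problem.samples(int(config["episodes"])):
        error = model.predict(x) - target
        last_loss = 0.5 * error ** 2
        model.update(x, error, config["learning_rate"])
    return last_loss


if __name__ == "__main__":
    model = LinearModel()
    loss = run_training(model, DoublingProblem(), {"episodes": 200, "learning_rate": 0.05})
    print(f"learned weight: {model.w:.4f}, last loss: {loss:.2e}")
```

The training loop only relies on the two small interfaces, so swapping in a different model or problem does not require touching `run_training`.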

foo@bar:~$ mip-offline-trainer --h
usage: mip-offline-trainer [-h] [--config CONFIG] [--model MODEL] [--gpu]
                           [--outdir OUTDIR] [--savetag SAVETAG]
                           [--ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                           [--li LOGGING_INTERVAL] [--agree]
                           [--tensorboard {0,1,2}] [--visualize {-1,0,1,2,3}]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Name of the configuration file(s) to be loaded. If specifying more than one file, they must be separated with coma ",".
  --model MODEL         Path to the file containing the saved parameters of the model to load (model checkpoint, should end with a .pt extension.)
  --gpu                 The current worker will move the computations on GPU devices, if available in the system. (Default: False)
  --outdir OUTDIR       Path to the output directory where the experiment(s) folders will be stored. (DEFAULT: ./experiments)
  --savetag SAVETAG     Tag for the save directory
  --ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Log level. (Default: INFO)
  --li LOGGING_INTERVAL
                        Statistics logging interval. Will impact logging to the logger and exporting to TensorBoard. Writing to the csv file is not impacted (interval of 1). (Default: 100, i.e. logs every 100 episodes).
  --agree               Request user confirmation just after loading the settings, before starting training  (Default: False)
  --tensorboard {0,1,2}
                        If present, enable logging to TensorBoard. Available log levels:
                        0: Log the collected statistics.
                        1: Add the histograms of the model's biases & weights (Warning: Slow).
                        2: Add the histograms of the model's biases & weights gradients (Warning: Even slower).
  --visualize {-1,0,1,2,3}
                        Activate dynamic visualization (Warning: will require user interaction):
                        -1: disabled (DEFAULT)
                        0: Only during training episodes.
                        1: During both training and validation episodes.
                        2: Only during validation episodes.
                        3: Only during the last validation, after the training is completed.
foo@bar:~$ mip-online-trainer --h
usage: mip-online-trainer [-h] [--config CONFIG] [--model MODEL] [--gpu]
                          [--outdir OUTDIR] [--savetag SAVETAG]
                          [--ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                          [--li LOGGING_INTERVAL] [--agree]
                          [--tensorboard {0,1,2}] [--visualize {-1,0,1,2,3}]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Name of the configuration file(s) to be loaded. If specifying more than one file, they must be separated with coma ",".
  --model MODEL         Path to the file containing the saved parameters of the model to load (model checkpoint, should end with a .pt extension.)
  --gpu                 The current worker will move the computations on GPU devices, if available in the system. (Default: False)
  --outdir OUTDIR       Path to the output directory where the experiment(s) folders will be stored. (DEFAULT: ./experiments)
  --savetag SAVETAG     Tag for the save directory
  --ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Log level. (Default: INFO)
  --li LOGGING_INTERVAL
                        Statistics logging interval. Will impact logging to the logger and exporting to TensorBoard. Writing to the csv file is not impacted (interval of 1). (Default: 100, i.e. logs every 100 episodes).
  --agree               Request user confirmation just after loading the settings, before starting training  (Default: False)
  --tensorboard {0,1,2}
                        If present, enable logging to TensorBoard. Available log levels:
                        0: Log the collected statistics.
                        1: Add the histograms of the model's biases & weights (Warning: Slow).
                        2: Add the histograms of the model's biases & weights gradients (Warning: Even slower).
  --visualize {-1,0,1,2,3}
                        Activate dynamic visualization (Warning: will require user interaction):
                        -1: disabled (DEFAULT)
                        0: Only during training episodes.
                        1: During both training and validation episodes.
                        2: Only during validation episodes.
                        3: Only during the last validation, after the training is completed.
foo@bar:~$ mip-tester --h
usage: mip-tester [-h] [--config CONFIG] [--model MODEL] [--gpu]
                  [--outdir OUTDIR] [--savetag SAVETAG]
                  [--ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                  [--li LOGGING_INTERVAL] [--agree] [--visualize]

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Name of the configuration file(s) to be loaded. If specifying more than one file, they must be separated with coma ",".
  --model MODEL         Path to the file containing the saved parameters of the model to load (model checkpoint, should end with a .pt extension.)
  --gpu                 The current worker will move the computations on GPU devices, if available in the system. (Default: False)
  --outdir OUTDIR       Path to the output directory where the experiment(s) folders will be stored. (DEFAULT: ./experiments)
  --savetag SAVETAG     Tag for the save directory
  --ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Log level. (Default: INFO)
  --li LOGGING_INTERVAL
                        Statistics logging interval. Will impact logging to the logger and exporting to TensorBoard. Writing to the csv file is not impacted (interval of 1). (Default: 100, i.e. logs every 100 episodes).
  --agree               Request user confirmation just after loading the settings, before starting training  (Default: False)
  --visualize           Activate dynamic visualization
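
The options listed above can also be assembled programmatically, for instance when launching a worker from another script. The sketch below only builds the argument list and does not run anything. The helper `build_worker_command` and the configuration file names are hypothetical; the flags are the ones shown in the help output above.

```python
from typing import List, Optional, Sequence


def build_worker_command(
    worker: str,
    config_files: Sequence[str],
    model_checkpoint: Optional[str] = None,
    use_gpu: bool = False,
    outdir: str = "./experiments",
    tensorboard: Optional[int] = None,
) -> List[str]:
    """Build an argv list for a mip-* worker from the documented flags."""
    # This sketch insists on at least one configuration file.
    if not config_files:
        raise ValueError("At least one configuration file is required.")
    # Several configuration files are passed as one comma-separated value.
    command = [worker, "--config", ",".join(config_files), "--outdir", outdir]
    if model_checkpoint is not None:
        command += ["--model", model_checkpoint]
    if use_gpu:
        command.append("--gpu")
    if tensorboard is not None:
        if tensorboard not in (0, 1, 2):
            raise ValueError("--tensorboard accepts 0, 1 or 2.")
        command += ["--tensorboard", str(tensorboard)]
    return command


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(build_worker_command(
        "mip-offline-trainer",
        ["default.yaml", "my_model.yaml"],
        use_gpu=True,
        tensorboard=0,
    ))
```

Once MI-Prometheus is installed, the resulting list can be handed to `subprocess.run`. Note that `--tensorboard` is listed for the two trainers but not for `mip-tester`, so it should be left unset for the latter.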

Grid workers

Grid Workers manage several experiments ("grids") by reusing the workers, such as OfflineTrainer & Tester. There are 3 types of Grid Workers:

foo@bar:~$ mip-grid-trainer-cpu --h
usage: mip-grid-trainer-cpu [-h] [--outdir OUTDIR] [--savetag SAVETAG]
                           [--ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                           [--li LOGGING_INTERVAL] [--agree] [--config CONFIG]
                           [--online_trainer] [--tensorboard {0,1,2}]

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR       Path to the global output directory where the experiments folders will be / are stored. Affects all grid experiments. (DEFAULT: ./experiments)
  --savetag SAVETAG     Additional tag for the global output directory.
  --ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Log level for the experiments. (Default: INFO)
  --li LOGGING_INTERVAL
                        Statistics logging interval. Will impact logging to the logger and exporting to TensorBoard for the experiments. Do not affect the grid worker. Writing to the csv file is not impacted (interval of 1). (Default: 100, i.e. logs every 100 episodes).
  --agree               Request user confirmation before starting the grid experiment.  (Default: False)
  --config CONFIG       Name of the configuration file(s) to be loaded. If specifying more than one file, they must be separated with coma ",".
  --online_trainer      Select the OnLineTrainer instead of the default OffLineTrainer.
  --tensorboard {0,1,2}
                        If present, enable logging to TensorBoard. Available log levels:
                        0: Log the collected statistics.
                        1: Add the histograms of the model's biases & weights (Warning: Slow).
                        2: Add the histograms of the model's biases & weights gradients (Warning: Even slower).
foo@bar:~$ mip-grid-tester-cpu --h
usage: mip-grid-tester-cpu [-h] [--outdir OUTDIR] [--savetag SAVETAG]
                          [--ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                          [--li LOGGING_INTERVAL] [--agree] [--n NUM_TESTS]

optional arguments:
  -h, --help            show this help message and exit
  --outdir OUTDIR       Path to the global output directory where the experiments folders will be / are stored. Affects all grid experiments. (DEFAULT: ./experiments)
  --savetag SAVETAG     Additional tag for the global output directory.
  --ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Log level for the experiments. (Default: INFO)
  --li LOGGING_INTERVAL
                        Statistics logging interval. Will impact logging to the logger and exporting to TensorBoard for the experiments. Do not affect the grid worker. Writing to the csv file is not impacted (interval of 1). (Default: 100, i.e. logs every 100 episodes).
  --agree               Request user confirmation before starting the grid experiment.  (Default: False)
  --n NUM_TESTS         Number of test experiments to run for each model.
foo@bar:~$ mip-grid-analyzer --h
usage: mip-grid-analyzer [-h] [--expdir EXPDIR]
                         [--ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}]
                         [--li LOGGING_INTERVAL] [--agree]

optional arguments:
  -h, --help            show this help message and exit
  --expdir EXPDIR       Path to the directory where the experiments folders will be / are stored. Affects all grid experiments. (DEFAULT: ./experiments)
  --ll {CRITICAL,ERROR,WARNING,INFO,DEBUG,NOTSET}
                        Log level for the experiments. (Default: INFO)
  --li LOGGING_INTERVAL
                        Statistics logging interval. Will impact logging to the logger and exporting to TensorBoard for the experiments. Do not affect the grid worker itself. Writing to the csv file is not impacted (interval of 1). (Default: 100, i.e. logs every 100 episodes).
  --agree               Request user confirmation before starting the grid experiment. (Default: False)
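
As a rough illustration of how the three grid workers might be chained, the sketch below builds one command per stage. It assumes, without confirmation from the text above, that the stages are run in the order train, test, analyze, and that all three point at the same experiments directory. The helper `build_grid_pipeline` and the configuration file name are hypothetical; only flags from the help output above are used, and nothing is executed.

```python
from typing import List


def build_grid_pipeline(
    config: str,
    outdir: str = "./experiments",
    num_tests: int = 1,
    online_trainer: bool = False,
) -> List[List[str]]:
    """Return argv lists for the trainer, tester and analyzer grid workers."""
    trainer = ["mip-grid-trainer-cpu", "--config", config, "--outdir", outdir]
    if online_trainer:
        # Selects the OnLineTrainer instead of the default OffLineTrainer.
        trainer.append("--online_trainer")
    tester = ["mip-grid-tester-cpu", "--outdir", outdir, "--n", str(num_tests)]
    # The analyzer takes --expdir rather than --outdir.
    analyzer = ["mip-grid-analyzer", "--expdir", outdir]
    return [trainer, tester, analyzer]


if __name__ == "__main__":
    # Hypothetical configuration file name, for illustration only.
    for command in build_grid_pipeline("grid_config.yaml", num_tests=3):
        print(" ".join(command))
```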

NOTES:

Documentation

Documentation is created using Sphinx, and is available on readthedocs.io.

Getting Started

Contributing

Contributions are welcome! Please open an issue if you want to request a new feature or a fix, so that we can discuss it first.

The Team
