Currently, FRNN reports the following metrics related to computational speed/efficiency during training, per step and per epoch:
- Examples/sec
- sec/batch
- % of batch time spent in calculation vs. synchronization
- overall batch size = batch size per GPU x N_GPU
As we become more cognizant of our performance expectations for the code on various architectures, I think it would be valuable to make these metrics more informative:
- [ ] At the end of each epoch: summarize the min/max, mean, and std-dev of Examples/sec, sec/batch, % calc, and % sync across all steps within that epoch
- [ ] At the end of all epochs: the same statistics across epochs?
- [ ] Add greater granularity of timing information within an epoch?
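The per-epoch summary above could be implemented with something like the following minimal sketch (the function name and metric keys are hypothetical, not existing FRNN code):

```python
import statistics


def summarize_epoch(step_metrics):
    """Summarize per-step timing metrics collected during one epoch.

    step_metrics: dict mapping a metric name (e.g. "examples/sec",
    "sec/batch", "% calc", "% sync") to the list of per-step values
    recorded during the epoch.
    """
    summary = {}
    for name, values in step_metrics.items():
        summary[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            # stdev() needs at least two samples
            "std": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
    return summary


# Example with fabricated per-step values for one epoch
epoch_stats = summarize_epoch({
    "examples/sec": [220.0, 223.5, 224.1, 223.8],
    "sec/batch": [0.145, 0.143, 0.142, 0.143],
})
print(epoch_stats["examples/sec"])
```

The same helper could be reused at the end of training by feeding it per-epoch means instead of per-step values.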
For the final performance metrics (at the end of all epochs), we should probably exclude the first epoch or so, because TensorFlow invokes the cuDNN autotuner on the first call to `tf.Session.run()` when the undocumented environment variable `TF_CUDNN_USE_AUTOTUNE=1` is set (the default). See https://github.com/tensorflow/tensorflow/blob/fddd829a0795a98b1bdac63c5acaed2c3d8122ff/tensorflow/core/util/use_cudnn.cc#L36 and https://stackoverflow.com/questions/45063489/first-tf-session-run-performs-dramatically-different-from-later-runs-why for an explanation.

Although, I wonder if the initial run that is thrown away in order to force compilation already accomplishes this, since it calls Keras `train_on_batch()`: https://github.com/PPPLDeepLearning/plasma-python/blob/c82ba61e339882a5af10b1052edc0348e16119f4/plasma/models/mpi_runner.py#L549-L564

Note: I have tested the effects of the `TF_CUDNN_USE_AUTOTUNE` variable on https://github.com/tensorflow/benchmarks, specifically on Traverse V100s and TigerGPU P100s, and disabling the autotuner leads to a loss of about 10% performance:
P100:
```
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec  total_loss
1    images/sec: 223.9 +/- 0.0 (jitter = 0.0)  8.169
10   images/sec: 223.9 +/- 0.1 (jitter = 0.3)  7.593
20   images/sec: 223.7 +/- 0.1 (jitter = 0.3)  7.696
30   images/sec: 223.6 +/- 0.1 (jitter = 0.3)  7.753
40   images/sec: 223.6 +/- 0.1 (jitter = 0.3)  8.007
50   images/sec: 223.4 +/- 0.1 (jitter = 0.3)  7.520
60   images/sec: 223.3 +/- 0.1 (jitter = 0.4)  7.988
70   images/sec: 223.3 +/- 0.1 (jitter = 0.4)  8.028
80   images/sec: 223.4 +/- 0.1 (jitter = 0.4)  7.932
90   images/sec: 223.5 +/- 0.1 (jitter = 0.5)  7.848
100  images/sec: 223.5 +/- 0.1 (jitter = 0.5)  7.796
----------------------------------------------------------------
total images/sec: 223.38
----------------------------------------------------------------
-----------------------------------------------------------------------------
export TF_CUDNN_USE_AUTOTUNE=0
-----------------------------------------------------------------------------
TensorFlow:  1.13
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec  total_loss
1    images/sec: 204.8 +/- 0.0 (jitter = 0.0)  8.169
10   images/sec: 204.8 +/- 0.0 (jitter = 0.1)  7.593
20   images/sec: 204.8 +/- 0.0 (jitter = 0.2)  7.696
30   images/sec: 204.8 +/- 0.1 (jitter = 0.2)  7.753
40   images/sec: 204.8 +/- 0.1 (jitter = 0.2)  8.007
50   images/sec: 204.8 +/- 0.0 (jitter = 0.2)  7.520
60   images/sec: 204.8 +/- 0.0 (jitter = 0.2)  7.988
70   images/sec: 204.8 +/- 0.1 (jitter = 0.2)  8.027
80   images/sec: 204.8 +/- 0.1 (jitter = 0.2)  7.931
90   images/sec: 204.9 +/- 0.1 (jitter = 0.3)  7.850
100  images/sec: 204.9 +/- 0.1 (jitter = 0.3)  7.797
----------------------------------------------------------------
total images/sec: 204.77
----------------------------------------------------------------
```

V100:
```
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec  total_loss
1    images/sec: 339.6 +/- 0.0 (jitter = 0.0)  8.169
10   images/sec: 339.9 +/- 0.1 (jitter = 0.4)  7.593
20   images/sec: 340.0 +/- 0.1 (jitter = 0.3)  7.696
30   images/sec: 340.1 +/- 0.1 (jitter = 0.4)  7.753
40   images/sec: 340.1 +/- 0.1 (jitter = 0.4)  8.007
50   images/sec: 340.0 +/- 0.1 (jitter = 0.4)  7.519
60   images/sec: 340.0 +/- 0.1 (jitter = 0.4)  7.988
70   images/sec: 340.0 +/- 0.1 (jitter = 0.5)  8.027
80   images/sec: 340.0 +/- 0.1 (jitter = 0.5)  7.931
90   images/sec: 340.0 +/- 0.1 (jitter = 0.5)  7.849
100  images/sec: 340.0 +/- 0.1 (jitter = 0.5)  7.797
----------------------------------------------------------------
total images/sec: 339.79
----------------------------------------------------------------
-----------------------------------------------------------------------------
export TF_CUDNN_USE_AUTOTUNE=0
-----------------------------------------------------------------------------
TensorFlow:  1.14
Model:       resnet50
Dataset:     imagenet (synthetic)
Mode:        training
SingleSess:  False
Batch size:  32 global
             32 per device
Num batches: 100
Num epochs:  0.00
Devices:     ['/gpu:0']
NUMA bind:   False
Data format: NCHW
Optimizer:   sgd
Variables:   parameter_server
==========
Generating training model
Initializing graph
Running warm up
Done warm up
Step  Img/sec  total_loss
1    images/sec: 312.1 +/- 0.0 (jitter = 0.0)  8.169
10   images/sec: 312.1 +/- 0.1 (jitter = 0.2)  7.593
20   images/sec: 312.1 +/- 0.1 (jitter = 0.2)  7.696
30   images/sec: 312.1 +/- 0.1 (jitter = 0.3)  7.753
40   images/sec: 312.2 +/- 0.1 (jitter = 0.3)  8.007
50   images/sec: 312.2 +/- 0.1 (jitter = 0.3)  7.520
60   images/sec: 312.1 +/- 0.1 (jitter = 0.3)  7.989
70   images/sec: 312.1 +/- 0.1 (jitter = 0.4)  8.026
80   images/sec: 312.1 +/- 0.1 (jitter = 0.4)  7.932
90   images/sec: 312.1 +/- 0.0 (jitter = 0.4)  7.849
100  images/sec: 312.1 +/- 0.0 (jitter = 0.4)  7.795
----------------------------------------------------------------
total images/sec: 312.01
----------------------------------------------------------------
```
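On the point about excluding the first epoch from the final metrics: a minimal sketch of how that could look (function name and the warm-up count are hypothetical, not existing FRNN code):

```python
import statistics


def final_throughput(per_epoch_examples_per_sec, warmup_epochs=1):
    """Aggregate examples/sec across epochs, discarding the first
    `warmup_epochs` epochs so that one-time costs (e.g. the cuDNN
    autotuner triggered on the first tf.Session.run()) do not skew
    the reported numbers."""
    steady = per_epoch_examples_per_sec[warmup_epochs:]
    if not steady:
        raise ValueError("need more epochs than warmup_epochs")
    return {
        "mean": statistics.mean(steady),
        "std": statistics.stdev(steady) if len(steady) > 1 else 0.0,
    }


# Fabricated example: the first epoch is slow while cuDNN autotuning runs
print(final_throughput([180.0, 223.4, 223.6, 223.2]))
```

Whether `warmup_epochs=1` is enough would need to be checked empirically; if the thrown-away `train_on_batch()` call in `mpi_runner.py` already triggers the autotuner, it might even be zero.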