LLNL / lbann

Livermore Big Artificial Neural Network Toolkit
http://software.llnl.gov/lbann/

Enable logging of per-step metric #1605

Open bvanessen opened 4 years ago

bvanessen commented 4 years ago

Log the metric value at every step, not just the averaged per-epoch metric.

timmoon10 commented 4 years ago

The current infrastructure for metrics/objective functions/evaluation layers is a mess that's hurting performance (see #632), so I wonder if this would be a good time to refactor. My proposed scheme is to shift the responsibility of computing metric statistics out of the metric class and into the "print metrics" callback. This has the advantage of allowing the user to print statistics at multiple intervals (e.g. once every 100 steps and once per epoch). To reduce unnecessary GPU synchronizations and allreduces, we can implement a class that holds a scalar and only performs communication when the value is requested (similar to how we handle weight gradients). The "print metrics" callback would need to store two "distributed scalar" objects (sum and count) per metric per execution mode, update their values at every mini-batch step, and divide them when it needs to print.

An alternative, kludgier approach is to create a new callback that bypasses the metric class and directly interacts with evaluation layers. This would duplicate the functionality in the metric class without needing to modify that infrastructure. It will probably have poor performance because it will impose extra GPU synchronizations. We could optimize further, but I'd say it would be a better use of effort to do a proper refactor.