apple / pfl-research

Simulation framework for accelerating research in Private Federated Learning
http://apple.github.io/pfl-research/
Apache License 2.0

rdar://120807382 Keep track of the best metric values throughout training #8

Closed grananqvist closed 7 months ago

grananqvist commented 8 months ago

In our benchmarks we report e.g. the maximum accuracy, averaged over multiple seeds. This callback adds such metrics: you specify the names of already existing metrics, and a "best overall" transformation of each is tracked and reported throughout training. A standalone sketch of the idea is shown below.
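A minimal standalone sketch of the idea (hypothetical names, not the actual pfl-research implementation): keep a running best value for a set of named metrics after each central iteration and report it under a derived name.

```python
from typing import Dict, Iterable


class BestOverallTracker:
    """Track the best value seen so far for selected metrics.

    `higher_is_better` is a hypothetical flag; how metric orientation is
    configured in the real callback may differ.
    """

    def __init__(self, metric_names: Iterable[str], higher_is_better: bool = True):
        self._metric_names = list(metric_names)
        self._higher_is_better = higher_is_better
        self._best: Dict[str, float] = {}

    def after_central_iteration(self, metrics: Dict[str, float]) -> Dict[str, float]:
        # Return new metrics named e.g. "accuracy | best overall".
        new_metrics = {}
        for name in self._metric_names:
            if name not in metrics:
                continue
            value = metrics[name]
            best = self._best.get(name)
            if (best is None
                    or (self._higher_is_better and value > best)
                    or (not self._higher_is_better and value < best)):
                self._best[name] = value
            new_metrics[f'{name} | best overall'] = self._best[name]
        return new_metrics


# Example: report the best central-eval accuracy seen so far each iteration.
tracker = BestOverallTracker(['central eval | accuracy'])
print(tracker.after_central_iteration({'central eval | accuracy': 0.71}))
print(tracker.after_central_iteration({'central eval | accuracy': 0.69}))
```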

MetricNamePostfix could only be used with MetricName, not StringMetricName. I relaxed that constraint.
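To illustrate the relaxed constraint, here is a hypothetical standalone sketch (sketch classes, not the pfl-research ones): the postfix wrapper is typed against the base string-name class rather than the more specific subclass, so both kinds of names can be wrapped.

```python
class StringMetricNameSketch:
    """Base: a metric name that is just a description string."""

    def __init__(self, description: str):
        self.description = description

    def __str__(self) -> str:
        return self.description


class MetricNameSketch(StringMetricNameSketch):
    """Subclass: a description qualified by a population."""

    def __init__(self, description: str, population: str):
        super().__init__(description)
        self.population = population

    def __str__(self) -> str:
        return f'{self.population} population | {self.description}'


class MetricNamePostfixSketch(StringMetricNameSketch):
    """Appends a postfix; accepts any StringMetricNameSketch,
    not only MetricNameSketch."""

    def __init__(self, metric_name: StringMetricNameSketch, postfix: str):
        super().__init__(f'{metric_name} | {postfix}')


print(MetricNamePostfixSketch(StringMetricNameSketch('loss'), 'best overall'))
print(MetricNamePostfixSketch(MetricNameSketch('accuracy', 'val'), 'best overall'))
```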

A metric named e.g. `accuracy | best overall (avg)` is confusing, and the `(avg)` postfix is confusing in any case, so I removed it. Whether a metric is an average is defined by the metric itself, not by the fact that `Weighted` was used. The postfix is also not applied to TF metrics (where `KerasMetricValue` is used), so `(avg)` never appears when using TF. Removing it keeps the metric names consistent between TF and PyTorch.