PPPLDeepLearning / plasma-python

PPPL deep learning disruption prediction package
http://tigress-web.princeton.edu/~alexeys/docs-web/html/

Store computational throughput and latency figures in repository #59

Status: Open · opened by felker 4 years ago

felker commented 4 years ago

Related to #58, #52, and #51.

We should add a continually updated record of the examples/second, seconds/batch, and other statistics discussed in #51 to a new file docs/Benchmarking.md (or ComputationalEfficiency.md, etc.).

AFAIK, neither Kates-Harbeck et al. (2019) nor Svyatkovskiy (2017) discussed single-node or single-GPU computational efficiency, since they focused on the scaling of multi-node parallelism (CUDA-aware MPI).

Given that we have multiple active users of the software distributed across the country (world?), it would help collaboration to provide easily accessible metrics of expected performance. The absence of such figures has already caused some confusion when we gained access to V100 GPUs on the Princeton Traverse cluster.

We need to establish a benchmark, or a set of benchmarks, for FRNN in order to measure and communicate consistent and useful metrics. For example, we could store measurements from a single benchmark consisting of 0D and 1D d3d signal data with our LSTM architecture on a single GPU/device with batch_size=256. A user running jet_data_0d would then have to extrapolate the examples/second figure to the simpler network but the longer average pulse lengths of JET.
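
As a rough illustration of how such numbers could be collected, here is a minimal sketch of a Keras callback that records seconds/batch and examples/second. The class name and reporting format are hypothetical and not part of plasma-python; it assumes the standard Keras `Callback` API rather than FRNN's MPI training loop.

```python
# Hypothetical throughput logger; names are illustrative, not existing FRNN code.
import time

import numpy as np
from keras.callbacks import Callback


class ThroughputLogger(Callback):
    """Record seconds/batch and examples/second during training."""

    def __init__(self, batch_size=256):
        super(ThroughputLogger, self).__init__()
        self.batch_size = batch_size
        self.batch_times = []

    def on_batch_begin(self, batch, logs=None):
        self._t0 = time.time()

    def on_batch_end(self, batch, logs=None):
        self.batch_times.append(time.time() - self._t0)

    def on_epoch_end(self, epoch, logs=None):
        # Median is less sensitive to one-off stalls (e.g. data loading) than the mean.
        sec_per_batch = float(np.median(self.batch_times))
        print("epoch %d: %.4f s/batch, %.1f examples/s"
              % (epoch, sec_per_batch, self.batch_size / sec_per_batch))
        self.batch_times = []
```

Passing an instance via `model.fit(..., callbacks=[ThroughputLogger(batch_size=256)])` would then print figures that could be pasted into the proposed docs file.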

The conf.yaml configuration choices that have first-order effects on performance include:

Similar to #41, these figures will be useless in the long run unless we store details of their context, including:

Summary of hardware we have/had/will have access to for computational performance measurements:

Even when hardware is retired (e.g. OLCF Titan), it would be good to keep those figures for posterity.
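
To keep that context attached to each measurement, a small helper along these lines could be run on the benchmark node and its output stored next to the figures. The function name and field choices are hypothetical, not an existing plasma-python API; it assumes nvidia-smi and git are available and that TensorFlow/Keras are the installed backend.

```python
# Hypothetical helper for recording the context of a benchmark run.
import json
import platform
import subprocess

import tensorflow as tf
import keras


def benchmark_context():
    """Collect hardware/software details to store alongside throughput figures."""
    try:
        gpu = subprocess.check_output(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"]
        ).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        gpu = "unknown"
    try:
        commit = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"]).decode().strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"
    return {
        "gpu": gpu,
        "host": platform.node(),
        "python": platform.python_version(),
        "tensorflow": tf.__version__,
        "keras": keras.__version__,
        "git_commit": commit,
    }


if __name__ == "__main__":
    print(json.dumps(benchmark_context(), indent=2))
```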