ctlearn-project / ctlearn

Deep Learning for IACT Event Reconstruction
BSD 3-Clause "New" or "Revised" License

Improve training performance #131

Closed aribrill closed 4 years ago

aribrill commented 4 years ago

This PR makes training models with CTLearn faster and more correct.

First, the input function has been restructured to use Dataset.from_tensor_slices() and Dataset.map() for simplicity and performance. With https://github.com/cta-observatory/dl1-data-handler/pull/88, using the num_parallel_calls argument of Dataset.map() works, enabling a ~40% speedup on my machine. In addition, tf.data.experimental.prefetch_to_device() is used to prefetch examples directly to GPU, providing another 20-25% speedup.
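The restructured pipeline can be sketched roughly as follows (the `make_dataset` helper, shapes, and loader are illustrative, not CTLearn's actual code; on a GPU machine the final `prefetch` would instead be `ds.apply(tf.data.experimental.prefetch_to_device("/gpu:0"))`):

```python
import tensorflow as tf

def make_dataset(example_ids, load_example, batch_size=32):
    # Build the dataset from lightweight example identifiers only.
    ds = tf.data.Dataset.from_tensor_slices(example_ids)
    # Run the (potentially expensive) loading function in parallel;
    # this is where the num_parallel_calls speedup comes from.
    ds = ds.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.batch(batch_size)
    # Prefetch batches so loading overlaps with training; with a GPU,
    # prefetch_to_device() copies them straight to device memory.
    ds = ds.prefetch(tf.data.AUTOTUNE)
    return ds
```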

Second, the training and validation loop has been restructured to properly shuffle and use all of the data. Previously, only a small buffer of examples was shuffled and trained on during each iteration. In particular, when a random seed was set, the shuffling would be deterministic on each iteration, so the same set of examples would be used each time. As a result, only a small subset of the training set was actually used for training!
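A minimal pure-Python model of the old behavior makes the bug concrete (this simulates buffered shuffling in the style of `Dataset.shuffle()`; it is not CTLearn code). Rebuilding the pipeline with a fixed seed on every iteration yields the same order each time, so training on the first slice of each "iteration" touches the same small subset of examples drawn from near the start of the dataset:

```python
import random

def buffered_shuffle(examples, buffer_size, seed):
    # Simulates tf.data-style buffered shuffling: only buffer_size
    # examples are held in memory and drawn at random.
    rng = random.Random(seed)
    buffer, out = [], []
    for ex in examples:
        buffer.append(ex)
        if len(buffer) > buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    rng.shuffle(buffer)
    out.extend(buffer)
    return out
```

With a fixed seed, `buffered_shuffle(data, 50, 42)[:100]` is identical on every call, and its elements can only come from roughly the first 150 input examples.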

This PR fixes this problem by using tf.estimator.train_and_evaluate(), which constructs the training input function only once, combined with Dataset.repeat() to allow training for an indefinite number of epochs. Because of the input function changes above, only the example identifiers are shuffled, allowing a perfect shuffle for a dataset of virtually any size. To support this API, the max_steps argument determines the total number of training steps, replacing the number of validations and the number of steps between validations.
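The shuffle-then-repeat idea can be sketched like this (again an illustrative helper, not CTLearn's actual input function; `train_and_evaluate()` would consume such a dataset until `max_steps` is reached):

```python
import tensorflow as tf

def training_dataset(example_ids, load_example, batch_size):
    ds = tf.data.Dataset.from_tensor_slices(example_ids)
    # Shuffling only the identifiers keeps the buffer as large as the
    # whole dataset, giving a perfect shuffle at negligible memory cost.
    ds = ds.shuffle(buffer_size=len(example_ids),
                    reshuffle_each_iteration=True)
    # repeat() makes the dataset effectively endless; the total amount
    # of training is bounded by max_steps instead.
    ds = ds.repeat()
    ds = ds.map(load_example, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size)
```

Because `shuffle()` precedes `repeat()`, every epoch is a fresh permutation of the full training set, so each example is visited exactly once per epoch.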

Finally, the now-obsolete configuration parameters shuffle_buffer_size, prefetch_buffer_size, num_validations, and num_training_steps_per_validation have been removed, and new parameters prefetch_to_device and max_steps have been added to support the new features.
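As a hypothetical sketch of the resulting configuration (the parameter names come from this PR, but the section nesting and values shown here are assumed, not taken from CTLearn's actual config schema):

```yaml
# Removed: shuffle_buffer_size, prefetch_buffer_size,
#          num_validations, num_training_steps_per_validation
Input:
  prefetch_to_device: true   # prefetch examples directly to GPU
Training:
  max_steps: 100000          # total training steps, replacing the
                             # per-validation step accounting
```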