Features

1. per_step_interval
2. max_steps
3. Support for listing the dataset directory instead of hard-coding the bucket name, taking the bucket name from the run flags.

(1) and (2) come from the design doc. (3) is a bug identified in a conversation with the HNS team: the current hard-coded values prevent the benchmark from being easily run against different bucket names.
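As a rough illustration of feature (3), the file list can be discovered by listing the directory passed in on the command line rather than baking names into the code. The flag name `--dataset-dir` and the helper below are hypothetical, not the benchmark's actual interface:

```python
import argparse
import os


def build_file_list(dataset_dir):
    """Hypothetical helper: discover dataset files by listing the
    directory given via a run flag, instead of hard-coding names."""
    return sorted(
        os.path.join(dataset_dir, name) for name in os.listdir(dataset_dir)
    )


def parse_args(argv=None):
    # --dataset-dir is an illustrative flag name for this sketch.
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-dir", required=True)
    return parser.parse_args(argv)
```

The same idea applies to a GCS bucket path, with the directory listing replaced by an object-listing call against the flag-supplied bucket.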
Internal CL to update the README and the YAML file: cl/662264972.
Tested by

- Setting per_step_interval to 1 second and confirming that the per-step time is roughly 1 second, except for steps whose data loading takes longer.
- Setting max_steps and confirming that training stops gracefully once the global step is reached, and that the metrics are recorded correctly.
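A minimal sketch of how the two flags exercised above might interact in the training loop; the function name, signature, and flag semantics here are assumptions for illustration, not the benchmark's actual implementation:

```python
import time


def run_benchmark(max_steps, per_step_interval, load_batch, train_step):
    """Illustrative pacing loop (names and signature are hypothetical).

    max_steps: stop gracefully once this many steps have run.
    per_step_interval: pad each step to last at least this many seconds,
    unless data loading plus training already took longer.
    """
    for step in range(max_steps):
        start = time.monotonic()
        batch = load_batch()   # data loading may itself exceed the interval
        train_step(batch)
        elapsed = time.monotonic() - start
        if elapsed < per_step_interval:
            time.sleep(per_step_interval - elapsed)
    # after the loop exits at max_steps, metrics would be recorded here
```

Under this reading, a step is only slowed down, never sped up: a slow data-loading step runs at its natural length, which matches the exception observed in the 1-second test above.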
Next steps

- Once this PR and the CL are merged, I'll build the image and upload it to gcr.io/gcs-tess/distributed_pytorch_training_benchmark.
- Once we start moving this to the new, simpler framework, I'll add unit tests to cover the various features.