d = Dataset.list_files(pattern, shuffle=False)
d = d.shard(num_workers, worker_index)   # partition the filenames across workers
d = d.repeat(num_epochs)
d = d.shuffle(shuffle_buffer_size)
d = d.interleave(tf.data.TFRecordDataset,
                 cycle_length=num_readers, block_length=1)
d = d.map(parser_fn, num_parallel_calls=num_map_threads)
This means that each MPI process only manages a random set of files, instead of a random set of images drawn from all files. This change removes the performance bottleneck (~3 GB/s).
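For context, here is a minimal sketch of how the shard() arguments could be wired to the MPI rank and world size. The mpi4py calls and the glob pattern are assumptions for illustration; DLIO may obtain these values differently.

from mpi4py import MPI          # assumed way to get rank/size; DLIO may differ
import tensorflow as tf

comm = MPI.COMM_WORLD
num_workers = comm.Get_size()   # total number of MPI processes
worker_index = comm.Get_rank()  # index of this process

pattern = "data/train-*.tfrecord"  # placeholder glob
d = tf.data.Dataset.list_files(pattern, shuffle=False)
# Each rank keeps only every num_workers-th filename, so the file list is
# partitioned across processes before a single record is read.
d = d.shard(num_workers, worker_index)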
Tested using a RAMDISK as the filesystem, with 25 simulated H100s and 4 reader threads for ResNet-50 (MLPerf). Everything is in RAM for this test.
Above 30 GPUs, the "with PR" configuration should work (>90% AU), but it hangs before giving the final result. I think this is simply because I should try with a larger dataset. The DLIO profiler is stuck on something like this.
Important: This is a draft and I haven't really verified that this change is correct. It could be wrong.
Instead of sharding the records read from the files, we shard the filenames.
I used the example described in the TensorFlow source code: https://github.com/tensorflow/tensorflow/blob/1c7f292617e7c276d03c51894928004625f8ec29/tensorflow/python/data/ops/dataset_ops.py#L1642
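To make the difference concrete, here is a minimal sketch contrasting the two sharding points. The record-level variant is only an assumption about the previous behaviour, and num_workers / worker_index / the glob are placeholders.

import tensorflow as tf

num_workers, worker_index = 8, 0  # placeholder values
files = tf.data.Dataset.list_files("data/train-*.tfrecord", shuffle=False)

# Record-level sharding: every process opens every file, then throws away
# all but 1/num_workers of the records it has already read.
per_record = files.interleave(tf.data.TFRecordDataset, cycle_length=4)
per_record = per_record.shard(num_workers, worker_index)

# Filename-level sharding (this change): each process only ever opens its
# own subset of the files, so nothing is read and then discarded.
per_file = files.shard(num_workers, worker_index)
per_file = per_file.interleave(tf.data.TFRecordDataset, cycle_length=4)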