argonne-lcf / dlio_benchmark

An I/O benchmark for deep learning applications
https://dlio-benchmark.readthedocs.io
Apache License 2.0

Shard filenames instead of images (tfreader) #197

Closed LouisDDN closed 6 months ago

LouisDDN commented 6 months ago

Important: This is a draft and I haven't really verified that this change is correct. It could be wrong.

Instead of sharding the images (records) read from the files, we shard the list of filenames.

I used the example given in the TensorFlow source code: https://github.com/tensorflow/tensorflow/blob/1c7f292617e7c276d03c51894928004625f8ec29/tensorflow/python/data/ops/dataset_ops.py#L1642

    d = Dataset.list_files(pattern, shuffle=False)
    d = d.shard(num_workers, worker_index)
    d = d.repeat(num_epochs)
    d = d.shuffle(shuffle_buffer_size)
    d = d.interleave(tf.data.TFRecordDataset,
                     cycle_length=num_readers, block_length=1)
    d = d.map(parser_fn, num_parallel_calls=num_map_threads)

This means that each MPI process only manages a random set of files, instead of a random set of images from all files. This change removes the performance bottleneck (~3 GB/s).
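The difference between the two strategies can be sketched in plain Python. This is only an illustrative model, not DLIO or TensorFlow code: the file names, the counts, and the helpers `shard`, `read_records`, `shard_records` and `shard_filenames` are all hypothetical. The only behaviour borrowed from TensorFlow is that `Dataset.shard(n, i)` keeps every n-th element starting at index `i`.

```python
# Pure-Python model of the two sharding strategies (not DLIO or TensorFlow code).
# Hypothetical setup: 8 files, 4 records each, 4 workers.
NUM_FILES = 8
RECORDS_PER_FILE = 4
NUM_WORKERS = 4

filenames = [f"file_{i}.tfrecord" for i in range(NUM_FILES)]


def shard(items, num_shards, index):
    """Keep every num_shards-th item starting at index, like Dataset.shard."""
    return [x for pos, x in enumerate(items) if pos % num_shards == index]


def read_records(files, opened):
    """Simulate reading: remember each opened file, return (file, record_id) pairs."""
    out = []
    for name in files:
        opened.add(name)
        out.extend((name, r) for r in range(RECORDS_PER_FILE))
    return out


def shard_records(worker):
    """Record-level sharding: read every file, then keep 1/N of the records."""
    opened = set()
    records = shard(read_records(filenames, opened), NUM_WORKERS, worker)
    return records, opened


def shard_filenames(worker):
    """Filename-level sharding: pick 1/N of the files, then read only those."""
    opened = set()
    records = read_records(shard(filenames, NUM_WORKERS, worker), opened)
    return records, opened


for w in range(NUM_WORKERS):
    rec_a, open_a = shard_records(w)
    rec_b, open_b = shard_filenames(w)
    print(f"worker {w}: record sharding opens {len(open_a)} files, "
          f"filename sharding opens {len(open_b)} files, "
          f"yielding {len(rec_a)} and {len(rec_b)} records")
```

In this model every worker ends up with the same number of records either way, but with record-level sharding each worker has to open all the files to get them, whereas with filename-level sharding it opens only its own share.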

Tested using a RAM disk as the filesystem, with 25 simulated H100s and 4 reader threads, for resnet50 (MLPerf). Everything is in RAM for this test.

    ./benchmark.sh run --hosts myhost --workload resnet50 --param reader.read_threads=4 --accelerator-type h100 --num-accelerators 25 --results-dir fakeres --param dataset.num_files_train=200 --param dataset.data_folder=/mnt/ramdisk/resnet50

| result (25 H100) | DLIO AU (%) | Throughput (GB/s) |
|------------------|-------------|-------------------|
| main             | 51.7146     | 2.475             |
| with PR          | 96.5845     | 4.581             |
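
To put these figures in relative terms, here is a small sketch that computes the ratio between the two rows. The numbers are copied from the results above; the dictionary layout and the `ratio` helper are only illustrative, and AU is read as a percentage based on the ">90% AU" remark in this description.

```python
# Figures copied from the results above (25 simulated H100, resnet50).
# Assumption: AU is a percentage, throughput is in GB/s.
results = {
    "main": {"au_percent": 51.7146, "throughput_gb_per_s": 2.475},
    "with PR": {"au_percent": 96.5845, "throughput_gb_per_s": 4.581},
}


def ratio(metric):
    """Ratio of the 'with PR' value to the 'main' value for one metric."""
    return results["with PR"][metric] / results["main"][metric]


au_ratio = ratio("au_percent")
throughput_ratio = ratio("throughput_gb_per_s")

print(f"AU ratio: {au_ratio:.2f}x")
print(f"Throughput ratio: {throughput_ratio:.2f}x")
```

Both ratios come out a little under 2x for this single run.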

With more than 30 GPUs, the "with PR" version should still work (>90% AU), but it hangs before giving the final result. I think this is simply because I should try with a larger dataset. The DLIO profiler gets stuck, something like this.