NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

How to make TFRecordReader shuffle in multi tfrecord files? #2072

Open rivergold opened 4 years ago

rivergold commented 4 years ago

I have multiple TFRecord files, and each file contains a single class label. For example, every sample in file a has class label a, every sample in file b has class label b, and so on. According to #874, TFRecordReader only shuffles locally, which means that during training all samples in a mini-batch will have the same class ID. If I want different class labels within a batch, can I solve this without rewriting the TFRecord files in shuffled order?
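The concern can be illustrated with a simplified model of a buffer-based local shuffle. This is only a sketch and not DALI's actual implementation: the function name `buffered_shuffle`, the buffer size of 100, and the file sizes of 1000 samples are all assumptions chosen for illustration.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=0):
    """Yield items from `stream` in locally shuffled order.

    Hypothetical model of a local shuffle: items are read sequentially
    into a fixed-size buffer, and a random buffered item is emitted
    each time the buffer is full.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


# Two files read back to back, each holding a single class (assumed sizes).
labels = ["a"] * 1000 + ["b"] * 1000
shuffled = list(buffered_shuffle(labels, buffer_size=100))
first_batch = shuffled[:32]
print(set(first_batch))
```

Because the buffer is much smaller than one file, the first batch can only draw from file a. Labels start to mix only near the boundary between the two files.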

JanuszL commented 4 years ago

Hi, this is not possible now. The only thing that comes to my mind is to allow interleaving samples from each TFRecord file when DALI builds a mapping between files and TFRecord indexes here. However, it is hard to tell how this would impact performance, since we would keep opening and closing files at each entry read (by default DALI mmaps files and keeps them open in the cache, so it may not be that terrible). We would be more than happy to review a PR that adds such functionality.
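The interleaving idea could look roughly like the sketch below. It is not DALI code: it assumes the per-file index is available as a list of record offsets, and the helper name `interleave_index`, the file names, and the offsets are all hypothetical.

```python
from itertools import zip_longest


def interleave_index(per_file_offsets):
    """Build a round-robin (file, offset) mapping across files.

    `per_file_offsets` maps a file name to the list of record offsets
    found in that file's index (a hypothetical input format).
    """
    missing = object()  # sentinel for files that run out of records early
    columns = [
        [(name, offset) for offset in offsets]
        for name, offsets in per_file_offsets.items()
    ]
    mapping = []
    for row in zip_longest(*columns, fillvalue=missing):
        mapping.extend(entry for entry in row if entry is not missing)
    return mapping


# Hypothetical index contents for two single-class files.
per_file = {
    "a.tfrecord": [0, 100, 200],
    "b.tfrecord": [0, 50],
}
mapping = interleave_index(per_file)
print(mapping)
```

With the mapping ordered this way, consecutive reads alternate between files, so even a small local shuffle buffer would hold samples from every class. The cost mentioned above is that each read may touch a different file.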