NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Could add the custom training sequence for FileReader? #1732

Closed iyupan closed 3 years ago

iyupan commented 4 years ago

Hello, many thanks to all the contributors for the great work on DALI. I'm now facing a new challenge. Some papers I've read use a special strategy to generate data batches, such as a balancing strategy where the authors sample the same number of examples from each class into a mini-batch. It is difficult for me to implement such a strategy with FileReader on the ImageNet dataset, so I'm really looking forward to DALI supporting custom training sequences.

Thanks again.

klecki commented 4 years ago

Hi, I think a simple version of this can be achieved by providing a specially crafted file_list to the FileReader Op. If you turn off the random_shuffle the samples will be read in the provided order. The order is stored in a vector image_label_pairs_, that you can find here: https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/loader/file_loader.h#L138 It's initialized several lines above.
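For illustration, here is a minimal sketch of writing such a crafted file_list, assuming the usual one-`path label`-pair-per-line format that FileReader's file_list argument expects; the file names and class layout below are made up:

```python
import os
import tempfile

# Hypothetical per-class file lists; paths and labels are made up for illustration.
samples = {
    0: ["class0/a.jpg", "class0/b.jpg"],
    1: ["class1/c.jpg", "class1/d.jpg"],
}

# Interleave the classes so consecutive samples alternate between them.
ordered = []
for a, b in zip(samples[0], samples[1]):
    ordered += [(a, 0), (b, 1)]

# Write one "relative/path label" pair per line, the file_list text format.
file_list_path = os.path.join(tempfile.gettempdir(), "file_list.txt")
with open(file_list_path, "w") as f:
    for path, label in ordered:
        f.write(f"{path} {label}\n")
```

With random_shuffle turned off, the reader would then return samples in exactly this interleaved order.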

Can I ask you to provide an example of how the strategy of taking new samples works? Or where it is described in more detail?

We're open to external contributions; if you want to add such a feature, just file a PR with us.

Thanks, Krzysztof.

JanuszL commented 4 years ago

> Hi, I think a simple version of this can be achieved by providing a specially crafted file_list to the FileReader Op. If you turn off the random_shuffle the samples will be read in the provided order.

@perryupan - it means providing the order of samples for the whole training (all epochs); alternatively, you can implement your own reader using the ExternalSource operator.

Regarding a custom balancing strategy, I don't see any general option, as I can imagine a handful of different approaches that can only be expressed by custom code and won't fit into any prebaked operator, no matter how flexible it is.

iyupan commented 4 years ago

> Hi, I think a simple version of this can be achieved by providing a specially crafted file_list to the FileReader Op. If you turn off the random_shuffle the samples will be read in the provided order. The order is stored in a vector image_label_pairs_, that you can find here: https://github.com/NVIDIA/DALI/blob/master/dali/operators/reader/loader/file_loader.h#L138 It's initialized several lines above.
>
> Can I ask you to provide an example of how the strategy of taking new samples works? Or where it is described in more detail?
>
> We're open to external contributions; if you want to add such a feature, just file a PR with us.
>
> Thanks, Krzysztof.

Sorry for the late reply.

An example is from the paper "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks", and the original words are :

> To address this issue, we apply a sampling strategy, named “class-aware sampling”, during training. We aim to fill a mini-batch as uniform as possible with respect to classes, and prevent the same example and class from always appearing in a permanent order. In practice, we use two types of lists, one is class list, and the other is per-class image list. Taking Places2 challenge dataset for example, we have one class list, and 401 per-class image lists. When getting a training mini-batch in an iteration, we first sample a class X in the class list, then sample an image in the per-class image list of class X. When reaching the end of the per-class image list of class X, a shuffle operation is performed to reorder the images of class X. When reaching the end of class list, a shuffle operation is performed to reorder the classes. We leverage such a class-aware sampling strategy to effectively tackle the non-uniform class distribution, and the gain of accuracy on the validation set is about 0.6%.
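The quoted strategy can be sketched in plain Python (a hedged illustration; the class and image lists below are placeholders, and the resulting order could then be written to a file_list or fed through ExternalSource):

```python
import random

class ClassAwareSampler:
    """Class-aware sampling as described in the paper: sample a class from
    the class list, then an image from that class's per-class list, and
    reshuffle each list when it is exhausted. Data layout is hypothetical."""

    def __init__(self, images_per_class, seed=0):
        self.rng = random.Random(seed)
        self.per_class = {c: list(imgs) for c, imgs in images_per_class.items()}
        self.classes = list(self.per_class)
        self.rng.shuffle(self.classes)
        self.class_cursor = 0
        self.img_cursor = {c: 0 for c in self.classes}
        for imgs in self.per_class.values():
            self.rng.shuffle(imgs)

    def next(self):
        # When the class list is exhausted, reshuffle it and start over.
        if self.class_cursor == len(self.classes):
            self.rng.shuffle(self.classes)
            self.class_cursor = 0
        c = self.classes[self.class_cursor]
        self.class_cursor += 1
        # When this class's image list is exhausted, reshuffle that list.
        if self.img_cursor[c] == len(self.per_class[c]):
            self.rng.shuffle(self.per_class[c])
            self.img_cursor[c] = 0
        img = self.per_class[c][self.img_cursor[c]]
        self.img_cursor[c] += 1
        return img, c
```

Each full pass over the class list yields exactly one sample per class, which is what keeps the mini-batches approximately uniform over classes.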

And using the ExternalSource operator looks like a good idea.

iyupan commented 4 years ago

> Hi, I think a simple version of this can be achieved by providing a specially crafted file_list to the FileReader Op. If you turn off the random_shuffle the samples will be read in the provided order.
>
> @perryupan - it means providing the order of samples for the whole training (all epochs); alternatively, you can implement your own reader using the ExternalSource operator.
>
> Regarding a custom balancing strategy, I don't see any general option, as I can imagine a handful of different approaches that can only be expressed by custom code and won't fit into any prebaked operator, no matter how flexible it is.

Hello, using the ExternalSource operator sounds like a good idea; I'll try it later. Thanks.
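As a hedged sketch of the ExternalSource route: a plain-Python callable that yields class-balanced batches, which could then be passed as the source of DALI's external_source. The names and file paths are placeholders, and a real pipeline would return encoded image buffers rather than path strings:

```python
import random

def make_balanced_batch_source(per_class_files, batch_size, seed=0):
    """Returns a callable usable as the data source for an ExternalSource-
    based reader. `per_class_files` maps a class label to its file list;
    both are hypothetical here."""
    rng = random.Random(seed)
    classes = list(per_class_files)

    def next_batch():
        batch, labels = [], []
        for _ in range(batch_size):
            # Sample a class uniformly (balanced in expectation), then a
            # file from that class, as in class-aware sampling.
            c = rng.choice(classes)
            batch.append(rng.choice(per_class_files[c]))
            labels.append(c)
        return batch, labels

    return next_batch
```

In a DALI pipeline, each call to the returned function would supply one batch, with the file-to-tensor decoding done either in the callable or by downstream operators.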

JanuszL commented 4 years ago

You can still prebake (randomly) the whole training sequence and write it to the file_list. From the training point of view it is still random (although baked offline).

iyupan commented 4 years ago

> You can still prebake (randomly) the whole training sequence and write it to the file_list. From the training point of view it is still random (although baked offline).

The difficulty with setting the file_list of the FileReader is that the data sequence is expected to differ between epochs, so combining balanced training with a different sequence per epoch seems tricky.

JanuszL commented 4 years ago

If you provide a sequence with the length of the whole training, then it should work - one DALI epoch would simply be as long as your whole training.
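A hedged sketch of this prebaking idea: reshuffle (or re-balance) each logical epoch independently, concatenate all epochs, and write one long file_list; the dataset paths below are hypothetical:

```python
import os
import random
import tempfile

# Hypothetical dataset: a few files per class, and a short training run.
per_class = {0: ["c0/x.jpg", "c0/y.jpg"], 1: ["c1/x.jpg", "c1/y.jpg"]}
num_epochs = 3

rng = random.Random(42)
lines = []
for _ in range(num_epochs):
    # Build and shuffle one logical epoch, then append it; DALI then sees
    # a single long "epoch" that covers the whole training run.
    epoch = [(f, c) for c, files in per_class.items() for f in files]
    rng.shuffle(epoch)
    lines += [f"{f} {c}" for f, c in epoch]

list_path = os.path.join(tempfile.gettempdir(), "whole_training_file_list.txt")
with open(list_path, "w") as out:
    out.write("\n".join(lines) + "\n")
```

Every epoch-sized chunk of the file covers the full dataset in a different order, so the per-epoch shuffling happens offline while the reader itself stays deterministic.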

iyupan commented 4 years ago

> If you provide a sequence with the length of the whole training, then it should work - one DALI epoch would simply be as long as your whole training.

Oh, I think it would work. Haha, thank you.

JanuszL commented 3 years ago

@perryuu - a small update: the file reader now accepts a files argument, so you can provide any sequence of files you like. If you turn off shuffling, they will be returned in the order provided.
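For example, a sketch of building such an explicit sequence (the paths are hypothetical, and the commented reader call assumes a DALI version where fn.readers.file accepts files and labels):

```python
# Build an explicit, pre-balanced file order (hypothetical paths) to pass
# to the reader via the `files` argument mentioned above.
per_class = {0: ["c0/a.jpg", "c0/b.jpg"], 1: ["c1/a.jpg", "c1/b.jpg"]}

files, labels = [], []
# Interleave the classes so every consecutive pair is class-balanced.
for pair in zip(per_class[0], per_class[1]):
    for label, path in enumerate(pair):
        files.append(path)
        labels.append(label)

# With a recent DALI, the reader call would look roughly like (not run here):
#   images, lbls = fn.readers.file(files=files, labels=labels,
#                                  random_shuffle=False)
```

Since shuffling is off, the reader would consume `files` in exactly the interleaved, class-balanced order built above.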