NVIDIA / DALI

A GPU-accelerated library containing highly optimized building blocks and an execution engine for data processing to accelerate deep learning training and inference applications.
https://docs.nvidia.com/deeplearning/dali/user-guide/docs/index.html
Apache License 2.0

Is it possible to execute a custom operation on multiple CUDA streams? #3261

Closed whnbaek closed 1 year ago

whnbaek commented 3 years ago

Hello! I am currently working on attaching a custom operator to DALI to support a GPU version of RandAugment. RandAugment consists of 13 sub-ops, and each op can be executed in parallel, so I am considering executing these sub-ops on different CUDA streams (i.e. 13 streams) inside the RandAugment operator. However, it seems the framework matches one operator to one CUDA stream. (The StreamPool class creates streams outside the operator and gives one stream to the operator in the form of a Workspace class.) That is, RandAugment secures only one stream while it needs 13. I think merely executing sub-ops on different streams without considering the pipeline's scheduling policy could ruin performance. Is there any way to make the pipeline scheduler aware of these sub-ops, or any other idea to solve this problem?

klecki commented 3 years ago

Hi @baneling100, the DALI Pipeline executor uses one stream for the GPU stage (all GPU operators). It runs the operators in topological order. In theory you could create 13 streams in the instance of your operator, but you would need events synchronizing those 13 RandAugment streams with the one in the DALI Workspace - such a mechanism already exists to ensure that operators read their inputs only after they have been produced.
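For illustration, here is a minimal sketch of that fork/join event pattern using CuPy streams and events (CuPy is my stand-in for readability; DALI's internal mechanism lives in C++ and differs in detail):

# Not DALI's internal API - just a CuPy sketch of the event pattern above:
# side streams wait on an event recorded on the main stream, and the main
# stream waits on events recorded on each side stream before moving on.
import cupy as cp

main = cp.cuda.Stream(non_blocking=True)            # stands in for DALI's GPU-stage stream
side = [cp.cuda.Stream(non_blocking=True) for _ in range(13)]

with main:
    x = cp.ones((8, 1024, 1024), dtype=cp.float32)  # "input produced" on the main stream

ready = cp.cuda.Event()
ready.record(main)                                  # fork point: inputs are ready

outs = []
for i, s in enumerate(side):
    s.wait_event(ready)                             # a sub-op must not start before its input exists
    with s:
        outs.append(x + i)                          # stand-in for one RandAugment sub-op

for s in side:
    done = cp.cuda.Event()
    done.record(s)
    main.wait_event(done)                           # join: main stream waits for every sub-op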

The main performance consideration for DALI is that operators process batches. For bigger batches, the overhead of scheduling a kernel run for every sample, compared to scheduling it once for the whole batch, is significant. It might be hard to write kernels that process samples in the batch differently; we have never tried that.

We are in the process of adding conditional processing to DALI, but RandAugment might be outside the scope of this effort (we will be supporting AutoAugment-like use cases). The problem with RandAugment is that you select several augmentations in random order for a given sample - you would need to express all the possible combinations inside the DALI graph. Selecting one of the sub-policies (so one of several paths in the graph for a given sample) will be supported.

whnbaek commented 3 years ago

Thank you for your reply, @klecki. It seems we are in opposite time zones :) For now I'm trying to make the RandAugment operator with only one stream and one kernel, like the others. Anyway, I'm curious about the details of your team's plan. As I understand it, two things will be added.

  1. Conditional processing: I saw your 2021 roadmap and your comments, but I'm not sure what that means exactly. Does it look like this?

    imgs = fn.operator(imgs, probability = ...)

    Is the operator applied with the given probability to the minibatch (batch-wise) or to each sample in the minibatch (sample-wise)? I guess sample-wise... right?

  2. Selecting one of the sub-policies: Is it also a sample-wise selection?

    imgs = fn.select(imgs1, imgs2, imgs3, ..., )

With these two features, I can build AutoAugment or RandAugment. AutoAugment:

# given imgs
# sub-policy 1
imgs1 = fn.sample_wise_op1(imgs, probability = p1, magnitude = m1)
imgs2 = fn.sample_wise_op2(imgs1, probability = p2, magnitude = m2)
# sub-policy 2
imgs3 = fn.sample_wise_op3(imgs, probability = p3, magnitude = m3)
imgs4 = fn.sample_wise_op4(imgs3, probability = p4, magnitude = m4)
# sub-policy 3
imgs5 = fn.sample_wise_op5(imgs, probability = p5, magnitude = m5)
imgs6 = fn.sample_wise_op6(imgs5, probability = p6, magnitude = m6)
# sub-policy 4
imgs7 = fn.sample_wise_op7(imgs, probability = p7, magnitude = m7)
imgs8 = fn.sample_wise_op8(imgs7, probability = p8, magnitude = m8)
# sub-policy 5
imgs9 = fn.sample_wise_op9(imgs, probability = p9, magnitude = m9)
imgs10 = fn.sample_wise_op10(imgs9, probability = p10, magnitude = m10)
# select
out = fn.sample_wise_select(imgs2, imgs4, imgs6, imgs8, imgs10)

and RandAugment:

# given imgs
imgs1 = fn.op1(imgs, magnitude = m) # batch-wise operator
imgs2 = fn.op2(imgs, magnitude = m)
...
imgs13 = fn.op13(imgs, magnitude = m)
# select
out = fn.sample_wise_select(imgs, imgs1, imgs2, ..., imgs13) # imgs for identity operation

Can you check if my understanding is right?

klecki commented 3 years ago

It seems we are in opposite time zones :)

Yes, it would appear so :)

I saw your 2021 roadmap and your comments, but I'm not sure what that means exactly.

The final APIs are still taking shape, and we have just started the necessary groundwork in the backend.

We are still discussing an API similar to what you wrote here:

imgs = fn.operator(imgs, probability = ...)

in our designs it is currently named mask, but the idea is the same. There are some limitations, though - you don't have control over the "else" case, and it won't work with operators that can change the type or the number of dimensions of the output. For the first issue we could maybe insert a fn.cast, but it's getting complicated.

Most probably, we will introduce if statements, so the code will look like:

if fn.random.coin_flip():
    imgs = fn.operator(imgs)
else:
    imgs = fn.other_operator(imgs)

keeping the appearance of per-sample control flow while still processing batches. For AutoAugment you will be able to have a branch for each sub-policy.

To implement applying operations conditionally, we plan to add two new operators: one that will split a batch into smaller ones, and another that can merge the batches back. So, internally, we will have "mini-batches". In each conditional branch we will be applying the operator to a mini-batch; that way we still utilize the batched approach and benefit from it with bigger batches, as long as we don't have too many conditions.

The select_sample_wise or something similar will be supported, but we want to conditionally run the augmentation for a given sample rather than compute every possibility for the whole batch and assemble the result at the end - that would be a lot of unnecessary work.

From what I saw in the RandAugment paper, they show pseudocode that selects k augmentations out of the set in random order and applies them to a sample. As I said, with DALI's graph approach that would be rather hard to express.

If you can limit the number of permutations, or you are fine with applying them in some particular order, it might be doable.

whnbaek commented 3 years ago

Thank you for your very detailed explanation. Now I understand what you're doing. Could you check my understanding one last time?

AutoAugment is something like

# split the batch, with each sample randomly assigned to one of 5 mini-batches
batches = fn.split_batches(imgs, num_batches = 5) # 5 TensorLists
# policy 1
if fn.random.coin_flip(probability = p1): # conditional execution per-sample
    batches[0] = fn.op1(batches[0], magnitude = m1)
# else flow is for identity op
if fn.random.coin_flip(probability = p2):
    batches[0] = fn.op2(batches[0], magnitude = m2)
...
# policy 5
if fn.random.coin_flip(probability = p9):
    batches[4] = fn.op9(batches[4], magnitude = m9)
if fn.random.coin_flip(probability = p10):
    batches[4] = fn.op10(batches[4], magnitude = m10)
# merge batches
imgs = fn.merge_batches(batches)

and RandAugment is something like

batches = fn.split_batches(imgs, num_batches = 14) # 14 TensorLists
batches[0] = fn.op1(batches[0], magnitude = m)
batches[1] = fn.op2(batches[1], magnitude = m)
...
batches[12] = fn.op13(batches[12], magnitude = m)
# batches[13] is for identity op
imgs = fn.merge_batches(batches)
# the above is one layer of RandAugment; if K augmentations (K layers) are needed, repeat it K times

Sorry for asking so many questions.

klecki commented 3 years ago

In the current state I think the AutoAugment example can look like this:

policy = fn.random.uniform(values=[0, 1, 2, 3, 4]) # 5 Policies
if policy == 0:  
    images = fn.op1(images, magnitude = m1) 
    images = fn.op2(images, magnitude = m2) 
if policy == 1:
    images = fn.op1(images, magnitude = m3) 
    images = fn.op2(images, magnitude = m4) 
...

if policy == 4:
    if fn.random.coin_flip(probability = p5):
        images = fn.op1(images, magnitude = m5) 
    if fn.random.coin_flip(probability = p6):
        images = fn.op2(images, magnitude = m6) 
    if fn.random.coin_flip(probability = p7):
        images = fn.op3(images, magnitude = m7) 

For simplicity, I used probabilities for applying specific ops within a given sub-policy only in the last branch (policy == 4). That could be done either with nested if statements, as shown, or with the probability/mask argument that you could mix in there as well - that's an open question right now.

The if statement itself is intended to do the split/merge operations for the user, and internally it probably will be close to the example you provided.

For RandAugment, I think you are correct, and that is a good suggestion. Having the one layer you proposed (split into 14 "mini-batches", apply one of the augmentations, merge back) and using that layer in a sequence can probably achieve the desired effect. You just need to make sure you don't repeat the same augmentation for a given sample - you can probably craft such a conditional statement, or if needed we can maybe add a specialized random generator. But it's hard to tell how much splitting into many paths (here we have 13 or 14) will affect performance.

I will be taking those suggestions into account in our designs, thanks for the input!

whnbaek commented 3 years ago

You just need to make sure you don't repeat the same augmentation for a given sample

Oh, it seems we have a little misunderstanding. As I understand it, the same operation can be applied more than once in RandAugment. The code in the paper is like this:

import numpy as np

transforms = [
    'Identity', 'AutoContrast', 'Equalize', 'Rotate', 'Solarize', 'Color',
    'Posterize', 'Contrast', 'Brightness', 'Sharpness', 'ShearX', 'ShearY',
    'TranslateX', 'TranslateY']

def randaugment(N, M):
    sampled_ops = np.random.choice(transforms, N)
    return [(op, M) for op in sampled_ops]

According to the np.random.choice documentation, the parameter replace is True by default. That means this code can select the same augmentation more than once. The paper also tells us that

RandAugment may thus express K^N potential policies.

not C(K, N). So it is fine to split the process of applying K augmentations out of the set into K layers, with each layer applying only one augmentation from the set.
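To make the count concrete, here is a quick check of my own, using the paper's notation where K = 14 is the size of the transform set and N is the number of layers applied in order:

import math

K, N = 14, 2                 # 14 transforms in the set, 2 layers applied in order
print(K ** N)                # 196 policies: ordered layers, repetition allowed
print(math.comb(K, N))       # 91 policies: unordered, no repetition - would undercount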

whnbaek commented 3 years ago

Thinking about it more, if we have either of the two (if/else or split/merge), we can implement sample-wise conditional execution! RandAugment is in a sense a subset of AutoAugment, so let's take AutoAugment as the example. If we only have if/else support, as you already mentioned, we can implement it with fn.random.uniform and fn.random.coin_flip. If instead we have split/merge, AutoAugment can be:

minibatches = fn.split(imgs, probability = [0.2, 0.2, 0.2, 0.2, 0.2])
# policy 1
policy1_batches = fn.split(minibatches[0], probability = [p1, 1 - p1])
policy1_batches[0] = fn.op1(policy1_batches[0], ...)
minibatches[0] = fn.merge(policy1_batches)
policy1_batches = fn.split(minibatches[0], probability = [p2, 1 - p2])
policy1_batches[0] = fn.op2(policy1_batches[0], ...)
minibatches[0] = fn.merge(policy1_batches)
...
# policy 5
...
imgs = fn.merge(minibatches)

Or, going further, we can merge the nested fn.split calls into one big fn.split by multiplying the probabilities. If supporting if/else turns out to be troublesome, then introducing split/merge could be a good idea.

klecki commented 3 years ago

You are right, I thought np.random.choice returns elements without repetition. Thanks for the correction and the additional link!

Thinking about it more, if we have either of the two (if/else or split/merge), we can implement sample-wise conditional execution!

Yes, the if/else is basically intended as syntactic sugar over the manual split/merge, and will effectively be "compiled" to it. We want to start with it rather than directly exposing split/merge, as the "scope" of the if should limit your chances of trying to work with mini-batches of different sizes (the split doesn't have to result in equally sized mini-batches).
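As a toy illustration of that lowering (plain Python of my own, not the DALI API), a mask recorded at the split lets the merge restore the original sample order; note the two mini-batches here have different sizes:

def split(batch, predicate):
    # route each sample to the true or false branch, remembering the choice
    true_part, false_part, mask = [], [], []
    for sample in batch:
        flag = predicate(sample)
        mask.append(flag)
        (true_part if flag else false_part).append(sample)
    return true_part, false_part, mask

def merge(true_part, false_part, mask):
    # interleave the two branches back in the original order
    it_t, it_f = iter(true_part), iter(false_part)
    return [next(it_t) if flag else next(it_f) for flag in mask]

batch = [1, 2, 3, 4, 5]
evens, odds, mask = split(batch, lambda x: x % 2 == 0)
evens = [x * 2 for x in evens]                  # "operator" applied to one branch only
assert merge(evens, odds, mask) == [1, 4, 3, 8, 5]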

JulianKlug commented 3 years ago

@baneling100: Would you mind sharing your code for the whole RandAugment implementation with DALI?

whnbaek commented 3 years ago

@JulianKlug Sorry for the late reply. I've changed my research direction, which means I cannot give you meaningful code for RandAugment attached to DALI; it remains unfinished. You may have to wait until DALI supports the conditional operations needed to implement RandAugment in Python.

JulianKlug commented 3 years ago

@baneling100 - Ok thank you!

CoinCheung commented 2 years ago

What is the status of this, please? Does DALI support RandAugment natively now?

JanuszL commented 2 years ago

Hi @CoinCheung,

To support your request, DALI needs conditional execution - the ability to apply different operators per sample. We are actively working on this, and I hope to have it available soon.

klecki commented 1 year ago

Hi, support for conditional execution in DALI, expressed as if statements, was recently merged. It is already available in the nightly builds and will be part of the upcoming 1.23 release. The functionality is currently experimental and we are still working on adding more features to it. The documentation is under review in this PR: https://github.com/NVIDIA/DALI/pull/4589. You can preview the tutorial from that PR under this link: https://github.com/klecki/DALI/blob/cond-intro-tutorial/docs/examples/general/conditionals.ipynb. I will post an update when the actual documentation is visible in the nightly docs.
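For a quick taste, a minimal pipeline using the new syntax might look like the sketch below (my adaptation of the linked tutorial; image_dir is a placeholder path, and details may differ between versions):

from nvidia.dali import pipeline_def, fn, types

@pipeline_def(enable_conditionals=True, batch_size=8, num_threads=4, device_id=0)
def conditional_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir)  # image_dir: placeholder dataset path
    images = fn.decoders.image(jpegs, device="mixed")
    if fn.random.coin_flip(probability=0.5, dtype=types.BOOL):  # per-sample condition
        images = fn.rotate(images, angle=45.0, fill_value=0)
    return images, labels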

As for AutoAugment and RandAugment, we have some prototypes based on the if/conditionals support, which we are testing to iron out some details of our implementation. We plan to introduce them as part of the library so they can be easily used as well as customized. We are also investigating TrivialAugment. I will post updates when we have something ready.

CoinCheung commented 1 year ago

@klecki Hi, thanks for working on this! Did you implement any benchmarks comparing the DALI RandAugment and the PIL RandAugment? Or did you try to train MoCo v2 with DALI, and can the results be aligned with the original implementation?

klecki commented 1 year ago

Hi @CoinCheung, one phase of the testing is running DALI with AutoAugment in our EfficientNet implementation (replacing PyTorch's implementation of AutoAugment based on PIL transforms): https://github.com/NVIDIA/DeepLearningExamples/blob/master/PyTorch/Classification/ConvNets/efficientnet/README.md

We are working on some additional materials related to both the conditionals and the automatic augmentations, and we will try to include some benchmarks there.

klecki commented 1 year ago

Hi @whnbaek, @JulianKlug, @CoinCheung, starting with DALI 1.24 we support AutoAugment, RandAugment, and TrivialAugment. DALI 1.25, released recently, includes further improvements and additional policies.

You can read more in the nvidia.dali.auto_aug module documentation available here: docs.nvidia.com/deeplearning/dali/user-guide/docs/auto_aug/auto_aug.html
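In short, usage boils down to something like this sketch (assuming DALI >= 1.24; image_dir is a placeholder path, and the linked docs are authoritative for the exact options):

from nvidia.dali import pipeline_def, fn
from nvidia.dali.auto_aug import rand_augment

@pipeline_def(enable_conditionals=True, batch_size=32, num_threads=4, device_id=0)
def rand_augment_pipeline():
    jpegs, labels = fn.readers.file(file_root=image_dir)  # image_dir: placeholder dataset path
    images = fn.decoders.image(jpegs, device="mixed")
    images = fn.resize(images, size=[224, 224])
    images = rand_augment.rand_augment(images, n=2, m=9)  # n layers, magnitude m
    return images, labels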

There is also an example adapted from the Deep Learning Examples repository showcasing the usage of AutoAugment with DALI for EfficientNet training: docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/use_cases/pytorch/efficientnet/readme.html

@CoinCheung, we just published a blog post regarding automatic augmentations, where we benchmarked AutoAugment implemented in DALI versus one based on PyTorch data loader and PIL image transformations: https://developer.nvidia.com/blog/why-automatic-augmentation-matters/

We don't have a benchmark for RandAugment specifically, but DALI uses the same underlying implementation for all automatic augmentations (AutoAugment, RandAugment, and TrivialAugment), so they should behave similarly with respect to performance.

klecki commented 1 year ago

As for the conditional execution itself, here is our documentation for this feature: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/pipeline.html#conditional-execution

And here is a tutorial with an example: https://docs.nvidia.com/deeplearning/dali/user-guide/docs/examples/general/conditionals.html