Project-MONAI / model-zoo

MONAI Model Zoo that hosts models in the MONAI Bundle format.
Apache License 2.0

Benchmark VISTA3D #671

Open binliunls opened 2 months ago

binliunls commented 2 months ago

Description

This PR benchmarks, analyzes, and optimizes the VISTA3D bundle's all-class segmentation inference to achieve better latency. I will add the benchmark results and analyses in the PR comments, and the general conclusions will be kept up to date here in the PR description.

The MONAI core code also needs to be updated to match this PR.

Status

Work in progress

Conclusion

  1. A larger sw_batch_size (~14) reduces latency on the A100 (80GB) GPU.

TODO:

binliunls commented 2 months ago

Now that the Range annotations have been added to the bundle, we can look at the latency detail of one inference iteration, shown below. All the gray boxes under SW_Patchforward_Loop represent a model computation/prediction call of the VISTA3D network, labeled SW_Model_Computation in the image. Since the bundle uses sliding window inference and the given image size (512x512x77) is larger than the sliding window size (128x128x128), one sliding window inference iteration actually contains several SW_Model_Computation iterations.

image
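
For context, here is a minimal sketch of how such a named NVTX range can be emitted around a block of code so it shows up in the Nsight Systems timeline. `monai.utils.nvtx.Range` is the MONAI helper referred to above and needs the optional `nvtx` package; the wrapper function and the exact range name are illustrative, not the bundle's actual code:

```python
# Illustrative sketch: wrap one forward pass of the network in a named NVTX range
# so it appears as a labeled box (e.g. SW_Model_Computation) in the Nsight trace.
from monai.utils.nvtx import Range  # requires the optional `nvtx` package

def profiled_forward(network, window_batch):
    with Range("SW_Model_Computation"):  # entering/exiting emits NVTX range markers
        return network(window_batch)
```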

By zooming into the red box, we can further analyze the latency of one SW_Model_Computation iteration. As shown in the image below, each SW_Model_Computation executes cudaFree (pink) --> cuStreamSynchronize (green) in order. These overhead calls, introduced by the loop inference, take a large percentage of each SW_Model_Computation.

image

As analyzed above, the overhead function calls are the bottleneck of the current bundle's inference latency. To reduce them, we can increase the sw_batch_size parameter of the sliding window inference; note that this also requires more GPU memory. After increasing sw_batch_size, the latency detail looks like the image below. The latency has been reduced from 3.53 seconds to 2.36 seconds.

image
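
For reference, a minimal sketch of where sw_batch_size enters MONAI's standard sliding window API (the bundle sets this value through its inference config rather than calling the function directly; run_sliding_window, network and image are placeholders, and overlap=0.25 is an assumed default rather than the bundle's exact setting):

```python
from monai.inferers import sliding_window_inference

def run_sliding_window(network, image):  # image e.g. of shape (1, 1, 512, 512, 77)
    # A larger sw_batch_size packs more 128^3 windows into each forward pass,
    # so fewer SW_Model_Computation iterations (and their cudaFree /
    # cuStreamSynchronize overhead) are needed per image, at the cost of GPU memory.
    return sliding_window_inference(
        inputs=image,
        roi_size=(128, 128, 128),  # sliding window size
        sw_batch_size=14,          # value that worked well on the 80GB A100
        predictor=network,
        overlap=0.25,
    )
```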

Apart from the inference iterations, the second- and third-largest time consumers are LoadingImage and SaveImage.

image

binliunls commented 2 months ago

The relation between latency (in seconds) and sw_batch_size is:

| sw_batch_size | 1 | 4 | 8 | 10 | 12 | 14 | 16 | 20 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| spleen_12 (512x512x168) | 2.771 | 2.085 | 2.351 | 2.208 | 1.906 | 2.095 | 1.987 | 1.896 |
| spleen_38 (512x512x100) | 5.280 | 3.887 | 3.535 | 3.689 | 3.691 | 3.592 | 3.838 | 3.547 |
| spleen_10 (512x512x55) | 3.404 | 2.803 | 2.508 | 2.362 | 2.289 | 2.301 | 2.594 | 2.74 |
| spleen_9 (512x512x41) | 2.772 | 2.066 | 2.071 | 2.168 | 1.788 | 2.186 | 2.079 | 1.924 |

image
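
For reference, a small matplotlib sketch (not part of the bundle) that reproduces a latency-vs-sw_batch_size plot like the one above from the table data:

```python
import matplotlib.pyplot as plt

batch_sizes = [1, 4, 8, 10, 12, 14, 16, 20]
latencies = {  # seconds, copied from the table above
    "spleen_12 (512x512x168)": [2.771, 2.085, 2.351, 2.208, 1.906, 2.095, 1.987, 1.896],
    "spleen_38 (512x512x100)": [5.280, 3.887, 3.535, 3.689, 3.691, 3.592, 3.838, 3.547],
    "spleen_10 (512x512x55)": [3.404, 2.803, 2.508, 2.362, 2.289, 2.301, 2.594, 2.740],
    "spleen_9 (512x512x41)": [2.772, 2.066, 2.071, 2.168, 1.788, 2.186, 2.079, 1.924],
}

for name, values in latencies.items():
    plt.plot(batch_sizes, values, marker="o", label=name)
plt.xlabel("sw_batch_size")
plt.ylabel("latency (s)")
plt.legend()
plt.show()
```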

binliunls commented 2 months ago

Set sw_batch_size to 14 and run the bundle with TRT inference (only the encoder is compiled, and all-classes segmentation is run). The benchmarks are shown below: the upper one is the latency detail of the original bundle, while the lower one is the TRT bundle.

Hi @borisfom I didn't see a significant improvement for the encoder. The inference latencies of TRT and non-TRT are nearly the same. Could you please offer some suggestions here? Thanks in advance!

Original bundle image

TRT bundle image

borisfom commented 2 months ago

@binliunls : well, it seems to be the case that TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result. In fact, batch=1 does not give much improvement either. It looks like TRT is currently only running a small fraction of the encoder's forward pass, and expanding that may not be straightforward. I will look a bit more at the model, but most likely, expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.

binliunls commented 2 months ago

> @binliunls : well, it seems to be the case that TRT does not help much with this net. I was not able to run batch=14 on my box, but batch=8 gave a similar result. In fact, batch=1 does not give much improvement either. It looks like TRT is currently only running a small fraction of the encoder's forward pass, and expanding that may not be straightforward. I will look a bit more at the model, but most likely, expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.

However, TRT does improve inference on the V100 (32GB) GPU, where the maximum batch size is 6. Here are the details.

MONAI bundle: image

TRT bundle: image

binliunls commented 2 months ago

Note that there is a memory allocation (a cudaMalloc call) that is unnecessary for all-classes inference, which is related to this line.

image

borisfom commented 2 months ago

@binliunls : wow, that's a massive sync, apparently caused by waiting for TRT results - does removing it actually help, though?

binliunls commented 2 months ago

> @binliunls : wow, that's a massive sync, apparently caused by waiting for TRT results - does removing it actually help, though?

Hi @borisfom , this result didn't use TRT. It's just the straightforward MONAI bundle, because on the A100 the two perform basically the same and the MONAI bundle is easier to run. And yes, removing it helps improve latency: removing the cudaMalloc has already saved around 200-300 ms. I will try to figure out where these API calls happen in the code and see if we can further improve the performance.

Thanks
Bin

binliunls commented 2 months ago

The embedding-mask part of the classification head is another high-latency part that can be optimized, as shown in the image below.

image

It uses a Python for-loop to perform the tensor multiplications, which is inefficient. The code snippet looks like:

```python
b, c, h, w, d = src.shape
masks = []
for i in range(b):
    mask = class_embedding @ src[[i]].view(1, c, h * w * d)
    masks.append(mask.view(-1, 1, h, w, d))
```

We can refactor it into a single broadcast matrix multiplication: class_embedding.squeeze() has shape (N, C) and broadcasts against the (b, C, h*w*d) view of src, so one torch.matmul call handles all b items at once:

```python
b, c, h, w, d = src.shape
c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
```

Here is a simple test case to verify that the two implementations produce the same result:

```python
import torch

def mat_mul2(class_embedding, src):
    # Batched version: one broadcast matmul over the whole batch.
    b, c, h, w, d = src.shape
    c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
    c = c.view(b, -1, h, w, d)
    return torch.transpose(c, 0, 1)

def mat_mul1(class_embedding, src):
    # Original version: one matmul per batch item inside a Python loop.
    b, c, h, w, d = src.shape
    masks = []
    for i in range(b):
        mask = class_embedding @ src[[i]].view(1, c, h * w * d)
        masks.append(mask.view(-1, 1, h, w, d))
    return torch.cat(masks, 1)

a = torch.rand((17, 1, 4))        # class embedding: (num_classes, 1, channels)
b = torch.rand(4, 4, 12, 12, 12)  # feature map: (batch, channels, h, w, d)
ans1 = mat_mul1(a, b)
ans2 = mat_mul2(a, b)
assert torch.allclose(ans1, ans2)
```

After replacing the embedding-mask calculation with the new implementation, the whole sliding window inference latency drops from 2.302 s to 1.830 s. The embedding-mask latency itself drops from 278 ms to 4 ms.

image