binliunls opened 2 months ago
Now that the `Range` function has been added to the bundle, we can take a look at the latency detail of one inference iteration, which is shown below. All the gray boxes under `SW_Patchforward_Loop` stand for a model computation/prediction call with the VISTA3D network, labeled `SW_Model_Computation` in the image. Since the bundle uses sliding window inference and the given image size (512x512x77) is larger than the sliding window size (128x128x128), one sliding window inference iteration actually contains several `SW_Model_Computation` iterations.
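For reference, here is a minimal sketch of how such a named range can be attached to the per-window forward call so that it shows up as a labeled box in the Nsight Systems timeline. It assumes MONAI's `monai.utils.nvtx.Range` can be used as a context manager; the placeholder network and the range label are illustrative, not the bundle's actual code.

```python
# Minimal, illustrative sketch (not the bundle code): wrap a forward call in an
# NVTX range so it appears as a named box in the Nsight Systems timeline.
# Assumes monai.utils.nvtx.Range supports context-manager usage.
import torch
from monai.utils.nvtx import Range

model = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1).eval()  # stand-in for the VISTA3D network
window = torch.rand(1, 1, 128, 128, 128)                        # one sliding-window patch

with torch.no_grad():
    with Range("SW_Model_Computation"):  # label chosen to match the trace above
        _ = model(window)
```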
By zooming into the red box, we can further analyze the latency of one `SW_Model_Computation` iteration. As shown in the image below, each `SW_Model_Computation` executes `cudaFree` (pink) --> `cuStreamSynchronize` (green) in order. These overhead function calls, which are introduced by the loop inference, take a large percentage of `SW_Model_Computation`.
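To get a rough sense of how often this overhead is paid, we can estimate the number of sliding windows per inference iteration. The sketch below assumes MONAI's default 0.25 window overlap (the bundle's actual overlap may differ), so the counts are only an estimate:

```python
# Rough estimate of how many 128^3 windows a 512x512x77 volume produces
# (assuming overlap=0.25, i.e. a scan interval of roi * (1 - overlap)).
# With sw_batch_size=1, each window is a separate forward call that pays
# the cudaFree/cuStreamSynchronize overhead shown above.
import math

image_size = (512, 512, 77)
roi_size = (128, 128, 128)
overlap = 0.25

windows_per_dim = []
for img, roi in zip(image_size, roi_size):
    img = max(img, roi)                  # dims smaller than the window are padded up
    interval = int(roi * (1 - overlap))  # 96 for a 128 window
    windows_per_dim.append(math.ceil((img - roi) / interval) + 1)

print(windows_per_dim, math.prod(windows_per_dim))  # e.g. [5, 5, 1] -> 25 windows
```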
As analyzed above, these overhead function calls are the bottleneck of the current bundle's inference latency. In order to reduce them, we can increase the `sw_batch_size` parameter of the sliding window inference. Please note that this will also require more GPU memory. When we increase the `sw_batch_size`, the latency detail looks like below. The latency has been optimized from 3.53 seconds to 2.36 seconds.
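For illustration, here is a minimal sketch of where `sw_batch_size` enters the call, assuming the bundle ultimately goes through MONAI's `sliding_window_inference`; the placeholder network is not VISTA3D, and the value 14 matches the setting used further below:

```python
# Minimal sketch: sw_batch_size controls how many 128^3 windows are stacked into
# one forward pass, so a larger value means fewer SW_Model_Computation calls
# (and fewer per-call overheads) at the cost of more GPU memory.
import torch
from monai.inferers import sliding_window_inference

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Conv3d(1, 2, kernel_size=3, padding=1).to(device).eval()  # placeholder network
image = torch.rand(1, 1, 512, 512, 77, device=device)                      # same size as the profiled image

with torch.no_grad():
    pred = sliding_window_inference(
        inputs=image,
        roi_size=(128, 128, 128),
        sw_batch_size=14,  # batch 14 windows per forward pass
        predictor=model,
    )
```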
In addition to the inference iterations, the second and third largest time consumers are `LoadingImage` and `SaveImage`.
The relation between latency (in seconds) and `sw_batch_size` is:

image (shape) \ `sw_batch_size` | 1 | 4 | 8 | 10 | 12 | 14 | 16 | 20
---|---|---|---|---|---|---|---|---
spleen_12 (512x512x168) | 2.771 | 2.085 | 2.351 | 2.208 | 1.906 | 2.095 | 1.987 | 1.896
spleen_38 (512x512x100) | 5.280 | 3.887 | 3.535 | 3.689 | 3.691 | 3.592 | 3.838 | 3.547
spleen_10 (512x512x55) | 3.404 | 2.803 | 2.508 | 2.362 | 2.289 | 2.301 | 2.594 | 2.740
spleen_9 (512x512x41) | 2.772 | 2.066 | 2.071 | 2.168 | 1.788 | 2.186 | 2.079 | 1.924
Set the `sw_batch_size` to 14 and run the bundle with TRT inference (only compiling the encoder and running all-classes segmentation). The benchmarks are shown below. The upper one is the latency detail of the original bundle, while the lower one is the TRT bundle.
Hi @borisfom, I didn't see a significant improvement from the encoder. The inference latencies of the TRT and non-TRT bundles are nearly the same. Could you please offer some suggestions here? Thanks in advance!
Original bundle
TRT bundle
@binliunls: well, it seems to be the case that TRT does not help much with this net - I was not able to run batch=14 on my box, but batch=8 gave a similar result. In fact, batch=1 does not give a lot of improvement either. It looks like TRT is currently only running a small fraction of the encoder's forward pass - and expanding that may not be straightforward. I will look a bit more at the model, but most likely, expanding the converted part won't be possible, as that would make it dependent on dynamic arguments.
This does improve the inference on the V100 32GB GPU, with a maximum batch size of 6. Here are the details.
MONAI bundle:
TRT bundle:
Note that there is a memory malloc that is unnecessary for all-classes inference, which is related to this line.
@binliunls : wow, that's massive sync apparently caused by waiting for TRT results - does it actually help removing it though ?
Hi @borisfom, the result didn't use TRT. It's just a straightforward MONAI bundle, because on A100 they show basically the same performance and MONAI bundles are easier to run. And yes, removing it will help improve the latency, since removing the cudaMalloc call has already saved roughly 200-300 ms. I will try to figure out where these API calls happen in the code and see if we can further improve the performance.
Thanks
Bin
The embedding mask part of the classification head is another high-latency part that can be optimized, as shown in the image below.
It uses a Python for-loop to perform the tensor multiplication, which is inefficient. The code snippet looks like:
b, c, h, w, d = src.shape
masks = []
for i in range(b):
    mask = class_embedding @ src[[i]].view(1, c, h * w * d)
    masks.append(mask.view(-1, 1, h, w, d))
We can refactor it into a single broadcast tensor multiplication call like:

b, c, h, w, d = src.shape
c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
Here is a simple test case to verify that the two implementations produce the same result:
import torch


def mat_mul2(class_embedding, src):
    # Broadcast version: a single batched matmul over all windows.
    b, c, h, w, d = src.shape
    c = class_embedding.squeeze() @ src.view(b, c, h * w * d)
    c = c.view(b, -1, h, w, d)
    return torch.transpose(c, 0, 1)


def mat_mul1(class_embedding, src):
    # Loop version: one matmul per window, concatenated afterwards.
    b, c, h, w, d = src.shape
    masks = []
    for i in range(b):
        mask = class_embedding @ src[[i]].view(1, c, h * w * d)
        masks.append(mask.view(-1, 1, h, w, d))
    return torch.cat(masks, 1)


a = torch.rand((17, 1, 4))
b = torch.rand(4, 4, 12, 12, 12)
ans1 = mat_mul1(a, b)
ans2 = mat_mul2(a, b)
assert torch.allclose(ans1, ans2)
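As a rough, illustrative way to compare the two implementations directly (reusing `mat_mul1` and `mat_mul2` from the test above; the tensor shapes here are made up and much smaller than the bundle's real ones, so the absolute numbers will not match the figures below):

```python
import time

import torch


def avg_time(fn, class_embedding, src, iters=20):
    # Warm up, then average over `iters` calls; synchronize so GPU timing is meaningful.
    for _ in range(3):
        fn(class_embedding, src)
    if src.is_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn(class_embedding, src)
    if src.is_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


device = "cuda" if torch.cuda.is_available() else "cpu"
class_embedding = torch.rand(17, 1, 48, device=device)  # illustrative class embeddings
src = torch.rand(14, 48, 32, 32, 32, device=device)     # illustrative features for 14 windows

print("loop      :", avg_time(mat_mul1, class_embedding, src))
print("broadcast :", avg_time(mat_mul2, class_embedding, src))
```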
After replacing the embedding mask calculation with the new approach, the whole sliding window inference latency is reduced from 2.302 s to 1.830 s, and the embedding mask part latency drops from 278 ms to 4 ms.
Description
This PR benchmarks, analyzes and optimizes the VISTA3D bundle all-classes segmentation inference to achieve better latency. I will add all the benchmark results and analyses in the PR comments, while the general conclusions will be kept up to date here in the PR description.
The MONAI core code also needs to be updated according to this PR.
Status
Work in progress
Conclusion
TODO: