Closed jestiny0 closed 1 year ago
Hi @jestiny0
Thanks for the question. Can you help us by providing some information for the following questions?
- If possible, would you be able to share the torchscript model that is leading to this behavior (or a similar model so that we can reproduce easily)?
- You mentioned that you are generating the torchscript model with torch 1.10.0 and then trying to run it with torch 1.12.1/1.11.0. Have you tried generating the torchscript model with the same version of torch you plan to serve it with? It would be good to know whether you still see the performance decrease when you compile the model with torch 1.12.1 and run it with that same torch 1.12.1. This is what I suspect the issue is, though you mentioned that using torch 1.11 doesn't lead to the behavior.
I have tried generating the TorchScript model with the same version of torch used for serving, but unfortunately the results were the same. Besides, I tried these combinations:
Related to this, have you tried running the torchscript model (compiled on 1.10.0) in python with torch 1.12.1 and examined the execution? I'd be interested to know if you see the same behavior
Both 1.10.0 and 1.12.1 showed the same execution behavior.
For the other models where performance regression is not as noticeable, do you see the same behavior when profiling? Are there missing fused ops where you expect them?
They all miss the fused ops on torch 1.11.0/1.12.0 with DJL 0.18/0.19.
Are the input shapes identical across load tests for the different versions? From the doc you linked, it says "Fusion groups are only legal to run when the input shapes are exactly the same as we saw during profiling runs (they were encoded in the JIT IR before the fuser pass)".
I always use the same load-test driver, so the inputs are identical across the load tests.
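Since fusion groups are only legal for the exact input shapes seen during profiling, one way to sanity-check from plain Python whether fusion actually happened is to warm a scripted function up with a fixed shape and then inspect the last executed optimized graph for `TensorExprGroup` nodes. This is just a sketch (assuming a torch build with the profiling executor, 1.8+):

```python
import torch

torch._C._jit_override_can_fuse_on_cpu(True)  # allow NNC fusion on CPU

@torch.jit.script
def pointwise(x):
    # a small chain of fusible pointwise ops
    return torch.mul(torch.sin(x), x)

a = torch.randn(1, 1, 128, 128)
for _ in range(10):  # profiling runs; the observed shapes get encoded in the IR
    pointwise(a)

graph = torch.jit.last_executed_optimized_graph()
print("TensorExprGroup" in str(graph))  # shows whether a fused group was formed
```

If this prints `True` locally but the same model shows no fused ops under DJL, the difference is in how DJL drives the graph executor rather than in the model itself.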
@siddvenk My key doubt is that the newer DJL versions (0.18.0 and 0.19.0) do not trigger the fuser optimizations. You can try the simple model mentioned in this doc:
```python
import torch

def foo(a):
    b = torch.conv2d(a, torch.randn(1, 1, 1, 1))  # not fusible
    x = torch.mul(b, b)                            # fusible
    y = torch.sin(x)                               # fusible
    z = torch.mul(y, y)                            # fusible
    return z

torch._C._jit_override_can_fuse_on_cpu(True)

a = torch.randn(1, 1, 128, 128)
scripted = torch.jit.script(foo)

# do several runs:
for _ in range(10):
    scripted(a)
```
and you can start the DJL engine with the environment variable

```
PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser.cpp"
```
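For a quick check outside of DJL, the same logs can also be enabled from Python. A sketch; the variable is read by native code, so it is safest to set it before importing torch:

```python
import os

# Enable the NNC fuser's debug logging before torch's JIT initializes.
os.environ["PYTORCH_JIT_LOG_LEVEL"] = "tensorexpr_fuser.cpp"

import torch

torch._C._jit_override_can_fuse_on_cpu(True)

@torch.jit.script
def f(x):
    return torch.sin(x) * x

x = torch.randn(4, 4)
for _ in range(10):  # fuser logs, if any, appear on stderr during these runs
    f(x)
```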
You will see the fusion logs with the lower versions (0.17.0 and earlier) but not with the higher versions (0.18.0/0.19.0).
So I want to know why DJL added the following line for the newer versions. I suppose it disables the fusion pass:

```cpp
torch::jit::GraphOptimizerEnabledGuard no_optimizer_guard{false};
```

I don't know if you have referred to this code, but I found this line is unnecessary and may cause unexpected effects for DJL's newer PyTorch versions. (It has existed in the PyTorch git repo since 2019.)
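The C++ guard appears to toggle the same flag that `torch._C._set_graph_executor_optimize` controls from Python, so its effect can be reproduced without DJL. A sketch, assuming that equivalence:

```python
import torch

torch._C._jit_override_can_fuse_on_cpu(True)

@torch.jit.script
def f(x):
    return torch.sin(x) * x

x = torch.randn(1, 1, 128, 128)

# Python analogue of GraphOptimizerEnabledGuard{false}: disable the graph
# executor's optimizations (and with them the fuser) for these runs.
prev = torch._C._get_graph_executor_optimize()
torch._C._set_graph_executor_optimize(False)
try:
    for _ in range(10):
        f(x)
    unopt_graph = str(torch.jit.last_executed_optimized_graph())
finally:
    torch._C._set_graph_executor_optimize(prev)  # restore, like the guard's destructor

print("TensorExprGroup" in unopt_graph)  # fusion should not appear while disabled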
Thanks for the updates.
The GraphOptimizerEnabledGuard seems to do two things:
I think the intended use of this guard is for native mobile applications (our android module), but it seems to be impacting performance outside of android as well.
Thanks for calling this out. Your comments and deep dive are very useful. We'll take a look at this issue and come up with a fix.
@jestiny0 we pushed a change out to disable this guard outside of the android scope. It should be available in our next snapshot release (happens nightly).
You can try using DJL version 0.20.0-SNAPSHOT + torch1.12.1 - please let us know how that goes for you.
Thanks @siddvenk , it works well!
@siddvenk Do you have an expected release date for the new version (0.20.0)?
Good to hear this worked for you!
We don't have a specific timeline in mind at the moment, but we typically release new DJL versions every 45-60 days. I would estimate that a new release will be available in the next 4-6 weeks.
Description
We want to upgrade our serving service's DJL (including the torch version) to a newer version (DJL 0.19 with pytorch-native-cpu 1.12.1), but one of our serving models shows a very serious performance regression. After some investigation, I suspect the higher DJL version does not trigger the torchscript tensorexpr fuser optimizations.
env details
training and torchscript version: torch 1.10.0
serving djl version:
before upgrading (current):
After upgrading
or
When we deploy the higher DJL versions (DJL 0.19.0 + torch native 1.12.0/1.11.0) and run a load test, we see a very serious performance regression:
Before upgrading:
(46 cores total)
After upgrading:
What I did
I have tested several models, but only one shows an obvious performance problem. Admittedly, that model has a more complex structure and implementation than the others. All models are packaged and served as torchscript. Using DJL's profiler, I found many torchscript tensorexpr fuser optimizations in the lower version but none in the higher version.

Before upgrading:
After upgrading:
About the torchscript tensorexpr fuser: https://github.com/pytorch/pytorch/blob/master/test/cpp/tensorexpr/tutorial.cpp

Because the model is complex, this optimization is essential. I added logs according to this tutorial (https://dev-discuss.pytorch.org/t/nnc-walkthrough-how-pytorch-ops-get-fused/125) but found that DJL 0.19.0 + torch JNI 1.12.1/1.11.0 produce no log output. I tried DJL 0.17.0 + torch JNI 1.11.0, and it works! Its performance is the same as before!
My doubt
DJL appears to invoke "InferenceMode" for the newer versions (after DJL 0.17.0): https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-native/src/main/native/ai_djl_pytorch_jni_PyTorchLibrary_inference.cc

I suppose this change causes torchscript's tensorexpr fuser optimization not to be triggered, but I'm not sure. I'd like to hear whether you have any opinions or other hypotheses about this!
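To isolate whether InferenceMode by itself explains the regression, one could time the same scripted function with and without it in plain Python. A sketch (not DJL code; `torch.inference_mode` is available since torch 1.9):

```python
import time
import torch

@torch.jit.script
def f(x):
    return torch.mul(torch.sin(x), x)

x = torch.randn(1, 1, 128, 128)

def bench(runs: int = 50) -> float:
    start = time.perf_counter()
    for _ in range(runs):
        f(x)
    return time.perf_counter() - start

for _ in range(10):  # warm-up so profiling/fusion has settled
    f(x)

normal = bench()
with torch.inference_mode():  # roughly what the newer JNI inference path enables
    inference = bench()

print(f"normal: {normal:.4f}s  inference_mode: {inference:.4f}s")
```

If the two timings are close, InferenceMode alone is unlikely to be the culprit, which would point back at the optimizer guard discussed above.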
Expected Behavior
There is little change in performance after the upgrade.
How to Reproduce?
A complex torchscript model that benefits from tensorexpr optimization, served with a higher DJL version (0.18/0.19).
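For reference, a minimal model artifact of that kind can be produced with a deliberately long pointwise chain. A sketch (the module and file name are made up); save it and load the file from DJL for the load test:

```python
import torch

class PointwiseChain(torch.nn.Module):
    """A toy model whose forward is dominated by fusible pointwise ops."""

    def forward(self, x):
        y = torch.sin(x)
        for _ in range(8):  # long chain -> plenty for the NNC fuser to work on
            y = torch.sin(torch.mul(y, y))  # stays numerically bounded, still fusible
        return y

scripted = torch.jit.script(PointwiseChain())
scripted.save("pointwise_chain.pt")  # load this file from DJL for the load test
```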