DJL version 1.18/1.19 + pytorch native version 1.11.0/1.12.1 seems not to trigger torchscript tensorexpr fuser optimizations

jestiny0 commented 1 year ago

Description

We want to upgrading our serving service's DJL(including torch version) to a newer version (DJL 1.19 with djl.pytorch.pytorch-native-cpu 1.12.1), but find one of our serving models has a very serious performance reduction. After investigations I suppose higher DJL version seems not to trigger torchscript tensorexpr fuser optimizations.

env details

training and torchscript version: torch =1.10.0

serving djl version:

<dependency>
      <groupId>ai.djl</groupId>
      <artifactId>api</artifactId>
      <version>${djl.version}</version>
    </dependency>
    <dependency>
      <groupId>ai.djl.pytorch</groupId>
      <artifactId>pytorch-engine</artifactId>
      <version>${djl.version}</version>
    </dependency>
    <dependency>
      <groupId>ai.djl.pytorch</groupId>
      <artifactId>pytorch-native-cpu</artifactId>
      <classifier>linux-x86_64</classifier>
      <version>${torch.version}</version>
      <scope>runtime</scope>
    </dependency>
    <dependency>
      <groupId>ai.djl.pytorch</groupId>
      <artifactId>pytorch-jni</artifactId>
      <version>${torch.version}-${djl.version}</version>
      <scope>runtime</scope>
    </dependency>

and before upgrading (current):

<djl.version>0.15.0</djl.version>
<torch.version>1.10.0</torch.version>

After upgrading

<djl.version>0.19.0</djl.version>
<torch.version>1.12.0</torch.version>

or

<djl.version>0.19.0</djl.version>
<torch.version>1.11.0</torch.version>

When we deploy higher djl versions (both djl 0.19.0+ torch native 1.12.0/1.11.0) and apply a loadtesting, we get a very serious performance reduction:

QPS 500(request) -> 450
CPU load (26% -> almost 100%)
Latency p50 23ms -> 98ms

Before upgrading:

(46 cores total)

After upgrading:

What I did

I have tested on several models, but only one model has obvious performance problems. Absolutly the only one model has a more complex structure and implementation than other models. All the models use torchscript to package and serving. I use DJL's profiler to test and find there are many torchscript tensorexpr fuser optimizations in the low version but not in the higher version. Before upgrading:

After upgrading:

About torchscript tensorexpr fuser: https://github.com/pytorch/pytorch/blob/master/test/cpp/tensorexpr/tutorial.cpp Because the model is complex, the optimization is very necessary. I added logs according to this tutorial https://dev-discuss.pytorch.org/t/nnc-walkthrough-how-pytorch-ops-get-fused/125 but found that DJL 1.19.0 + torch jni 1.12.1/1.11.0 have no log output. I tried DJL 1.17.0 + torch jni 1.11.0, it works! And its performance is the same as before!

My doubt

DJL appears to be invoking the "InferenceMode" for the "newer" version of torch (After 1.17.0): https://github.com/deepjavalibrary/djl/blob/master/engines/pytorch/pytorch-native/src/main/native/ai_djl_pytorch_jni_PyTorchLibrary_inference.cc

#ifdef V1_10_X
  torch::autograd::AutoGradMode no_autograd_guard{false};
  torch::NoGradGuard no_grad;
#else
  c10::InferenceMode guard;
  torch::jit::GraphOptimizerEnabledGuard no_optimizer_guard{false};
#endif
};

I suppose this change causes torchscript's tensorexpr fuser optimization not to be triggered, but not sure. So I want to see if you have any opinions or other doubts about this question!

Expected Behavior

There is little change in performance after the upgrade.

How to Reproduce?

A complex torchscript model with tensorexpr optimization and higher DJL version (1.18/1.19)

siddvenk commented 1 year ago

Hi @jestiny0

Thanks for the question. Can you help us by providing some information for the following questions?

You mentioned that you are generating the torchscript model with torch=1.10.0 and then trying to use torch=1.12.1/1.11.0 to run the model. Have you tried generating the torchscript model using the version of torch you plan to serve it with? It would be good to know whether you see performance decrease when you compile the model with torch1.12.1 and run it with the same torch1.12.1. This is what I likely think the issue is, though you mentioned using torch1.11 doesn't lead to the behavior
- Related to this, have you tried running the torchscript model (compiled on 1.10.0) in python with torch 1.12.1 and examined the execution? I'd be interested to know if you see the same behavior
For the other models where performance regression is not as noticeable, do you see the same behavior when profiling? Are there missing fused ops where you expect them?
Are the input shapes identical across load tests for the different versions? From the doc you linked, it says "Fusion groups are only legal to run when the input shapes are exactly the same as we saw during profiling runs (they were encoded in the JIT IR before the fuser pass)".

If possible, would you be able to share the torchscript model that is leading to this behavior (or a similar model so that we can reproduce easily).

jestiny0 commented 1 year ago

You mentioned that you are generating the torchscript model with torch=1.10.0 and then trying to use torch=1.12.1/1.11.0 to run the model. Have you tried generating the torchscript model using the version of torch you plan to serve it with? It would be good to know whether you see performance decrease when you compile the model with torch1.12.1 and run it with the same torch1.12.1. This is what I likely think the issue is, though you mentioned using torch1.11 doesn't lead to the behavior

I have tried generating the ts model using the same version of torch to serve, but unfortunately, the results were the same. Besides, I tried combinations:

torchscript 1.10.0 + serving 1.10.0 + DJL 1.15.0, works well
torchscript 1.10.0 + serving 1.11.0 + DJL 1.17.0, works well
torchscript 1.10.0 + serving 1.11.0/1.12.0 + DJL 1.19.0, not well
torchscript 1.11.0/1.12.0 + serving 1.11.0/1.12.0 + DJL 1.19.0, not well

Related to this, have you tried running the torchscript model (compiled on 1.10.0) in python with torch 1.12.1 and examined the execution? I'd be interested to know if you see the same behavior

both 1.10.0 and 1.12.1 examined the same executions

For the other models where performance regression is not as noticeable, do you see the same behavior when profiling? Are there missing fused ops where you expect them?

they all miss fused ops in 1.11.0/1.12.0 with djl 1.18/1.19

Are the input shapes identical across load tests for the different versions? From the doc you linked, it says "Fusion groups are only legal to run when the input shapes are exactly the same as we saw during profiling runs (they were encoded in the JIT IR before the fuser pass)".

I always keep the same loadtest driver

@siddvenk

jestiny0 commented 1 year ago

@siddvenk My key doubtful point is that high djl versions (1.18.0 and 1.19.0) do not trigger the fuser optimizations, and you can try a simple model mentioned in this doc.

import torch

def foo(a):
    b = torch.conv2d(a, torch.randn(1, 1, 1, 1)) # not fusible
    x = torch.mul(b, b)                          # fusible
    y = torch.sin(x)                             # fusible
    z = torch.mul(y, y)                          # fusible
    return z

torch._C._jit_override_can_fuse_on_cpu(True)

a = torch.randn(1, 1, 128, 128)

scripted = torch.jit.script(foo)

# do several runs:
for _ in range(10):
    scripted(a)

and you can start djl engine with the env arg PYTORCH_JIT_LOG_LEVEL="tensorexpr_fuser.cpp"

You will see the fused process logs in the lower versions (before 1.17.0) and not in the higher versions (1.18.0/1.19.0).

So I want to know why DJL added this line for the newer versions. I suppose it will disable the fused process. torch::jit::GraphOptimizerEnabledGuard no_optimizer_guard{false};

I don't know if you have reffered this code , but I found this line is unnecessary and may cause some unexpected effects for DJL newer pytorch versions. (And it has existed since 2019 in the pytorch git repo)

siddvenk commented 1 year ago

Thanks for the updates.

The GraphOptimizerEnabledGuard seems to do two things:

disables autograd (this is what we intended when we made the change)
disables graph optimization (explaining the behavior you are seeing since it will disable the fused op paths in the torchscript model)

I think the intended use of this guard is for native mobile applications (our android module), but it seems to be impacting performance outside of android as well.

Thanks for calling this out. Your comments and deep dive are very useful. We'll take a look at this issue and come up with a fix.

siddvenk commented 1 year ago

@jestiny0 we pushed a change out to disable this guard outside of the android scope. It should be available in our next snapshot release (happens nightly).

You can try using DJL version 0.20.0-SNAPSHOT + torch1.12.1 - please let us know how that goes for you.

jestiny0 commented 1 year ago

You can try using DJL version 0.20.0-SNAPSHOT + torch1.12.1 - please let us know how that goes for you.

Thanks @siddvenk , it works well!

jestiny0 commented 1 year ago

@siddvenk Do you have an expected release time of the new version (0.20.0) ?

siddvenk commented 1 year ago

Good to hear this worked for you!

We don't have a specific timeline in mind at the moment, but we typically release new DJL versions every 45-60 days. I would estimate that a new release will be available in the next 4-6 weeks.

deepjavalibrary / djl