deepjavalibrary / djl

An Engine-Agnostic Deep Learning Framework in Java
https://djl.ai
Apache License 2.0

PyTorch -> TorchScript conversion #2340

Closed pds2208 closed 3 months ago

pds2208 commented 1 year ago

Description

Would it be possible to provide the code you used to generate the TorchScript version of, in particular, the Whisper model? At the moment, the demo uses the small.en model, and it would be very useful to be able to use the multilingual models as well as the other model sizes.

Will this change the current api? How?

No

Who will benefit from this enhancement?

Everyone

lanking520 commented 1 year ago

Here: http://docs.djl.ai/master/examples/docs/whisper_speech_text.html#trace-the-model

pds2208 commented 1 year ago

Thanks! But now it's looking for mel_80_filters.npz. Where do I find this?

lanking520 commented 1 year ago

Line 42 of https://github.com/deepjavalibrary/djl/blob/master/extensions/audio/src/main/java/ai/djl/audio/processor/LogMelSpectrogram.java shows how it is generated.

frankfliu commented 1 year ago

@pds2208 You can find the file in whisper_en.zip. If you want to package your own model, you can use whisper_en.zip as an example.

You can also download it directly from: https://resources.djl.ai/demo/pytorch/whisper/mel_80_filters.npz
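
For reference, a minimal sketch of loading that file with DJL, assuming it was saved locally as mel_80_filters.npz (NDList.decode understands the .npz container, which appears to be the same path LogMelSpectrogram takes):

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import ai.djl.ndarray.NDArray;
import ai.djl.ndarray.NDList;
import ai.djl.ndarray.NDManager;

public class LoadMelFilters {
    public static void main(String[] args) throws Exception {
        try (NDManager manager = NDManager.newBaseManager();
                InputStream is = Files.newInputStream(Paths.get("mel_80_filters.npz"))) {
            // Decode the .npz archive; the filter bank is the first array in it
            NDArray melFilters = NDList.decode(manager, is).get(0);
            // For Whisper's 80-bin filter bank this should print (80, 201)
            System.out.println(melFilters.getShape());
        }
    }
}
```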

pds2208 commented 1 year ago

Thanks guys. Working fine but very, very slow. It takes ~40s to translate a small audio file. The same file takes ~5s using the Python implementation. Any way to speed this up?

pds2208 commented 1 year ago

Well, this is interesting. Setting JniUtils.setGraphExecutorOptimize(false), as described on this page:

https://djl.ai/docs/development/inference_performance_optimization.html

reduced the time from ~40s to ~10s for the exact same file! What is going on?

frankfliu commented 1 year ago

jit::setGraphExecutorOptimize() profiles your model during the first couple of runs and then tries to reduce latency for the remaining inferences. Some models see a significant latency improvement; some don't.

With setGraphExecutorOptimize turned on, you usually see much higher latency for the first and second inference on each thread. I created a PR to turn off setGraphExecutorOptimize in the example: https://github.com/deepjavalibrary/djl/pull/2341
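
For anyone hitting the same slowdown, a minimal sketch of applying the setting (per the DJL performance docs it is a per-thread setting, so call it on the thread that runs inference, before the first forward pass):

```java
import ai.djl.pytorch.jni.JniUtils;

public class DisableGraphOptimizer {
    public static void main(String[] args) {
        // Skip the JIT graph executor's profiling/optimization passes so the
        // first few inferences don't pay the warm-up cost. Call this on each
        // inference thread before the first predict() call.
        JniUtils.setGraphExecutorOptimize(false);
        // ... load the model and run inference as usual
    }
}
```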

pds2208 commented 1 year ago

Thanks. I’m trying to figure out why it’s half the speed of the Python version, which forks off ffmpeg to run the audio conversion. It should be at least as fast…

frankfliu commented 1 year ago

On my mac, once the model is warmed up, it only takes 4s to run inference.

pds2208 commented 1 year ago

Oh wow. The example closes the model after each use. I tried keeping it open but received an error "Native resource has been release already". How do I keep the model open to reuse it?

frankfliu commented 1 year ago

Please get the latest code; I just fixed the issue in the previous PR.
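
The reuse pattern itself looks roughly like this (the model path is illustrative, and plain NDList in/out is used here; the Whisper example adds a translator on top). The key point is that the ZooModel is loaded once and stays open, while a lightweight Predictor is created per use:

```java
import java.nio.file.Paths;

import ai.djl.inference.Predictor;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class ReuseModel {
    public static void main(String[] args) throws Exception {
        Criteria<NDList, NDList> criteria =
                Criteria.builder()
                        .setTypes(NDList.class, NDList.class)
                        .optModelPath(Paths.get("build/whisper")) // illustrative path
                        .optEngine("PyTorch")
                        .build();

        // Load the model once and keep it open for the life of the application
        try (ZooModel<NDList, NDList> model = criteria.loadModel()) {
            // Predictors are cheap; create one per request or per thread
            try (Predictor<NDList, NDList> predictor = model.newPredictor()) {
                // call predictor.predict(input) as many times as needed
            }
        } // closing the model is what releases the native resources
    }
}
```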

pds2208 commented 1 year ago

Thanks. Grabbed the latest code. Still 10s here.

lanking520 commented 1 year ago

@pds2208 I would suggest isolating the audio-processing part and benchmarking just the inference. Could you try saving the audio into an .npz file, loading it in both Java and Python (DJL's NDList can load an .npz), and measuring how long inference takes? Ideally, Java and Python should perform similarly.
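
A sketch of the Java half of that experiment, assuming the pre-computed features were saved from Python as input.npz (file names and model path are illustrative): decode the features once, run a warm-up pass, then time inference alone:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import ai.djl.inference.Predictor;
import ai.djl.ndarray.NDList;
import ai.djl.repository.zoo.Criteria;
import ai.djl.repository.zoo.ZooModel;

public class InferenceTiming {
    public static void main(String[] args) throws Exception {
        Criteria<NDList, NDList> criteria =
                Criteria.builder()
                        .setTypes(NDList.class, NDList.class)
                        .optModelPath(Paths.get("build/whisper")) // illustrative path
                        .optEngine("PyTorch")
                        .build();

        try (ZooModel<NDList, NDList> model = criteria.loadModel();
                Predictor<NDList, NDList> predictor = model.newPredictor();
                InputStream is = Files.newInputStream(Paths.get("input.npz"))) {
            // Load the pre-computed features so audio processing is excluded
            NDList input = NDList.decode(model.getNDManager(), is);

            predictor.predict(input); // warm-up run

            long start = System.nanoTime();
            predictor.predict(input);
            System.out.printf("inference: %d ms%n", (System.nanoTime() - start) / 1_000_000);
        }
    }
}
```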