roushrsh opened 1 year ago
Batch predict with Keras.NET is 50 ms, still much slower, but OK. The problem there is that everything has to be built from NDArrays, which take 250 ms to prepare anyway...
Hi, what is the original data format of your app? The reason it takes 250 ms may be the memory copy when constructing the NDArray.
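For intuition on that memory-copy cost, here is a minimal NumPy sketch of the principle (NumPy only stands in for the NDArray types in Keras.NET/TF.NET, whose internals differ; the buffer sizes mirror the 4319x4 image from this thread):

```python
import array

import numpy as np

# A contiguous float32 buffer, shaped like the app's 4319x4 image.
raw = array.array("f", range(4319 * 4))

# Handing over an existing contiguous buffer is one bulk copy (cheap).
bulk = np.frombuffer(raw, dtype=np.float32).reshape(4319, 4)

# Building the array element by element forces thousands of small
# copies across the managed/native boundary (the slow path).
slow = np.empty((4319, 4), dtype=np.float32)
for i in range(4319):
    for j in range(4):
        slow[i, j] = raw[i * 4 + j]
```

If the source data already lives in one contiguous buffer, constructing the NDArray from that buffer in a single step avoids the per-element overhead.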
Yes, it is large: 4319x4x1, roughly equivalent to 70x70x3. Is there any way to bypass this? Prediction takes only 0.3 ms in Python, roughly three orders of magnitude faster. The CPU runs twice as fast as the GPU because of this.
Are you using TensorFlow.NET/TensorFlow.Keras or Keras.NET? They're different libraries.
Hi @Oceania2018 ,
I'm using TensorFlow.NET. My only imports are:
using Tensorflow;
using Tensorflow.Keras.Optimizers;
using static Tensorflow.Binding;
using static Tensorflow.KerasApi;
using static Tensorflow.TensorShapeProto.Types;
using Tensorflow.NumPy;
using System.Diagnostics;
using System.IO;
using System.Linq;
I have also tried Keras.NET before. It predicts faster when I use its 'PredictOnBatch' function (50 ms); however, converting to its NDArray format takes 250 ms for my 4319x4 image, so it's useless as well.
This may be because Python automatically converts part of the code into a static graph for execution, while .NET does not do this automatically, so the code needs to be optimized by hand. How to optimize depends on the actual situation and the specific code, but the basic principle is to use the AutoGraph annotation to convert dynamic eager code into a static graph.
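In Python terms, the static-graph conversion described above looks roughly like this (a hedged sketch: `fast_predict` and the shapes are stand-ins, not the actual YOLO model):

```python
import tensorflow as tf

# Wrapping the forward pass in tf.function makes TensorFlow trace it
# once into a static graph; subsequent calls with the same input
# signature reuse that graph instead of dispatching op-by-op eagerly.
@tf.function(input_signature=[tf.TensorSpec(shape=[None, 4], dtype=tf.float32)])
def fast_predict(x):
    # Stand-in for model(x, training=False).
    return tf.sigmoid(tf.reduce_sum(x, axis=1))

result = fast_predict(tf.zeros([8, 4]))  # first call traces; later calls skip retracing
```

Pinning `input_signature` matters here: without it, each new input shape triggers a fresh (slow) retrace.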
Thanks, do you have a link to where I can read how to do that, or how to convert my model? I tried ONNX/MicrosoftML as well and had the same issue.
My model is just a YOLO model with 9 inputs: [4319x1, 4319x1, 4319x1, 4319x1, 1, 1, 1, 1, 1], YOLO CNNs (800,000); the output is a single sigmoid regression.
@roushrsh There is a blog about how to use the AutoGraph annotation. You can use this approach to optimize the prediction part.
Thanks @Oceania2018. I will try it and get back to you.
Looking into it more, I believe it's the 'predict on batch' capability that doesn't exist in TensorFlow.NET. Keras.NET implemented it at some point and it gives much closer results (50 ms, versus 300 ms when plain predict is called; in Python the difference is 33 vs 210). However, I can't use Keras.NET, as its NDArray format is extremely slow to use.
Hi, it occurred to me today that since TensorFlow 2.11, GPU support for native Windows was dropped (see https://github.com/tensorflow/tensorflow/issues/59905). Unfortunately, the latest version of TF.NET uses the TF 2.11 library, so if you use the latest TF.NET on Windows, performance may be worse than expected.
Oh wow, that could be it. Should I just test it on Linux or similar, or is there a solution?
Sorry, I was wrong: tensorflow.redist has the CPU version updated to 2.11, but the GPU version is still 2.10. Please ignore what I said (still, updating the GPU redist to 2.11 would cause that problem, but it's not the point of this issue). Could you provide an example that reproduces the problem? The performance issue is high priority for us, and I'll try to find a solution for it.
Brief Description
Hi,
My Elapsed Time is 368 ms
The same prediction takes 0.3 ms in Python (about 10% more when plain predict is called). (It's ~15000 ms with the CPU in C#, so the GPU is definitely being used.)
If I run:
var T3 = model.predict((xx1, xx5), batch_size: 128);
T3 = model.predict((xx1, xx5), batch_size: 128);
T3 = model.predict((xx1, xx5), batch_size: 128); // out-of-memory error here: 'OOM when allocating tensor with shape..etc'
it runs out of memory by the second batch prediction. I have to reduce the batch size to 32 for it to make it through all three. Is there a way to free the memory after each prediction, or something else I must do?
Any suggestions?
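One common workaround for OOM on back-to-back predictions is to run the model over fixed-size slices so only one chunk's activations are resident at a time. A hedged sketch in Python (`predict_in_chunks` is a hypothetical helper, and `fake_predict` stands in for the real model's predict call):

```python
import numpy as np

def predict_in_chunks(predict_fn, data, chunk_size=32):
    # Run predict_fn on consecutive slices of `data`, so memory use is
    # bounded by one chunk's worth of activations rather than the
    # whole batch, then stitch the per-chunk outputs back together.
    parts = [predict_fn(data[i:i + chunk_size])
             for i in range(0, len(data), chunk_size)]
    return np.concatenate(parts)

# Toy stand-in: a sigmoid over row sums, shaped like the single-sigmoid output.
fake_predict = lambda batch: 1.0 / (1.0 + np.exp(-batch.sum(axis=1)))
preds = predict_in_chunks(fake_predict, np.zeros((100, 4)))
```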
C#:
model.predict((xx1), batch_size: 64); // 300 ms (GPU)
model.predict((xx1), batch_size: 64); // 15000 ms (CPU)

Python:
Model.predict(data, verbose=False)  # 40 ms
model(data, training=False)  # 0.3 ms, to remove overhead
It's just a yolo model.
Device and Context
13900k, 4090 RTX