roushrsh opened 1 year ago
Batch predict with Keras.NET is 50 ms, still much slower, but OK. The problem there is that everything has to be built from NDArrays, which take 250 ms to prepare anyway...
Hi, what is the original data format of your app? The reason it takes 250 ms may be the memory copy when constructing the NDArray.
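For intuition on that memory-copy cost, here is a minimal NumPy sketch of the principle (NumPy only stands in for the NDArray types in Keras.NET/TF.NET, whose internals differ; the buffer sizes mirror the 4319x4 image from this thread):

```python
import array

import numpy as np

# A contiguous float32 buffer, shaped like the app's 4319x4 image.
raw = array.array("f", range(4319 * 4))

# Handing over an existing contiguous buffer is one bulk copy (cheap).
bulk = np.frombuffer(raw, dtype=np.float32).reshape(4319, 4)

# Building the array element by element forces thousands of small
# copies across the managed/native boundary (the slow path).
slow = np.empty((4319, 4), dtype=np.float32)
for i in range(4319):
    for j in range(4):
        slow[i, j] = raw[i * 4 + j]
```

If the source data already lives in one contiguous buffer, constructing the NDArray from that buffer in a single step avoids the per-element overhead.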
Yes, it is large: 4319x4x1, roughly equivalent to 70x70x3. Is there any way to bypass this? Prediction takes only 0.3 ms in Python, roughly three orders of magnitude faster. The CPU runs twice as fast as the GPU because of this.
Are you using TensorFlow.NET/TensorFlow.Keras or Keras.NET? They're different libraries.
Hi @Oceania2018 ,
I'm using TensorFlow.NET. My only imports are:
using Tensorflow;
using Tensorflow.Keras.Optimizers;
using static Tensorflow.Binding;
using static Tensorflow.KerasApi;
using static Tensorflow.TensorShapeProto.Types;
using Tensorflow.NumPy;
using System.Diagnostics;
using System.IO;
using System.Linq;
I have also tried Keras.NET before. It predicts faster when I use its 'PredictOnBatch' function (50 ms); however, converting to its NDArray format takes 250 ms for my 4319x4 image, so it's useless as well.
This may be because Python automatically converts part of the code into a static graph for execution, while .NET does not do this automatically, so the code needs to be optimized by hand. How to optimize depends on the actual situation and the specific code, but the basic principle is to use the AutoGraph annotation to convert dynamic eager code into a static graph.
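In Python terms, the static-graph conversion described above looks roughly like this (a hedged sketch: `fast_predict` and the shapes are stand-ins, not the actual YOLO model):

```python
import tensorflow as tf

# Wrapping the forward pass in tf.function makes TensorFlow trace it
# once into a static graph; subsequent calls with the same input
# signature reuse that graph instead of dispatching op-by-op eagerly.
@tf.function(input_signature=[tf.TensorSpec(shape=[None, 4], dtype=tf.float32)])
def fast_predict(x):
    # Stand-in for model(x, training=False).
    return tf.sigmoid(tf.reduce_sum(x, axis=1))

result = fast_predict(tf.zeros([8, 4]))  # first call traces; later calls skip retracing
```

Pinning `input_signature` matters here: without it, each new input shape triggers a fresh (slow) retrace.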
Thanks, do you have a link to where I can read how to do that, or how to convert my model? I tried ONNX/MicrosoftML as well and had the same issue.
My model is just a YOLO model with 9 inputs: [4319x1, 4319x1, 4319x1, 4319x1, 1, 1, 1, 1, 1], YOLO CNNs (800,000); the output is a single sigmoid regression.
@roushrsh There is a blog about how to use the AutoGraph annotation. You can use this approach to optimize the prediction part.
Thanks @Oceania2018. I will try it and get back to you.
Looking into it more, I believe it's the 'predict on batch' capability that doesn't exist in TensorFlow.NET. Keras.NET implemented it at some point and it gives much closer results (50 ms, versus 300 ms when plain predict is called; in Python the difference is 33 vs 210). However, I can't use Keras.NET, as its NDArray format is extremely slow to use.
Hi, it occurred to me today that since TensorFlow 2.11, GPU support for native Windows was dropped (see https://github.com/tensorflow/tensorflow/issues/59905). Unfortunately, the latest version of TF.NET uses the TF 2.11 library, so if you use the latest TF.NET on Windows, performance may be worse than expected.
Oh wow, that could be it. Should I just test it on Linux or similar, or is there a solution?
Sorry, I was wrong: tensorflow.redist has the CPU version updated to 2.11, but the GPU version is still 2.10. Please ignore what I said (still, updating the GPU redist to 2.11 would cause that problem, but it's not the point of this issue). Could you provide an example that reproduces the problem? The performance issue is high priority for us, and I'll try to find a solution for it.
Brief Description
Hi,
My Elapsed Time is 368 ms
The same prediction takes 0.3 ms in Python (about 10% more when plain predict is called). (It's ~15000 ms with the CPU in C#, so the GPU is definitely being used.)
If I run:
var T3 = model.predict((xx1, xx5), batch_size: 128);
T3 = model.predict((xx1, xx5), batch_size: 128);
T3 = model.predict((xx1, xx5), batch_size: 128); // out-of-memory error here: 'OOM when allocating tensor with shape..etc'
it runs out of memory by the second batch prediction. I have to reduce the batch size to 32 for it to make it through all three. Is there a way to free the memory after each prediction, or something else I must do?
Any suggestions?
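One common workaround for OOM on back-to-back predictions is to run the model over fixed-size slices so only one chunk's activations are resident at a time. A hedged sketch in Python (`predict_in_chunks` is a hypothetical helper, and `fake_predict` stands in for the real model's predict call):

```python
import numpy as np

def predict_in_chunks(predict_fn, data, chunk_size=32):
    # Run predict_fn on consecutive slices of `data`, so memory use is
    # bounded by one chunk's worth of activations rather than the
    # whole batch, then stitch the per-chunk outputs back together.
    parts = [predict_fn(data[i:i + chunk_size])
             for i in range(0, len(data), chunk_size)]
    return np.concatenate(parts)

# Toy stand-in: a sigmoid over row sums, shaped like the single-sigmoid output.
fake_predict = lambda batch: 1.0 / (1.0 + np.exp(-batch.sum(axis=1)))
preds = predict_in_chunks(fake_predict, np.zeros((100, 4)))
```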
C#:
model.predict((xx1), batch_size: 64); // 300 ms (GPU)
model.predict((xx1), batch_size: 64); // 15000 ms (CPU)

Python:
Model.predict(data, verbose=False)  # 40 ms
model(data, training=False)  # 0.3 ms, to remove overhead
It's just a yolo model.
Device and Context
13900k, 4090 RTX