junoriosity opened this issue 1 year ago
@junoriosity thank you for your inquiry.
The delay is likely caused by model compilation: a model needs to be compiled before it can run on Neuron. For inference, a model can be precompiled and serialized as described here:
Unfortunately, we don't support OPT serialization yet, but it will be available in a future release.
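For models that do support serialization, the precompile-and-serialize flow looks roughly like this (a minimal sketch using `torch_neuronx.trace`, with a toy `torch.nn.Module` standing in for OPT; shapes and file names are illustrative):

```python
import torch
import torch_neuronx

# Toy model standing in for OPT (OPT serialization is not yet
# supported, as noted above).
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example_input = torch.rand(1, 128)

# Ahead-of-time compilation for Neuron -- this is the slow step.
traced = torch_neuronx.trace(model, example_input)

# Serialize the compiled model so later runs skip compilation entirely.
torch.jit.save(traced, "model_neuron.pt")

# At inference time, just load the precompiled artifact.
loaded = torch.jit.load("model_neuron.pt")
print(loaded(example_input).shape)
```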
@awsilya What was also a bit disappointing is that the actual performance was no better than running these OPT models on a G4 GPU instance ... sometimes it was even slower.
Do you have an idea why that is?
Hello @junoriosity,
There are numerous reasons why performance on inf2 may differ from a GPU instance. Under the hood, they are fundamentally different architectures with different strengths. I'd encourage you to take a look at the techniques described in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-debug.html to see whether there are any obvious performance optimizations available.
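As a starting point for comparing the two, a simple latency harness along these lines can help (a sketch, not an official tool; warmup and iteration counts are arbitrary):

```python
import time
import torch

def benchmark_latency(model, example_input, warmup=10, iters=100):
    """Average single-request latency in milliseconds.

    Note: on a CUDA device you would also call torch.cuda.synchronize()
    before reading the clock; a traced Neuron model returns its results
    synchronously when called, so no extra sync should be needed there.
    """
    with torch.no_grad():
        for _ in range(warmup):
            model(example_input)
        start = time.perf_counter()
        for _ in range(iters):
            model(example_input)
    return (time.perf_counter() - start) / iters * 1000.0
```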
I managed to use the sample notebook to compile and load an OPT model on inf2 chips. :slight_smile: However, at one point I load the network and move it to Neuron, and if I take a smaller model and increase the batch size, that step can take ages (20 minutes or so).
Since I am trying to dockerize my network, can I somehow speed that up so that my containers start up quickly on Kubernetes?
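One common pattern (a sketch under the assumptions above, not an official recommendation) is to compile once, serialize the artifact into the container image, and have the entrypoint load it if present so pods skip compilation at startup; `build_model` and the model path here are hypothetical:

```python
import os
import torch
import torch_neuronx

# Hypothetical path where the precompiled model is baked into the image.
MODEL_PATH = "/opt/ml/model/model_neuron.pt"

def load_or_compile(example_input):
    """Load a precompiled Neuron model if one was baked into the image;
    otherwise compile once and cache the result on disk."""
    if os.path.exists(MODEL_PATH):
        # Fast path: no compilation, so the pod becomes ready quickly.
        return torch.jit.load(MODEL_PATH)
    model = build_model().eval()  # hypothetical constructor for your network
    traced = torch_neuronx.trace(model, example_input)
    torch.jit.save(traced, MODEL_PATH)
    return traced
```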