aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Fast loading of neural network to Inf2 #8

Open · junoriosity opened this issue 1 year ago

junoriosity commented 1 year ago

I managed to run the notebook from the samples and load an OPT model onto Inf2 chips. 🙂

However, at one point I have to load the network and move it onto Neuron:

```python
from transformers_neuronx.opt.model import OPTForSampling

neuron_model = OPTForSampling.from_pretrained('./opt-13b-split', batch_size=2, tp_degree=2, amp='f16')
neuron_model.to_neuron()
```

Even if I take a smaller model and increase the batch size, this can take ages (20 minutes or so).
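To see which of the two calls is slow, they can be timed separately (a minimal sketch using the standard-library time module):

```python
import time
from transformers_neuronx.opt.model import OPTForSampling

start = time.perf_counter()
neuron_model = OPTForSampling.from_pretrained('./opt-13b-split', batch_size=2, tp_degree=2, amp='f16')
print(f"from_pretrained took {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
neuron_model.to_neuron()  # the slow step in my runs (~20 minutes)
print(f"to_neuron took {time.perf_counter() - start:.1f}s")
```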

Since I am trying to dockerize my model, can I somehow speed this up so that my containers start up quickly on Kubernetes?

awsilya commented 1 year ago

@junoriosity thank you for your inquiry.

The delay is likely caused by model compilation: a model needs to be compiled before it can run on Neuron. For inference, a model can be precompiled and serialized as described here:

https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/transformers-neuronx/readme.html?highlight=transformers-neuronx#serialization-support

Unfortunately, we don't support OPT serialization yet, but it will be available in a future release.
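For model families where serialization is already supported (GPT-2, for example), the flow looks roughly like the sketch below; it is based on the linked readme, so please double-check the exact save/load method names against your installed release:

```python
from transformers_neuronx.gpt2.model import GPT2ForSampling

# First run: compile once and persist the compiled artifacts.
neuron_model = GPT2ForSampling.from_pretrained('./gpt2-split', batch_size=2, tp_degree=2, amp='f16')
neuron_model.to_neuron()                  # slow: triggers compilation
neuron_model.save('./gpt2-neuron-cache')  # save compiled artifacts

# Later runs (e.g., at container startup): reuse the artifacts instead of recompiling.
neuron_model = GPT2ForSampling.from_pretrained('./gpt2-split', batch_size=2, tp_degree=2, amp='f16')
neuron_model.load('./gpt2-neuron-cache')  # load compiled artifacts
neuron_model.to_neuron()                  # fast: no recompilation needed
```

Baking the saved artifacts into the container image (or mounting them as a volume) is one way to keep Kubernetes startup fast.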

junoriosity commented 1 year ago

@awsilya What was also a bit disappointing is that the actual performance was no better than running these OPT models on a G4 GPU instance; sometimes it was even slower.

Do you have an idea why that is?

aws-taylor commented 1 month ago

Hello @junoriosity,

There are numerous reasons why performance on an Inf2 instance may differ from a GPU instance. Under the hood, they are fundamentally different architectures with different strengths. I'd encourage you to take a look at the techniques described in https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/programming-guide/training/pytorch-neuron-debug.html to see whether there are any obvious performance optimizations available.
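For an apples-to-apples comparison with the GPU baseline, it also helps to measure steady-state latency after a warm-up call. A minimal sketch follows; the sample() call and its sequence_length parameter follow the transformers-neuronx readme, so verify them against your release:

```python
import time
import torch

batch_size, prompt_len = 2, 128  # batch_size matches the earlier from_pretrained call
input_ids = torch.randint(0, 50265, (batch_size, prompt_len))  # 50265: OPT vocab size

with torch.inference_mode():
    # Warm-up pass, excluded from timing, so one-time setup doesn't skew the numbers.
    neuron_model.sample(input_ids, sequence_length=256)

    start = time.perf_counter()
    generated = neuron_model.sample(input_ids, sequence_length=256)
    print(f"steady-state sample latency: {time.perf_counter() - start:.2f}s")
```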