aws-neuron / transformers-neuronx

Apache License 2.0

Very long compilation times for llama2 with batch size 4 #48

Closed · dacorvo closed this 7 months ago

dacorvo commented 9 months ago

With AWS Neuron SDK 2.14.1, I am experiencing very long compilation times for batch_size = 4 with the llama2 7B model.

I am using the following configurations:

|             | inf2.8xlarge | inf2.48xlarge |
|-------------|--------------|---------------|
| tp_degree   | 2            | 24            |
| n_positions | 2048         | 2048          |
| amp         | f16          | f16           |

With batch_size = 1 or 2, the model compiles in minutes with the -O1 option, but with batch_size = 4 compilation takes more than three hours.
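For reference, a minimal sketch of how such a configuration might be expressed. The `neuron_kwargs` helper is hypothetical and simply mirrors the table above; the actual `LlamaForSampling` invocation is shown only in comments, since it assumes the transformers-neuronx API and requires Inferentia2 hardware to run.

```python
def neuron_kwargs(instance_type, batch_size):
    """Hypothetical helper: build model kwargs from the table above."""
    tp = {"inf2.8xlarge": 2, "inf2.48xlarge": 24}[instance_type]
    return {
        "batch_size": batch_size,
        "tp_degree": tp,      # tensor-parallel degree, per the table
        "n_positions": 2048,  # maximum sequence length
        "amp": "f16",         # fp16 automatic mixed precision
    }

kwargs = neuron_kwargs("inf2.8xlarge", batch_size=4)
print(kwargs)

# On a Neuron instance, one would then (assumption: transformers-neuronx API):
#   import os
#   os.environ["NEURON_CC_FLAGS"] = "-O1"  # the -O1 option mentioned above
#   from transformers_neuronx.llama.model import LlamaForSampling
#   model = LlamaForSampling.from_pretrained("llama-2-7b-split", **kwargs)
#   model.to_neuron()  # triggers compilation
```

With this layout, switching between the two instance types in the table is a one-argument change.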

awsilya commented 9 months ago

ack, asking compiler team to take a look

aws-donkrets commented 9 months ago

Hi dacorvo - We have root-caused this issue with the llama2 batch 4 config. The fix will be in the next Neuron SDK release.

dacorvo commented 9 months ago

This is excellent news! Thanks for the update.

jeffhataws commented 7 months ago

This issue was fixed in Neuron SDK release 2.15.