Closed by dacorvo 1 year ago
I did more tests, changing the default compiler optimization option from `-O2` to `-O1`, and I am able to use the same configurations I used with `transformers-neuronx==0.6.106`: `batch_size=1` and `n_positions=2048`.

During inference, the device memory is at 64 GB for the 13B model and 22 GB for the 7B model.

I also tested with `-O3`, but got the same kind of errors.
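For context, a common way to select the `neuronx-cc` optimization level is to pass it through the `NEURON_CC_FLAGS` environment variable before compiling (a sketch; the exact mechanism and flag spelling may vary across Neuron SDK releases):

```python
import os

# Request optimization level 1 from neuronx-cc for subsequent compilations.
# Assumption: the compilation path in use honors NEURON_CC_FLAGS; on some
# SDK versions the long form is "--optlevel 1".
os.environ["NEURON_CC_FLAGS"] = "-O1"

print(os.environ["NEURON_CC_FLAGS"])
```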
Thank you for reporting this. We are replicating the issue on our end and will get back with a fix.
Hi, what throughput (tokens/sec) did you get on the 7B model?
With the `2.14.1` compiler (`neuronx-cc`), I am able to compile the llama2 7B model with `-O1` for different batch sizes.

I tested several combinations of cores / batch size with the default maximum sequence length for the llama model (`2048`).
Here are the results:
| cores/batch | 128 tokens | 512 tokens | 1024 tokens | 2048 tokens | Throughput |
|-------------|------------|------------|-------------|-------------|--------------|
| 2c / bs2 | 8.5 s | 34 s | 69 s | 143 s | 29 tokens/s |
| 2c / bs4 | 8.6 s | 35 s | 72 s | 150 s | 55 tokens/s |
| 24c / bs2 | 1.3 s | 5.4 s | 11.5 s | 22.8 s | 180 tokens/s |
| 24c / bs4 | 1.4 s | 5.8 s | 11.5 s | 24 s | 341 tokens/s |
Note: I experienced extremely long compilation times for batch size 4 (more than 3 hours), even with `-O1`, whereas it takes only minutes for batch size 1 or 2.
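For reference, the throughput column is consistent with `batch_size * tokens / latency` computed on the 2048-token runs:

```python
# Sanity-check the throughput column of the table above: total tokens
# generated (batch_size * 2048) divided by the 2048-token latency.
runs = {
    "2c / bs2": (2, 143.0),
    "2c / bs4": (4, 150.0),
    "24c / bs2": (2, 22.8),
    "24c / bs4": (4, 24.0),
}
for name, (batch_size, latency_s) in runs.items():
    throughput = batch_size * 2048 / latency_s
    print(f"{name}: {throughput:.0f} tokens/s")
```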
@dacorvo thank you for confirming. Yes, batch 4 compilation time is an issue, we are working on it and it's been tracked elsewhere. I'm closing this one.
> closing
On most open-source projects, issues are closed only when they have been resolved, so that users:
Hi @dacorvo:
We confirmed that the Llama 7B compilation error you reported is fixed in the 2.15.2 Release. Can you install the latest Neuron SDK and try re-running your script to confirm that you no longer see compilation issues for this model?
I was previously able to compile llama 2 7B using tensor parallelism on 2 Neuron Cores, with the default `n_positions=2048` and a `batch_size=1`.

With `transformers-neuronx==0.7.84` and `neuronx-cc==2.10.0.34`, I get the following error:

I only managed to compile the model properly by setting `batch_size=1` and `n_positions=784`. With that configuration, the device memory during inference reported by `neuron-top` is at 20.4 GB (out of 16 GB x 2 cores = 32 GB).

I did another test, this time splitting the model on 24 Neuron Cores, and faced the same error. In that configuration, however, I managed to get up to `n_positions=1536`.

If I try to estimate the KV cache memory requirements for the 7B model, knowing that:
- each layer caches keys and values for every token, i.e. 2 x hidden_size x token_byte_size = 2 x 4096 x 2 bytes = 16 KB per token per layer,
- the model has 32 layers, giving a per-token cache size of 32 x 16 KB = 512 KB,

it gives n_positions * token_size = 2048 * 512 KB = 1 GB per batch.

Considering that:
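The estimate above can be spelled out explicitly (assuming the standard llama 2 7B shape: 32 layers, `hidden_size=4096`, fp16 cache entries):

```python
# KV cache size estimate for llama 2 7B with fp16 (2-byte) cache entries.
hidden_size = 4096
n_layers = 32
bytes_per_value = 2  # fp16

# Keys and values cached for each token, per layer: 2 x 4096 x 2 = 16 KiB.
per_token_per_layer = 2 * hidden_size * bytes_per_value
# Across all 32 layers: 512 KiB per token.
per_token = n_layers * per_token_per_layer

n_positions = 2048
cache_per_batch = n_positions * per_token
print(cache_per_batch / 2**30, "GiB")  # 1.0 GiB per batch
```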
As a final note, I face the same kind of errors when using the larger llama 2 13B model. I was previously able to compile and run it just fine on 24 Neuron Cores with `n_positions=2048` and `batch_size=2`, but now I only manage to run it with `n_positions=1024` and `batch_size=1`.