Closed by dacorvo 1 year ago
I did more tests, changing the default compiler optimization option from `-O2` to `-O1`, and I am able to use the same configurations I used with `transformers-neuronx==0.6.106`: `batch_size=1` and `n_positions=2048`.

During inference, the device memory is at 64 GB for the 13B model and 22 GB for the 7B model.

I also tested with `-O3`, but got the same kind of errors.
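For context, a common way to select the `neuronx-cc` optimization level is to pass it through the `NEURON_CC_FLAGS` environment variable before compiling (a sketch; the exact mechanism and flag spelling may vary across Neuron SDK releases):

```python
import os

# Request optimization level 1 from neuronx-cc for subsequent compilations.
# Assumption: the compilation path in use honors NEURON_CC_FLAGS; on some
# SDK versions the long form is "--optlevel 1".
os.environ["NEURON_CC_FLAGS"] = "-O1"

print(os.environ["NEURON_CC_FLAGS"])
```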
Thank you for reporting this. We are replicating the issue on our end and will get back with a fix.
Hi, what throughput (tokens/sec) did you get on the 7B model?
With the `2.14.1` compiler (`neuronx-cc`), I am able to compile the llama2 7B model with `-O1` for different batch sizes.

I tested several combinations of cores / batch size with the default maximum sequence length for the llama model (`2048`).
Here are the results:
| cores/batch | 128 tokens | 512 tokens | 1024 tokens | 2048 tokens | Throughput |
|-------------|------------|------------|-------------|-------------|--------------|
| 2c / bs2 | 8.5 s | 34 s | 69 s | 143 s | 29 tokens/s |
| 2c / bs4 | 8.6 s | 35 s | 72 s | 150 s | 55 tokens/s |
| 24c / bs2 | 1.3 s | 5.4 s | 11.5 s | 22.8 s | 180 tokens/s |
| 24c / bs4 | 1.4 s | 5.8 s | 11.5 s | 24 s | 341 tokens/s |
Note: I experienced extremely long compilation times for batch size 4 (more than 3 hours), even with `-O1`, whereas it takes only minutes for batch size 1 or 2.
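For reference, the throughput column is consistent with `batch_size * tokens / latency` computed on the 2048-token runs:

```python
# Sanity-check the throughput column of the table above: total tokens
# generated (batch_size * 2048) divided by the 2048-token latency.
runs = {
    "2c / bs2": (2, 143.0),
    "2c / bs4": (4, 150.0),
    "24c / bs2": (2, 22.8),
    "24c / bs4": (4, 24.0),
}
for name, (batch_size, latency_s) in runs.items():
    throughput = batch_size * 2048 / latency_s
    print(f"{name}: {throughput:.0f} tokens/s")
```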
@dacorvo thank you for confirming. Yes, batch 4 compilation time is an issue, we are working on it and it's been tracked elsewhere. I'm closing this one.
> closing
On most open-source projects, issues are closed only when they have been resolved, so that users:
Hi @dacorvo:
We confirmed that the Llama 7B compilation error you reported is fixed in the 2.15.2 Release. Can you install the latest Neuron SDK and try re-running your script to confirm that you no longer see compilation issues for this model?
I was previously able to compile llama 2 7B using tensor parallelism on 2 Neuron Cores, with the default `n_positions=2048` and a `batch_size=1`.

With `transformers-neuronx==0.7.84` and `neuronx-cc==2.10.0.34`, I get the following error:

I only managed to compile the model properly by setting `batch_size=1` and `n_positions=784`. With that configuration, the device memory during inference reported by `neuron-top` is at 20.4 GB (out of 16 GB x 2 cores = 32 GB).

I did another test, this time splitting the model on 24 Neuron Cores, and faced the same error. In that configuration, however, I managed to get up to `n_positions=1536`.

If I try to estimate the KV cache memory requirements for the 7B model, knowing that:
- each layer caches keys and values for every token, i.e. 2 x hidden_size x token_byte_size = 2 x 4096 x 2 bytes = 16 KB per token per layer,
- the model has 32 layers, giving a per-token cache size of 32 x 16 KB = 512 KB,

it gives n_positions * token_size = 2048 * 512 KB = 1 GB per batch.

Considering that:
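The estimate above can be spelled out explicitly (assuming the standard llama 2 7B shape: 32 layers, `hidden_size=4096`, fp16 cache entries):

```python
# KV cache size estimate for llama 2 7B with fp16 (2-byte) cache entries.
hidden_size = 4096
n_layers = 32
bytes_per_value = 2  # fp16

# Keys and values cached for each token, per layer: 2 x 4096 x 2 = 16 KiB.
per_token_per_layer = 2 * hidden_size * bytes_per_value
# Across all 32 layers: 512 KiB per token.
per_token = n_layers * per_token_per_layer

n_positions = 2048
cache_per_batch = n_positions * per_token
print(cache_per_batch / 2**30, "GiB")  # 1.0 GiB per batch
```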
As a final note, I face the same kind of errors when using the larger llama 2 13B model. I was previously able to compile and run it just fine on 24 Neuron Cores with `n_positions=2048` and `batch_size=2`, but now I only manage to run it with `n_positions=1024` and `batch_size=1`.