aws-neuron / transformers-neuronx

Apache License 2.0

Compilation errors for llama 2 models #45

Closed: dacorvo closed this issue 1 year ago

dacorvo commented 1 year ago

I was previously able to compile llama 2 7B using tensor parallelism on 2 Neuron Cores, with the default n_positions=2048 and a batch_size=1.

With transformers-neuronx==0.7.84 and neuronx-cc==2.10.0.34, I get the following error:

```
RuntimeError: Failed compilation with ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/neuroncc_compile_workdir/5c616a4d-b0cc-4c8d-8768-df08facd8aec/model.MODULE_875d0cfab1be718dcdb8+8737852b.hlo.pb', '--output', '/tmp/neuroncc_compile_workdir/5c616a4d-b0cc-4c8d-8768-df08facd8aec/model.MODULE_875d0cfab1be718dcdb8+8737852b.neff', '--model-type=transformer', '--model-type=transformer', '--verbose=35']: 2023-09-18T14:15:06Z Too many instructions after unroll for function sg0000 !
```
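
For context, the compilation is triggered by a setup along these lines (a minimal sketch following the transformers-neuronx Llama sampling example; the checkpoint path is a placeholder, not the exact script I ran):

```python
from transformers_neuronx.llama.model import LlamaForSampling

# Sketch only: './llama-2-7b-split' is a placeholder for a checkpoint prepared
# with save_pretrained_split(); the parameters match the configuration described above.
model = LlamaForSampling.from_pretrained(
    './llama-2-7b-split',
    batch_size=1,
    n_positions=2048,   # default maximum sequence length
    tp_degree=2,        # tensor parallelism over 2 Neuron Cores
    amp='f16',
)
model.to_neuron()       # invokes neuronx-cc and fails with the error above
```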

I only managed to compile the model properly by setting batch_size=1 and n_positions=784.

With that configuration, neuron-top reports 20.4 GB of device memory in use during inference (out of 16 GB x 2 cores = 32 GB).

I did another test, this time splitting the model across 24 Neuron Cores, and faced the same error. In that configuration, however, I managed to get up to n_positions=1536.

If I try to estimate the KV cache memory requirements for the 7B model, knowing that the standard Llama 2 7B configuration has 32 decoder layers and a hidden size of 4096, and that keys and values are cached in float16 (2 bytes each), each cached token takes 2 * 32 * 4096 * 2 bytes = 512 KB.

It gives n_positions * token_size = 2048 * 512 KB = 1 GB per batch.

Considering that this is only a small fraction of the available device memory, the failures above do not look like a simple memory limitation.
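
As a sanity check on that arithmetic, here is a minimal sketch assuming the standard Llama 2 7B shape (32 decoder layers, hidden size 4096) and a float16 cache:

```python
# KV cache estimate for llama 2 7B: keys and values for every layer and position.
n_layers = 32            # decoder layers in llama 2 7B
hidden_size = 4096       # model hidden dimension
bytes_per_value = 2      # float16
n_positions = 2048
batch_size = 1

token_size = 2 * n_layers * hidden_size * bytes_per_value   # K + V per cached token
kv_cache_bytes = batch_size * n_positions * token_size

print(f"per-token cache: {token_size // 1024} KB")            # 512 KB
print(f"total KV cache:  {kv_cache_bytes / 1024**3:.1f} GB")  # 1.0 GB
```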

As a final note, I face the same kind of errors when using the larger llama 2 13B model. I was previously able to compile and run it just fine on 24 Neuron Cores for n_positions=2048 and batch_size=2, but now I only manage to run it with n_positions=1024 and batch_size=1.

dacorvo commented 1 year ago

I did more tests, changing the compiler optimization level from the default -O2 to -O1, and with -O1 I am able to use the same configuration that worked with transformers-neuronx==0.6.106: batch_size=1 and n_positions=2048.

During inference, the device memory usage is 64 GB for the 13B model and 22 GB for the 7B model.

I also tested with -O3 but got the same kind of errors.
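
A sketch of how the optimization level can be lowered, assuming compiler flags are forwarded through the NEURON_CC_FLAGS environment variable as in the transformers-neuronx examples:

```python
import os

# Assumption: neuronx-cc flags are picked up from NEURON_CC_FLAGS at compile time.
# -O1 lowers the optimization level from the default -O2.
os.environ["NEURON_CC_FLAGS"] = "-O1"

# ... then load the model and call to_neuron() exactly as before.
```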

aws-rhsoln commented 1 year ago

Thank you for reporting the issue. We are replicating the issue on our end and will get back with a fix.

santhoshkolloju commented 1 year ago

Hi, what throughput (tokens/sec) did you get on the 7B model?

dacorvo commented 1 year ago

With the 2.14.1 compiler (neuronx-cc), I am able to compile the llama 2 7B model with -O1 for different batch sizes.

I tested several combinations of cores / batch size with the default maximum sequence length for the llama model (2048).

Here are the results (end-to-end generation time for the given number of new tokens, and the resulting throughput for the full 2048-token generation):

| cores/batch | 128 tokens | 512 tokens | 1024 tokens | 2048 tokens | Throughput   |
|-------------|------------|------------|-------------|-------------|--------------|
| 2c / bs2    | 8.5 s      | 34 s       | 69 s        | 143 s       | 29 tokens/s  |
| 2c / bs4    | 8.6 s      | 35 s       | 72 s        | 150 s       | 55 tokens/s  |
| 24c / bs2   | 1.3 s      | 5.4 s      | 11.5 s      | 22.8 s      | 180 tokens/s |
| 24c / bs4   | 1.4 s      | 5.8 s      | 11.5 s      | 24 s        | 341 tokens/s |

Note: I experienced extremely long compilation times for batch size 4 (more than 3 hours), even with -O1, whereas it takes only minutes for batch size 1 or 2.
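
Timings of this kind can be gathered with a loop of roughly the following shape (a hypothetical sketch assuming the model object from the earlier snippet and the transformers-neuronx sample() API; not the exact benchmark script):

```python
import time
import torch

batch_size, prompt_len, sequence_length = 2, 16, 2048
# Dummy prompt tokens; a real benchmark would tokenize an actual prompt.
input_ids = torch.randint(0, 32000, (batch_size, prompt_len))

start = time.time()
with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=sequence_length)
elapsed = time.time() - start

new_tokens = batch_size * (sequence_length - prompt_len)
print(f"{elapsed:.1f} s total, {new_tokens / elapsed:.0f} tokens/s")
```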

awsilya commented 1 year ago

@dacorvo thank you for confirming. Yes, the batch size 4 compilation time is an issue; we are working on it, and it is tracked elsewhere. I'm closing this one.

awsilya commented 1 year ago

closing

dacorvo commented 1 year ago

On most open-source projects, issues are closed only when they have been resolved, so that users can tell whether the problem is still outstanding and track the fix.

hannanjgaws commented 9 months ago

Hi @dacorvo:

We confirmed that the Llama 7B compilation error you reported is fixed in the 2.15.2 Release. Can you install the latest Neuron SDK and try re-running your script to confirm that you no longer see compilation issues for this model?