juliensimon opened 1 week ago
FWIW I see the same issue when running the deepjavalibrary/djl-serving:0.29.0-pytorch-inf2
container directly on an inf2.48xlarge instance, so this doesn't look like a SageMaker issue.
curl -X POST "http://127.0.0.1:8080/predictions/my_model" \
-H 'Content-Type: application/json' \
-d '{"inputs": "Deep learning is "}'
{"generated_text": " a000#import.........................."}
I am taking a look at this now and will get back to you asap.
Thanks @juliensimon - I have been able to reproduce this and found a workaround for the moment. The issue only occurs with a batch size of 1. As a short-term fix you can raise OPTION_MAX_ROLLING_BATCH_SIZE
to 2 for testing; I have verified values up to 16.
I am continuing to investigate the BS=1 scenario, since it runs a slightly different code path than higher batch sizes, and will work on a complete fix that can be back-ported.
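For anyone hitting this while running the container directly (as in the earlier comment), the workaround is a single environment variable. A minimal sketch, where the image tag matches the one reported in this thread and the other flags are illustrative (Neuron device mappings and the full model configuration are omitted):

```shell
# Apply the batch-size workaround when launching djl-serving directly.
# Only OPTION_MAX_ROLLING_BATCH_SIZE=2 is the actual workaround; the
# remaining flags are illustrative and your real deployment will need
# the Neuron device mappings and model options from the attached notebook.
docker run -it --rm \
  -p 8080:8080 \
  -e OPTION_MAX_ROLLING_BATCH_SIZE=2 \
  -e HF_MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct \
  deepjavalibrary/djl-serving:0.29.0-pytorch-inf2
```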
Hi Tyler,
Happy to confirm that OPTION_MAX_ROLLING_BATCH_SIZE=2 works.
Thank you for investigating!
Description
I'm deploying meta-llama/Meta-Llama-3.1-70B-Instruct on a SageMaker endpoint (see the attached notebook).
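For context, a minimal sketch of the container environment such a deployment typically uses. The variable names follow the DJL Serving LMI convention, but the specific values here (tensor-parallel degree, entrypoint) are assumptions; the attached notebook is authoritative for the actual configuration:

```python
# Hypothetical sketch of the LMI container environment for an inf2 deployment.
# All values below are assumptions for illustration, not the exact settings
# from the attached notebook.

def build_djl_env(max_rolling_batch_size: int = 1) -> dict:
    """Build the environment dict passed to the djl-serving container."""
    return {
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-70B-Instruct",
        # Assumed tensor-parallel degree for an inf2.48xlarge (24 Neuron cores).
        "OPTION_TENSOR_PARALLEL_DEGREE": "24",
        # This is the variable involved in the bug discussed above:
        # batch size 1 triggers the garbage output, 2 works around it.
        "OPTION_MAX_ROLLING_BATCH_SIZE": str(max_rolling_batch_size),
        "OPTION_ROLLING_BATCH": "auto",
    }

if __name__ == "__main__":
    env = build_djl_env(max_rolling_batch_size=2)
    print(env["OPTION_MAX_ROLLING_BATCH_SIZE"])
```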
Model download and model compilation look fine. Text generation is garbage.
Expected Behavior
Correct text generation.
Text generation works when I compile the model manually on an EC2 instance with the latest Deep Learning AMI for Neuron (2.19), and load it with transformers-neuronx.
Error Message
Notebook included, with full output. Cloudwatch log for the endpoint included, although I don't see any problem there.
How to Reproduce?
Run the attached notebook.
What have you tried to solve it?
I paid close attention to the model's environment variables, which look correct to me and in line with the docs and samples.
log-events-viewer-result.zip deploy_llama31_70b_inf2.zip