juliensimon opened 1 week ago
FWIW I see the same issue when running the deepjavalibrary/djl-serving:0.29.0-pytorch-inf2
container directly on an inf2.48xlarge instance, so this doesn't look like a SageMaker issue.
curl -X POST "http://127.0.0.1:8080/predictions/my_model" \
-H 'Content-Type: application/json' \
-d '{"inputs": "Deep learning is "}'
{"generated_text": " a000#import.........................."}
I am taking a look at this now and will get back to you asap.
Thanks @juliensimon - I have been able to reproduce this and found a workaround for the moment. The issue only occurs with a batch size of 1. As a short-term fix you can raise OPTION_MAX_ROLLING_BATCH_SIZE
to 2 for testing; I have verified values up to 16.
I am continuing to investigate the BS=1 scenario, since it runs a slightly different code path than higher batch sizes, and will work on a complete fix that can be back-ported.
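For anyone hitting this while running the container directly (as in the earlier comment), the workaround is a single environment variable. A minimal sketch, where the image tag matches the one reported in this thread and the other flags are illustrative (Neuron device mappings and the full model configuration are omitted):

```shell
# Apply the batch-size workaround when launching djl-serving directly.
# Only OPTION_MAX_ROLLING_BATCH_SIZE=2 is the actual workaround; the
# remaining flags are illustrative and your real deployment will need
# the Neuron device mappings and model options from the attached notebook.
docker run -it --rm \
  -p 8080:8080 \
  -e OPTION_MAX_ROLLING_BATCH_SIZE=2 \
  -e HF_MODEL_ID=meta-llama/Meta-Llama-3.1-70B-Instruct \
  deepjavalibrary/djl-serving:0.29.0-pytorch-inf2
```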
Hi Tyler,
Happy to confirm that OPTION_MAX_ROLLING_BATCH_SIZE=2 works.
Thank you for investigating!
Description
I'm deploying meta-llama/Meta-Llama-3.1-70B-Instruct on a SageMaker endpoint (see the attached notebook).
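For context, a minimal sketch of the container environment such a deployment typically uses. The variable names follow the DJL Serving LMI convention, but the specific values here (tensor-parallel degree, entrypoint) are assumptions; the attached notebook is authoritative for the actual configuration:

```python
# Hypothetical sketch of the LMI container environment for an inf2 deployment.
# All values below are assumptions for illustration, not the exact settings
# from the attached notebook.

def build_djl_env(max_rolling_batch_size: int = 1) -> dict:
    """Build the environment dict passed to the djl-serving container."""
    return {
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3.1-70B-Instruct",
        # Assumed tensor-parallel degree for an inf2.48xlarge (24 Neuron cores).
        "OPTION_TENSOR_PARALLEL_DEGREE": "24",
        # This is the variable involved in the bug discussed above:
        # batch size 1 triggers the garbage output, 2 works around it.
        "OPTION_MAX_ROLLING_BATCH_SIZE": str(max_rolling_batch_size),
        "OPTION_ROLLING_BATCH": "auto",
    }

if __name__ == "__main__":
    env = build_djl_env(max_rolling_batch_size=2)
    print(env["OPTION_MAX_ROLLING_BATCH_SIZE"])
```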
Model download and model compilation look fine. Text generation is garbage.
Expected Behavior
Correct text generation.
Text generation works when I compile the model manually on an EC2 instance with the latest Deep Learning AMI for Neuron (2.19), and load it with transformers-neuronx.
Error Message
Notebook included, with full output. Cloudwatch log for the endpoint included, although I don't see any problem there.
How to Reproduce?
Run the attached notebook.
What have you tried to solve it?
I paid close attention to the model's environment variables, which look correct to me and in line with the docs and samples.
log-events-viewer-result.zip deploy_llama31_70b_inf2.zip