aws-neuron / aws-neuron-samples

Example code for AWS Neuron SDK developers building inference and training applications

Long context inference overhead with Llama NeuronX #63

Closed enochlev closed 6 months ago

enochlev commented 6 months ago


I followed the Llama NeuronX tutorial to host Llama2 on Amazon EC2 with NeuronX and TorchServe. The model works well, achieving 50+ tokens/sec as expected.

Issue

However, for my use case the input contexts are 500-3000 tokens. When I provide an example 3000-token context, there is a 10-30 second delay before the first token is generated. After the first token, inference runs at 50 tok/sec as expected.
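For reference, this is roughly how I measure the delay. A minimal sketch; the endpoint URL and the raw-text payload are assumptions from my TorchServe setup, not copied from the handler:

```python
import time
import requests

# Sketch only: time the first streamed chunk (roughly the end of context
# encoding) versus the full response. URL and payload shape are assumptions.
URL = "http://127.0.0.1:8080/predictions/llama-2-13b"
prompt = open("long_context.txt").read()  # ~3000-token input

start = time.time()
first_chunk_at = None
with requests.post(URL, data=prompt, stream=True) as resp:
    for _chunk in resp.iter_content(chunk_size=None):
        if first_chunk_at is None:
            first_chunk_at = time.time()
print(f"time to first token: {first_chunk_at - start:.1f}s")
print(f"total response time: {time.time() - start:.1f}s")
```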

Attempted fixes

I have tried the following to resolve the long context overhead:

model-config.yaml

minWorkers: 2
maxWorkers: 8  #did not help
maxBatchDelay: 20
responseTimeout: 1080
batchSize: 4 #did not help

handler:
    model_checkpoint_dir: "llama-2-13b-split"
    amp: "bf16"
    tp_degree: 6
    max_length: 100

#did not help either
# micro_batching:
#     micro_batch_size: 8
#     parallelism:
#         preprocess: 4
#         inference: 1
#         postprocess: 4
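The one knob I have not tried is on the model-loading side rather than in TorchServe: as far as I can tell, transformers-neuronx accepts a `context_length_estimate` argument that pre-compiles context-encoding graphs for long prompts, which sounds relevant to the prefill delay. A sketch of what I mean, with kwargs that are my assumptions based on the tutorial rather than a verified recipe:

```python
from transformers_neuronx.llama.model import LlamaForSampling

# Sketch only: ask transformers-neuronx to pre-compile context-encoding
# buckets so a ~3000-token prompt hits a prepared graph. Exact kwarg names
# and supported values may differ between releases.
model = LlamaForSampling.from_pretrained(
    "llama-2-13b-split",
    batch_size=1,
    tp_degree=6,
    amp="bf16",
    n_positions=4096,                                  # prompt + generated tokens
    context_length_estimate=[512, 1024, 2048, 3072],   # bucket sizes to compile
)
model.to_neuron()
```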

pip list

  torch 1.13.1+cpu
  torch-model-archiver 0.9.0b20231026
  torch-neuronx 1.13.1.1.12.1
  torch-workflow-archiver 0.2.11b20231026
  torch-xla 1.13.1+torchneuronc
  transformers-neuronx 0.8.268
Log files after startup instance

```
torchserve --ncs --start --model-store model_store --ts-config config.properties --models llama-2-13b
(aws_neuron_venv_pytorch) ubuntu@ip-10-72-158-249:~/serve/examples/large_models/inferentia2/llama2$
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
2023-12-04T23:54:37,499 [WARN ] main org.pytorch.serve.util.ConfigManager - Your torchserve instance can access any URL to load models. When deploying to production, make sure to limit the set of allowed_urls in config.properties
2023-12-04T23:54:37,501 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Initializing plugins manager...
2023-12-04T23:54:37,545 [INFO ] main org.pytorch.serve.metrics.configuration.MetricConfiguration - Successfully loaded metrics configuration from /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
2023-12-04T23:54:37,683 [INFO ] main org.pytorch.serve.ModelServer - Torchserve version: 0.9.0
TS Home: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages
Current directory: /home/ubuntu/serve/examples/large_models/inferentia2/llama2
Temp directory: /tmp
Metrics config path: /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml
Number of GPUs: 0
Number of CPUs: 96
Max heap size: 30688 M
Python executable: /opt/aws_neuron_venv_pytorch/bin/python
Config file: config.properties
Inference address: http://127.0.0.1:8080
Management address: http://127.0.0.1:8081
Metrics address: http://127.0.0.1:8082
Model Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Initial Models: llama-2-13b
Log dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Metrics dir: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/logs
Netty threads: 0
Netty client threads: 0
Default workers per model: 96
Blacklist Regex: N/A
Maximum Response Size: 6553500
Maximum Request Size: 6553500
Limit Maximum Image Pixels: true
Prefer direct buffer: false
Allowed Urls: [file://.*|http(s)?://.*]
Custom python dependency for model allowed: true
Enable metrics API: true
Metrics mode: log
Disable system metrics: false
Workflow Store: /home/ubuntu/serve/examples/large_models/inferentia2/llama2/model_store
Model config: N/A
2023-12-04T23:54:37,689 [INFO ] main org.pytorch.serve.servingsdk.impl.PluginsManager - Loading snapshot serializer plugin...
2023-12-04T23:54:37,703 [INFO ] main org.pytorch.serve.ModelServer - Loading initial models: llama-2-13b
2023-12-04T23:54:37,709 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createTempDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5
2023-12-04T23:54:37,710 [INFO ] main org.pytorch.serve.archive.model.ModelArchive - createSymbolicDir /tmp/models/6b6627abd2334517acf43ddc5e377cd5/llama-2-13b
2023-12-04T23:54:37,718 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Adding new version 1.0 for model llama-2-13b
2023-12-04T23:54:37,719 [DEBUG] main org.pytorch.serve.wlm.ModelVersionedRefs - Setting default version to 1.0 for model llama-2-13b
2023-12-04T23:54:48,067 [INFO ] main org.pytorch.serve.wlm.ModelManager - Model llama-2-13b loaded.
2023-12-04T23:54:48,067 [DEBUG] main org.pytorch.serve.wlm.ModelManager - updateModel: llama-2-13b, count: 2
2023-12-04T23:54:48,074 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9000, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,074 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerLifeCycle - Worker cmdline: [/opt/aws_neuron_venv_pytorch/bin/python, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/model_service_worker.py, --sock-type, unix, --sock-name, /tmp/.ts.sock.9001, --metrics-config, /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml]
2023-12-04T23:54:48,075 [INFO ] main org.pytorch.serve.ModelServer - Initialize Inference server with: EpollServerSocketChannel.
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Inference API bind to: http://127.0.0.1:8080
2023-12-04T23:54:48,125 [INFO ] main org.pytorch.serve.ModelServer - Initialize Management server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Management API bind to: http://127.0.0.1:8081
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Initialize Metrics server with: EpollServerSocketChannel.
2023-12-04T23:54:48,126 [INFO ] main org.pytorch.serve.ModelServer - Metrics API bind to: http://127.0.0.1:8082
Model server started.
2023-12-04T23:54:48,272 [WARN ] pool-3-thread-1 org.pytorch.serve.metrics.MetricCollector - worker pid is not available yet.
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40732955932617|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85419082641602|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:364036.0625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:12472.20703125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,314 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:3.9|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734088
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9000, pid=492260
2023-12-04T23:54:48,779 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9000
2023-12-04T23:54:48,779 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - s_name_part0=/tmp/.ts.sock, s_name_part1=9001, pid=492261
2023-12-04T23:54:48,780 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Listening on port: /tmp/.ts.sock.9001
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492261
2023-12-04T23:54:48,786 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,787 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Successfully loaded /opt/aws_neuron_venv_pytorch/lib/python3.8/site-packages/ts/configs/metrics.yaml.
2023-12-04T23:54:48,787 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - [PID]492260
2023-12-04T23:54:48,787 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch worker started.
2023-12-04T23:54:48,788 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Python runtime: 3.8.10
2023-12-04T23:54:48,788 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change null -> WORKER_STARTED
2023-12-04T23:54:48,790 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9000
2023-12-04T23:54:48,790 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Connecting to: /tmp/.ts.sock.9001
2023-12-04T23:54:48,797 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9001.
2023-12-04T23:54:48,797 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Connection accepted: /tmp/.ts.sock.9000.
2023-12-04T23:54:48,799 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,799 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd LOAD to backend at: 1701734088799
2023-12-04T23:54:48,833 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,833 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - model_name: llama-2-13b, batchSize: 8
2023-12-04T23:54:48,997 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,000 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Letting libtpu.so load fail during _XLAC import. libtpu.so will be loaded from `libtpu` Python package when the ComputationClient is created.
2023-12-04T23:54:49,523 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,532 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Your torch version is 1.13.1+cpu which does not support torch.compile
2023-12-04T23:54:49,543 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,544 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,545 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - ONNX enabled
2023-12-04T23:54:49,553 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Torch TensorRT not enabled
2023-12-04T23:54:49,555 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Setting micro batching size: 1
2023-12-04T23:54:58,772 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:54:58,789 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Starting to compile the model
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:34,910 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:34.0909 492260:492606 [6] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2023-12-04T23:55:35,178 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - 2023-Dec-04 23:55:35.0178 492261:492613 [0] init.cc:138 CCOM WARN OFI plugin initNet() failed is EFA enabled?
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:0.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40731430053711|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,311 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.85420608520508|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:342452.0390625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:34056.08203125|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:55:48,312 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:9.6|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734148
2023-12-04T23:56:01,531 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:01,537 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 72704
2023-12-04T23:56:01,538 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9000-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:01,538 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:73466.0|#WorkerName:W-9000-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:01,539 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734161
2023-12-04T23:56:02,630 [INFO ] W-9001-llama-2-13b_1.0-stdout MODEL_LOG - Model has been successfully compiled
2023-12-04T23:56:02,632 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 73799
2023-12-04T23:56:02,633 [DEBUG] W-9001-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - W-9001-llama-2-13b_1.0 State change WORKER_STARTED -> WORKER_MODEL_LOADED
2023-12-04T23:56:02,633 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerLoadTime.Milliseconds:74560.0|#WorkerName:W-9001-llama-2-13b_1.0,Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:02,634 [INFO ] W-9001-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:36.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734162
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - CPUUtilization.Percent:9.1|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskAvailable.Gigabytes:63.40730667114258|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,312 [INFO ] pool-3-thread-1 TS_METRICS - DiskUsage.Gigabytes:178.8542137145996|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - DiskUtilization.Percent:73.8|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryAvailable.Megabytes:330775.37890625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUsed.Megabytes:45732.69140625|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208
2023-12-04T23:56:48,313 [INFO ] pool-3-thread-1 TS_METRICS - MemoryUtilization.Percent:12.7|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734208

... some time later when I call the API

2023-12-05T00:00:48,437 [INFO ] epollEventLoopGroup-3-1 TS_METRICS - ts_inference_requests_total.Count:1.0|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,458 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Flushing req.cmd PREDICT to backend at: 1701734448458
2023-12-05T00:00:48,461 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Backend received inference at: 1701734448
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Preprocessing
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - received req=At the far end of town where the Gricklegrass grows and the wind smells slowandsour when it blows and no
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - birds ever sing excepting old crows is the Street of the Lifted Lorax
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And deep in the Gricklegrass some people say if you look deep enough you can still see today where the
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lorax once stood just as long as it could before somebody lifted the Lorax away
2023-12-05T00:00:48,463 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - What was the Lorax Any why was it there And why was it lifted and taken somewhere from the far end of
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - town where the Gricklegrass grows The old Onceler still lives here
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Ask him he knows
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - You wont see the Onceler Dont knock at his door He stays in his Lerkim on top of his store He stays in his
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Lerkim cold under the floor where he makes his own clothes out of miffmuffered moof And on special dank
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - midnights in August he peeks out of the shutters and sometimes he speaks and tells how the Lorax was lifted
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - away Hell tell you perhaps if youre willing to pay
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - On the end of a rope he lets down a tin pail and you have to toss in fifteen cents and a nail and the shell of a
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - greatgreatgreat grandfather snail
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then he pulls up the pail makes a most careful count to see if youve paid him the proper amount Then he
2023-12-05T00:00:48,464 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hides what you paid him away in his Snuvv his secret strange hole in his gruvvulous glove Then he grunts I
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - will call you by WhispermaPhone for the secrets I tell you are for your ears alone
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - SLUPP Down slupps the WhispermaPhone to your ear and the old Oncelers whispers are not very clear
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - since they have to come down through a snergelly hose and he sounds as if he had smallish bees up his nose
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Now Ill tell you he says with his teeth sounding gray how the Lorax got lifted and taken away It all started
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - way back such a long long time back
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Way back in the days when the grass was still green and the pond was still wet and the clouds were still clean
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - and the song of the SwomeeSwans rang out in space one morning I came to this glorious place And I first
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - saw the trees The Truffula Trees The brightcolored tufts of the Truffula Trees Mile after mile in the fresh
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - morning breeze
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And under the trees I saw Brown Barbaloots frisking about in their Barbaloot suits as the played in the
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - shade and ate Truffula Fruits From the rippulous pond came the comfortable sound of the HummingFish
2023-12-05T00:00:48,465 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - humming while splashing around
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But those trees Those trees Those Truffula Trees All my life Id been searching for trees such as these The
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - touch of their tufts was much softer than silk And they had the sweet smell of fresh butterfly milk
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I felt a great leaping of joy in my heart I knew just what Id do I unloaded my cart In no time at all I had built
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - a small shop Then I chopped down a Truffula Tree with one chop And with great skillful skill and with great
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - speedy speed I took the soft tuft And I knitted a Thneed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - The instant Id finished I heard a gaZump I looked I saw something pop out of the stump of the tree Id
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chopped down It was sort of a man Describe himThats hard I dont know if I can He was shortish and
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - oldish and brownish and mossy And he spoke with a voice that was sharpish and bossy
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Mister He said with a sawdusty sneeze I am the Lorax I speak for the trees I speak for the trees for the trees
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - have no tongues And Im asking you sir at the top of my lungs he was very upset as he shouted and puffed
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Whats that THING youve made out of my Truffula tuft
2023-12-05T00:00:48,466 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Look Lorax I said Theres no cause for alarm I chopped just one tree I am doing no harm Im being quite
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - useful This thing is a Thneed A Thneeds a FineSomethingThatAllPeopleNeed Its a shirt Its a sock Its a
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - glove Its a hat But it has other uses Yes far beyond that You can use it for carpets For pillows For sheets
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Or curtains Or covers for bicycle seats The Lorax said Sir You are crazy with greed There is no one on earth
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - who would buy that fool Thneed
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the very next minute I proved he was wrong For just at that minute a chap came along and he thought
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - that the Thneed I had knitted was great He happily bought it for three ninetyeight I laughed at the Lorax You
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - poor stupid guy You never can tell what some people will buy
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I repeat cried the Lorax I speak for the trees
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Im busy I told him Shut up if you please I rushed cross the room and in no time at all built a radiophone I
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - put in a quick call I called all my brothers and uncles and aunts and I said listen here Heres a wonderful
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - chance for the whole Onceler Family to get mighty rich Get over here fast Take the road to North Nitch Turn
2023-12-05T00:00:48,467 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - left at Weehawken Sharp right at South Stitch
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And in no time at all in the factory I built the whole Onceler Family was working full tilt We were all knitting
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds just as busy as bees to the sound of the chopping of Truffula Trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Then Oh Baby Oh How my business did grow Now chopping one tree at a time was too slow So I quickly
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - invented my SuperAxeHacker which whacked off four Truffula Trees at one smacker We were making
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds four times as fast as before And that Lorax He didnt show up any more
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - But the next week he knocked on my new office door He snapped Im the Lorax who speaks for the trees
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - which you seem to be chopping as fast as you please But Im also in charge of the Brown Barbaloots who
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - played in the shade in their Barbaloot suits and happily lived eating Truffula Fruits NOWthanks to your
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - hacking my trees to the ground theres not enough Truffula Fruit to go round
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - And my poor Barbaloots are all getting the crummies because they have gas and no food in their tummies
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - They loved living here But I cant let them stay Theyll have to find food And I hope that they may Good luck
2023-12-05T00:00:48,468 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - boys he cried And he sent them away
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I the Onceler felt sad as I watched them all go BUT business is business And business must grow
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - regardless of crummies in tummies you know
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - I meant no harm I most truly did not But I had to grow bigger So bigger I got I biggered my factory I
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - biggered my roads I biggered my wagons I biggered the loads of the Thneeds I shipped out I was shipping
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - them forth to the South To the East To the West To the North I went right on biggeringselling more
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Thneeds And I biggered my money which everyone needs
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - 3
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG -
2023-12-05T00:00:48,469 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - This story is about
2023-12-05T00:00:48,508 [INFO ] W-9000-llama-2-13b_1.0 ACCESS_LOG - /127.0.0.1:50848 "POST /predictions/llama-2-13b HTTP/1.1" 200 73
2023-12-05T00:00:48,510 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - Requests2XX.Count:1.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734448
2023-12-05T00:00:48,511 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,523 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,590 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,658 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,725 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,793 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,860 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,928 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:08,995 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,063 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:09,130 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
.....
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,608 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: false
2023-12-05T00:01:12,609 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_LOG - Inferance
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]HandlerTime.Milliseconds:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,1701734472,beab1a87-913c-4302-9548-c25943c30243, pattern=[METRICS]
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_inference_latency_microseconds.Microseconds:2.4171749336E7|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - ts_queue_latency_microseconds.Microseconds:20370.777|#model_name:llama-2-13b,model_version:default|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.job.Job - Waiting time ns: 20370777, Backend time ns: 24152110030
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - QueueTime.Milliseconds:20.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [DEBUG] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - sent a reply, jobdone: true
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 org.pytorch.serve.wlm.WorkerThread - Backend response time: 24125
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0 TS_METRICS - WorkerThreadTime.Milliseconds:27.0|#Level:Host|#hostname:ip-10-72-158-249,timestamp:1701734472
2023-12-05T00:01:12,610 [INFO ] W-9000-llama-2-13b_1.0-stdout MODEL_METRICS - HandlerTime.ms:24147.4|#ModelName:llama-2-13b,Level:Model|#hostname:ip-10-72-158-249,requestID:beab1a87-913c-4302-9548-c25943c30243,timestamp:1701734472
```

Ask

Is there something I'm missing in the configuration or in how I'm using Llama NeuronX that would remove the long-context overhead? I would like sub-second first-token latency for 500-3000 token contexts.

The alternative is to deploy with SageMaker, but I don't have that set up, because we want to rewrite inference.py to extract logits and restrict Llama to constrained generation.
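For context, the constrained generation we have in mind only needs the per-step logits. A rough, library-agnostic sketch in plain PyTorch, where `model_forward` is a hypothetical stand-in for whatever the rewritten inference.py would expose:

```python
from typing import List

import torch

def constrained_next_token(logits: torch.Tensor, allowed_ids: List[int]) -> int:
    # Illustration only: mask every token outside the allowed set, then
    # pick the highest-scoring remaining token.
    mask = torch.full_like(logits, float("-inf"))
    mask[allowed_ids] = 0.0
    return int(torch.argmax(logits + mask))

# Hypothetical usage:
# logits = model_forward(input_ids)[0, -1, :]  # last-position logits
# next_id = constrained_next_token(logits, allowed_ids=[...])
```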

Let me know if any other details would be helpful in troubleshooting this. Thanks!

aws-donkrets commented 6 months ago

Hi enochlev - looks like you posted the same issue against two repos. If so, let's keep the discussion on the other repo.