A solution could be to allow a --warmup CLI option when starting the server.
Sure we can remove that warmup. Thanks for bringing it to my attention. We have the mmap() progress bar called "schlep" too. As you probably figured out, that can be disabled using --log-disable when running in a terminal. It'll also be auto-disabled if you're using a GPU. Lastly, it won't happen if stderr is redirected to a file.
Here's 453ms latency to generate a single token using a 14 gig model:
main jart@luna:~/llamafile$ rusage o//llama.cpp/main/main -m /weights/Mistral-7B-Instruct-v0.3.BF16.gguf --cli -n 1 --log-disable --temp 0 --special
<s> Question
took 449,432µs wall time
ballooned to 13,960,704kb in size
needed 2,054,365us cpu (50% kernel)
caused 238,571 page faults (100% memcpy)
38 context switches (52% consensual)
llamafile doesn't have much overhead of its own. It goes down to 78 milliseconds if you use a 20mb model, like this embedding model.
main jart@luna:~/llamafile$ rusage o//llama.cpp/main/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf -p orange --embedding --log-disable
-0.0317844 0.0300445 -0.0606566 0.054171 0.0384351 -0.0137378 0.193084 -0.0227727 0.0451493 -0.0866957 -0.0250678 -0.0252941 -0.0158448 -0.00979887 -0.0138599 0.0655941 0.0220919 -0.0199303 -0.145232 -0.0558298 -0.0498061 -0.00320967 -0.0345226 0.00696317 -0.106107 -0.0237895 0.0167643 0.0300393 -0.0331833 -0.101823 -0.0089697 0.00977722 0.0802716 -0.0223554 -0.0767943 -0.118327 0.0326562 -0.0308867 -0.0448472 0.0148508 -0.020563 -0.0048809 -0.00607477 0.033355 -0.00568612 0.0160373 -0.0122561 0.0225163 -0.00218899 0.0412294 0.00167923 -0.0129846 -0.0715167 -0.0263407 -0.0436248 0.0330241 -0.0337614 -0.014304 -0.0256707 0.0312791 0.0919942 -0.0440335 -0.0308184 0.0524109 -0.0132108 -0.00422794 0.0116842 -0.0541004 0.00804723 -0.0890174 0.00635702 0.0603622 0.0673633 0.0718813 0.0386784 -0.00149608 0.0865788 0.0419986 0.0461259 0.00387685 -0.0250578 0.0297764 -0.145872 0.0302247 0.0188123 0.0541922 -0.084763 0.000495586 -0.0519527 0.136152 -0.0577144 0.064632 0.0753918 -0.00302461 -0.033198 0.0183073 0.0113617 -0.0648654 -0.0638555 0.209148 0.00588307 0.0680787 -0.0403727 0.00932277 -0.0203395 -0.0937304 -0.0404827 -0.0684583 -0.00795326 0.0390955 0.0279294 -0.0403867 -0.0224484 -0.0226345 0.0156785 0.0117315 -0.000392398 0.0279004 -0.0213723 0.0498216 0.0261546 0.00646406 -0.0591086 0.0122234 -0.0749251 -0.0659139 0.0215466 -4.85664e-33 0.0722578 0.00451963 0.0429557 0.00854541 0.0581629 0.0312819 -0.0347967 -0.0745112 -0.00487645 0.0262612 -0.0616393 0.0491644 -0.0723496 -0.0175136 0.14027 0.0501096 0.0224043 0.0811346 -0.0895 0.000163827 -0.0587726 0.0999836 -0.0246047 0.0316603 -0.0484162 -0.00111274 -0.0403415 -0.0432963 -0.0156516 0.0201045 0.00339398 -0.00601431 0.042451 0.0558 -0.0327016 -0.137818 0.0116949 -0.0490471 0.0666016 0.0596681 -0.0255069 0.0230668 -0.0215572 0.111168 -0.0294185 -0.0508142 0.0644108 0.0266289 0.02384 0.000905127 -0.0861847 -0.0485339 0.00055331 0.018121 -0.0377669 -0.0166628 0.0168942 0.0540773 -0.0048163 -0.0336656 -0.0210025 0.083861 0.00834526 0.00762523 -0.0995033 -0.0281909 -0.091662 0.0011698 0.0325173 -0.0241976 -0.00604925 -0.0168084 0.117599 -0.0256133 -0.0496626 -0.0493583 0.0663069 -0.00879042 0.0143084 0.00601507 -0.0757068 -0.015899 -0.0276441 0.0453369 -0.046315 0.0715787 0.0178674 0.0554027 0.0236224 0.0236353 -0.0953411 0.0124677 0.0258042 -0.0667922 -0.0250962 3.27808e-33 0.0181095 -0.0221881 0.0096204 0.0479727 0.0456671 -0.0496416 -0.0355792 0.0309866 -0.0438628 -0.0392589 -0.0121348 -0.0275254 0.0289592 0.0530935 0.0545679 0.0793105 0.0716763 0.109363 0.0232857 0.0415179 -0.106621 -0.0469415 -0.0183794 0.0187144 0.000143419 0.0379738 0.0275034 -0.055134 -0.0616683 0.0170081 0.0217906 -0.0095864 0.0260517 0.0132381 0.0387192 0.0527441 0.120446 -0.127352 -0.0642029 0.0661494 0.0251513 -0.0708307 0.0168207 0.120362 0.00344404 0.0357786 -0.0691282 0.0378288 0.0465292 0.00535672 0.0610712 0.0041889 -0.0159695 0.0470248 -0.00576334 -0.0536096 -0.0681606 0.0330785 -0.0455144 0.0717353 0.01558 -0.0110995 -0.0111986 0.0222279 -0.0252177 -0.0103118 0.0438924 0.0341115 -0.0553085 -0.0553117 0.0481457 -0.0222958 -0.0129041 0.0200907 0.0567879 -0.0727229 -0.0202945 -0.00372828 -0.0258849 0.0817826 -0.119873 -0.00585168 0.00410673 0.0358944 -0.0016022 0.0395851 0.0257963 0.0423478 0.0191146 -0.0615523 -0.0118891 0.0438561 0.00813753 0.0490666 -0.0648762 -1.31135e-08 0.085139 -0.0613255 0.00293221 -0.0022764 -0.0085511 0.0158048 -0.080317 0.00675965 -0.013794 0.015857 0.0227918 0.0521703 0.0151762 0.0422474 0.0259102 -0.0472203 -0.0490228 -0.0133137 
0.0033711 0.0440844 -0.0903427 0.0440017 -0.0182679 0.0130034 0.0229458 0.00426531 -0.0213828 0.0760957 0.045364 0.0635367 0.0328951 0.0635172 -0.0515487 0.0257204 -0.0843814 -0.0293494 0.0484707 0.0288854 -0.0666442 0.00361252 -0.0132014 -0.0146651 -0.0566963 -0.0143876 -0.0168855 0.0128348 -0.0275001 -0.00356255 -0.0821031 -0.0766623 -0.0118402 0.0385588 -0.00278496 0.104344 0.0234341 -0.0598427 0.0483119 -0.00854453 -0.0668312 0.0183582 0.0972029 0.0190171 0.0534545 0.00417919
took 78,670µs wall time
ballooned to 24,576kb in size
needed 2,658,928us cpu (81% kernel)
caused 9,054 page faults (89% memcpy)
1,074 context switches (97% consensual)
performed 21,904 read and 0 write i/o operations
I hope this helps. Reach out to me with anything you need. You're also welcome to direct message me on Discord (see the link in the README) if you don't hear back on anything in three business days.
As you probably figured out, that can be disabled using --log-disable when using a terminal.
This was confusing to me because I am starting my server process using this command:
/opt/gemma-2-9b-it.Q2_K.llamafile \
--nobrowser \
--log-disable \
--host 127.0.0.1 \
--port 8080
And I can definitely see the warmup in the logs and the slowdown.
83 seconds to load a 3.6gb file. Do you have a 5400 rpm disk connected over gigabit ethernet? You're going to pay that cost no matter what you do. If it doesn't happen in the warmup, then it's going to happen to the first client you encounter over the network. The fact that you're using AWS Lambda means that the OS file system won't be able to amortize this cost.
I've reverted the change I just made because after seeing more details, I'm reasonably certain that removing warmup will be detrimental to the service you're building. If you can help me understand better what you'd like to see happen and how that's actionable on llamafile, then please let me know!
83 seconds to load a 3.6gb file. Do you have a 5400 rpm disk connected over gigabit ethernet?
lol (✖╭╮✖)
FWIW, it varies wildly. I've seen 20s too. It might be low ulimits on memory. I'm just guessing.
then it's going to happen to the first client you encounter over the network.
Which is actually desirable. If the server can be ready quicker, I can use provisioned concurrency. But because the server takes more than 10s to boot, I'm blocked from that. Does that help?
I don't know what provisioned concurrency is. But I'd assume that, with warmup removed, you would have some other system send the warmup request automatically, and then you'd block any user traffic from hitting the server until that request is completed. Otherwise, if the first user request takes 85 seconds, your tail latency is going to be on the moon.
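A minimal sketch of that gating approach, assuming the /health and /completion endpoints of the llama.cpp-based server (adjust the paths and payload to whatever your build actually exposes; /tmp/llamafile.ready is a made-up readiness marker for whatever fronts the server):
#!/bin/sh
# Start the server in the background; nothing routes user traffic to it yet.
/opt/gemma-2-9b-it.Q2_K.llamafile --server --nobrowser --host 127.0.0.1 --port 8080 &
# Wait for the listener to come up.
until curl -sf http://127.0.0.1:8080/health >/dev/null; do sleep 0.2; done
# One throwaway generation is the warmup request; it pays the mmap page-fault cost.
curl -sf http://127.0.0.1:8080/completion -H 'Content-Type: application/json' -d '{"prompt": "hi", "n_predict": 1}' >/dev/null
# Only now signal the proxy that it may send user requests.
touch /tmp/llamafile.ready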
The Lambda Runtime and invoke model is odd. The goal for web services (anything in the INIT phase) is to start in 10s or less. In that runtime we don't have the tools common in K8s that could do another warmup prior to traffic hitting the service. So that first user request will get the perf hit and, yes, tail latency will be high. However, I've seen those disappear with enough traffic.
I'm not trying to build an inference architecture on Lambda, yet. I'd just like to showcase llamafile on Lambda and get folks thinking about the portability of this project. I think I can get it working if the warmup happened outside of the server start process. That way the HTTP proxy (another Lambda Extension) can connect to it during the INIT phase in under 10s. My hope, too, is that the change / feature request makes sense for everyone, not just the wonky things I'm doing. Hope that helps?
You have the opportunity to be the first person to productionize the brand new llamafile server v2.0 that I'm working on. So far it has an /embedding endpoint. Embedding models are tiny. Here's a good one you can use that's 22mb:
all-MiniLM-L6-v2.Q6_K.gguf.zip
You can then say:
make -j32 o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf &
curl http://127.0.0.1:8080/embedding?prompt=hello+world
And you get:
{
"add_special": true,
"parse_special": false,
"tokens_provided": 4,
"tokens_used": 4,
"embedding": [-0.04493919, 0.017154299, -0.017763417, 0.07448399, -0.06355986, -0.06065534, 0.13253167, 0.081204176, 0.008555814, 0.03046815, 0.04551083, 0.10860967, -0.016242307, 0.004101579, -0.041045796, 0.0113923745, -0.04387224, 0.037455086, -0.06516868, -0.030158972, -0.014199414, 0.044375688, 0.019663189, -0.014969509, -0.049472645, -0.03768263, 0.050107747, -0.009327268, -0.053667273, -0.01327585, 0.043741602, 0.053064425, 0.0190503, -0.0016131222, -0.116672374, 0.060765408, -0.039683133, -0.061072234, 0.066110305, 0.057212673, 0.010475249, -0.035049397, -0.080343835, -0.09146147, 0.09561678, -0.019614108, 0.012206479, 0.028192922, 0.05992163, 0.029720828, -0.0771741, -0.06288391, -0.020098347, -0.010743567, 0.0459911, 0.025859432, 0.039434757, -0.05264431, 0.010904848, -0.028012881, 0.02374525, 0.062041078, -0.07571215, 0.13744053, 0.021624766, -0.08648439, -0.002497034, -0.07037027, -0.10528944, -0.07108713, 0.011088633, -0.011481084, -0.028293252, -0.007890491, -0.005932442, -0.012344286, -0.0109595535, -0.0032645876, -0.0023649456, 0.018673459, -0.08769426, -0.002446007, 0.01252563, 0.0018258023, -0.029998625, -0.03557183, 0.08785064, -0.032543737, -0.042549286, 0.021124562, -0.11301933, 0.034332976, -0.004828803, 0.016387476, 0.012188256, 0.0058172783, 0.03591154, 0.048713926, -0.08914432, -0.1097462, 0.04073567, 0.01997938, 0.074975684, 0.062538765, 0.017494302, -0.018181488, 0.03946002, 0.040420637, -0.008207352, -0.018139018, 0.013168835, -0.0324692, -0.008483407, 0.06435668, 0.053918466, 0.033186734, -0.048995394, -0.024699224, -0.045104366, -0.028299782, -0.025086576, 0.0053605135, -0.0030724953, 0.007501989, 0.00075993023, -0.025861925, 0.013164538, -8.42789e-33, -0.03751949, -0.00060935406, 0.01818463, 0.060445886, 0.057249576, -0.00454838, -0.087184176, -0.024801785, 0.036835834, 0.018379813, 0.038673427, -0.04266556, -0.12507263, 0.101727076, 0.037376035, 0.08824377, -0.018737393, -0.015729554, 0.09533349, 0.075878516, -0.062962264, 0.03401959, -0.012127425, 0.113746956, -0.039824422, -0.0041125747, 0.017515982, -0.013790555, 0.07989204, 0.025105074, -0.07371306, -0.011674851, 0.022027943, 0.06509151, -0.009197994, -0.008043244, -0.023458103, -0.08415743, -0.03963684, 0.013594596, -0.05823141, 0.0067144446, -0.009583412, 0.006321735, 0.024771461, 0.023584427, 0.044459235, -0.0110342465, 0.02349242, 0.0010847435, 0.059136067, 0.0012264845, 0.03378655, 0.049129665, -0.03489447, -0.014035115, 0.038949408, -0.10317868, -0.013410221, 0.012336223, -0.026134353, 0.061911713, -0.08584544, -0.024133861, -0.1194289, -0.029716695, 0.053951215, 0.09612765, 0.1082146, -0.0017689393, -0.10728463, 0.002305323, 0.052542552, -0.09490232, -0.017203419, -0.064931914, -0.041796625, 0.015041223, 0.09748269, -0.031595428, 0.001543152, -0.056436412, -0.024242735, -0.04094636, 0.12323862, 0.026265696, 0.03147266, -0.027309164, 0.046299897, -0.104778655, -0.033836756, 0.08807632, -0.053457998, 0.027746594, -0.11549485, 6.4928996e-33, 0.0657978, 0.040416345, -0.042737156, 0.01740134, 0.06335029, -0.018121772, -0.028165279, 0.07730383, -0.08981612, -0.033933606, -0.012839413, -0.018129889, 0.14099182, 0.023489056, 0.052748263, 0.043177053, 0.10888425, 0.01922531, 0.007142953, -0.007967073, -0.024551447, -0.03320833, -0.023929968, 0.032950286, -0.07368368, 0.019020025, 0.07963569, -0.02202817, -0.05426177, 0.008600079, 0.034788147, -0.008968444, -0.048602484, -0.12428896, -0.0013823424, 0.028156962, 0.003410018, 0.052016187, -0.012831952, -0.028521812, 0.017593833, -0.043004576, -0.02403608, 
0.1564648, -0.031477723, -0.030975487, 0.0029324866, 0.026550235, 0.005813947, -0.07989965, -0.07071022, 0.034464963, -0.037894756, 0.086976506, -0.048087597, 0.062040742, 0.04788673, 0.004470726, 0.05594446, -0.015565888, 0.008612556, 0.023244102, 0.007845828, -0.014502636, 0.045052946, 0.0045669237, -0.05205902, -0.10034152, -0.060900856, -0.014170271, 0.033324067, -0.028532451, 0.044034753, 0.07135723, -0.06429541, -0.005717306, -0.10879912, 0.014676974, 0.036444712, -0.10165959, 0.008652384, -0.03574301, 0.011172098, 0.036123447, -0.0013997691, 0.007782806, 0.09904637, 0.049254775, 0.017006103, -0.06283049, -0.04298161, -0.022788929, 0.09389584, 0.016311547, -0.00571772, -9.76152e-9, -0.03910692, -0.03195565, -0.028496021, 0.04153002, 0.06418562, 0.052059688, -0.038764857, -0.02461702, -0.03176211, 0.019020135, 0.04780212, 0.09311188, -0.004919143, 0.026526093, 0.08771144, 0.03910133, 0.019983923, 0.013921658, -0.03573494, -0.026665179, -0.0009486654, -0.024113627, 0.038966633, -0.020944824, 0.010889862, -0.016214907, -0.02407945, 0.14222126, -0.015377086, -0.00013050997, 0.05477597, 0.05412229, -0.030746434, -0.027177932, -0.021219095, 0.08363759, 0.03324958, -0.008774529, 0.01513348, 0.029486988, -0.0334652, -0.004751453, 0.024872666, -0.037752517, -0.03377062, -0.017468425, -0.010234554, -0.02362223, 0.0735807, 0.0053404793, -0.04306921, 0.011759448, 0.00729336, 0.086781636, 0.07198262, 0.07773528, 0.029469218, -0.020530857, -0.009965061, 0.059326723, 0.023873243, -0.012661998, 0.0770656, -0.030757705]
}
Embeddings are even better than LLMs. Serving them is a perfect fit for AWS Lambda.
Oh, there's also a tokenization endpoint:
jtunn@gothbox:~$ curl http://127.0.0.1:8080/tokenize?prompt=hello+world
{
"add_special": true,
"parse_special": false,
"tokens": [
"[CLS]",
" hello",
" world",
"[SEP]"
]
}
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf &
Aside, is there a preferred way to run a --server? With the .llamafile or using the llamafile binary? I'm guessing the latter allows me to use unreleased features with existing .llamafiles? Also, thanks so much for all your help. I've run a few large open source projects and we users can be a chore. Appreciate ya.
o//llamafile/server/main has to be built from source. In the future, it'll be called llamafile --server. But right now it's a separate binary that's independent of our releases. The current llamafile --server came from a folder named "examples" in llama.cpp. I'm designing this new server so that it'll do a better job supporting production environments such as yours. Why be a customer of OpenAI when you can build your own OpenAI? That's the dream, and your support simply using it and reporting bugs would help the project get there.
BTW, if I were to build llamafile locally with the commit where warmup was disabled and it worked perfectly, would you consider reopening this issue or providing a --warmup=false option?
@jart The no warmup option worked very well. Could we please reopen this issue? Here is what I did.
Build and install a llamafile executable from the warmup=false commit (21a30bed). This is the Docker image that is deployed to Lambda. I'm sure this could have been done differently, but I just wanted to quickly test if no warmup worked well.
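# Build stage: compile llamafile at the no-warmup commit and install it under /opt/local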
FROM node:20 as llamafile-builder
RUN mkdir /opt/local && \
cd /tmp && \
git clone https://github.com/Mozilla-Ocho/llamafile.git && \
cd llamafile && \
git checkout 21a30bed && \
make && \
make install PREFIX=/opt/local
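# Runtime stage: copy the installed llamafile binaries into the final image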
FROM node:20
# ...
COPY --from=llamafile-builder /opt/local /usr/local
# ...
Nothing special here. We launch the llamafile via the CMD or ENTRYPOINT much like any other workload. The key difference for Lambda is that our liveness INIT phase needs to be < 10s total. For the patched llamafile I moved from using the .llamafile as an all-in-one executable...
/opt/gemma-2-9b-it.Q2_K.llamafile --server ...
... to using the llamafile executable. I even tested this with Gemma 2's .gguf file via bartowski's Hugging Face page. Same results, worked great.
llamafile --model /opt/gemma-2-9b-it.Q2_K.llamafile --server ...
I believe that llamafile should have distinct "liveness" vs. "readiness" states for the server. These states are normal for containerized workloads and the names are pulled directly from K8s here. It is my hope that, even though my work is around Lambda, the distinct behaviors here are not myopic to one compute platform and are generally beneficial.
/opt/gemma-2-9b-it.Q2_K.llamafile \
--server \
--warmup false \
--nobrowser \
--log-disable \
--host 127.0.0.1 \
--port 8080
Here is a suggestion on how to get there. These steps could be done all in one or as needed; a rough sketch of how they'd be used follows the list.
1) Allow a new --warmup false server option. Keep the default as true.
2) Expose a /warmup server endpoint. Optionally, warmup could be part of the health check.
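To make that concrete, here's a rough sketch of how the second piece might be used once the server above is started with --warmup false. The /warmup path is the proposed endpoint, not something that exists today; a normal inference request works as a stand-in:
# Run after INIT, once the function is live: fault the weights in explicitly
# (proposed endpoint) before real prompts arrive, e.g. from an extension or
# the first invoke.
curl -sf -X POST http://127.0.0.1:8080/warmup >/dev/null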
Sure I can do that. I'll just add it as a flag, as you suggested, since I think doing the warmup is good in general.
Thanks for your patience. I've added the warmup flag. Let me know if there are any issues with it. As for the warmup endpoint, could you just try that by sending a normal request? Contributions welcome if you want /warmup, since I'm not 100% certain it's needed.
Also this change just got shipped in our 0.8.12 release. Enjoy!
Thank you @jart. Seems to be working as expected. I updated my blog post with those notes as well. Appreciate it.
Feature Description
I need a cold start or readiness check to be as fast as possible. Hope there is a way to disable the warmup (warming up the model with an empty run) when starting the server.
Motivation
I am using llamafile with AWS Lambda behind the Lambda Web Adapter. The lazier I can be on init, the better I can get a working instance running without hitting various 10s timeout issues.
Possible Implementation
Not sure, thinking a --warmup=false server option would be helpful?