Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

Feature Request: Can the Llamafile server be ready prior to model warming? #485

Closed metaskills closed 2 months ago

metaskills commented 2 months ago

Feature Description

I need the cold start / readiness check to be as fast as possible. I hope there is a way to disable the warmup step that runs the model on an empty prompt when the server starts.

Motivation

I am using llamafile on AWS Lambda behind the Lambda Web Adapter. The lazier I can be during init, the better my chances of getting a working instance up without hitting various 10-second timeout issues.

Possible Implementation

Not sure; I'm thinking a --warmup=false server option would be helpful.

metaskills commented 2 months ago

A solution could be to allow a --warmup CLI option when starting the server.

jart commented 2 months ago

Sure, we can remove that warmup. Thanks for bringing it to my attention. We also have the mmap() progress bar, called "schlep". As you probably figured out, that can be disabled using --log-disable when using a terminal. It'll also be auto-disabled if you're using a GPU. Lastly, it won't happen if stderr is redirected to a file.
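
For example, either of these keeps the schlep bar out of your output (a sketch; the binary name, model path, and prompt are placeholders):

llamafile -m model.gguf --cli -p 'hello' -n 1 --log-disable    # explicit flag
llamafile -m model.gguf --cli -p 'hello' -n 1 2>llama.log      # stderr redirected to a file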

jart commented 2 months ago

Here's 453ms latency to generate a single token using a 14 gig model:

main jart@luna:~/llamafile$ rusage o//llama.cpp/main/main -m /weights/Mistral-7B-Instruct-v0.3.BF16.gguf --cli -n 1 --log-disable --temp 0 --special
<s> Question
took 449,432µs wall time
ballooned to 13,960,704kb in size
needed 2,054,365us cpu (50% kernel)
caused 238,571 page faults (100% memcpy)
38 context switches (52% consensual)

llamafile doesn't have much overhead of its own. It goes down to 78 milliseconds if you use a 20mb model, like this embedding model.

main jart@luna:~/llamafile$ rusage o//llama.cpp/main/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf -p orange --embedding --log-disable
-0.0317844 0.0300445 -0.0606566 0.054171 0.0384351 -0.0137378 0.193084 -0.0227727 0.0451493 -0.0866957 -0.0250678 -0.0252941 -0.0158448 -0.00979887 -0.0138599 0.0655941 0.0220919 -0.0199303 -0.145232 -0.0558298 -0.0498061 -0.00320967 -0.0345226 0.00696317 -0.106107 -0.0237895 0.0167643 0.0300393 -0.0331833 -0.101823 -0.0089697 0.00977722 0.0802716 -0.0223554 -0.0767943 -0.118327 0.0326562 -0.0308867 -0.0448472 0.0148508 -0.020563 -0.0048809 -0.00607477 0.033355 -0.00568612 0.0160373 -0.0122561 0.0225163 -0.00218899 0.0412294 0.00167923 -0.0129846 -0.0715167 -0.0263407 -0.0436248 0.0330241 -0.0337614 -0.014304 -0.0256707 0.0312791 0.0919942 -0.0440335 -0.0308184 0.0524109 -0.0132108 -0.00422794 0.0116842 -0.0541004 0.00804723 -0.0890174 0.00635702 0.0603622 0.0673633 0.0718813 0.0386784 -0.00149608 0.0865788 0.0419986 0.0461259 0.00387685 -0.0250578 0.0297764 -0.145872 0.0302247 0.0188123 0.0541922 -0.084763 0.000495586 -0.0519527 0.136152 -0.0577144 0.064632 0.0753918 -0.00302461 -0.033198 0.0183073 0.0113617 -0.0648654 -0.0638555 0.209148 0.00588307 0.0680787 -0.0403727 0.00932277 -0.0203395 -0.0937304 -0.0404827 -0.0684583 -0.00795326 0.0390955 0.0279294 -0.0403867 -0.0224484 -0.0226345 0.0156785 0.0117315 -0.000392398 0.0279004 -0.0213723 0.0498216 0.0261546 0.00646406 -0.0591086 0.0122234 -0.0749251 -0.0659139 0.0215466 -4.85664e-33 0.0722578 0.00451963 0.0429557 0.00854541 0.0581629 0.0312819 -0.0347967 -0.0745112 -0.00487645 0.0262612 -0.0616393 0.0491644 -0.0723496 -0.0175136 0.14027 0.0501096 0.0224043 0.0811346 -0.0895 0.000163827 -0.0587726 0.0999836 -0.0246047 0.0316603 -0.0484162 -0.00111274 -0.0403415 -0.0432963 -0.0156516 0.0201045 0.00339398 -0.00601431 0.042451 0.0558 -0.0327016 -0.137818 0.0116949 -0.0490471 0.0666016 0.0596681 -0.0255069 0.0230668 -0.0215572 0.111168 -0.0294185 -0.0508142 0.0644108 0.0266289 0.02384 0.000905127 -0.0861847 -0.0485339 0.00055331 0.018121 -0.0377669 -0.0166628 0.0168942 0.0540773 -0.0048163 -0.0336656 -0.0210025 0.083861 0.00834526 0.00762523 -0.0995033 -0.0281909 -0.091662 0.0011698 0.0325173 -0.0241976 -0.00604925 -0.0168084 0.117599 -0.0256133 -0.0496626 -0.0493583 0.0663069 -0.00879042 0.0143084 0.00601507 -0.0757068 -0.015899 -0.0276441 0.0453369 -0.046315 0.0715787 0.0178674 0.0554027 0.0236224 0.0236353 -0.0953411 0.0124677 0.0258042 -0.0667922 -0.0250962 3.27808e-33 0.0181095 -0.0221881 0.0096204 0.0479727 0.0456671 -0.0496416 -0.0355792 0.0309866 -0.0438628 -0.0392589 -0.0121348 -0.0275254 0.0289592 0.0530935 0.0545679 0.0793105 0.0716763 0.109363 0.0232857 0.0415179 -0.106621 -0.0469415 -0.0183794 0.0187144 0.000143419 0.0379738 0.0275034 -0.055134 -0.0616683 0.0170081 0.0217906 -0.0095864 0.0260517 0.0132381 0.0387192 0.0527441 0.120446 -0.127352 -0.0642029 0.0661494 0.0251513 -0.0708307 0.0168207 0.120362 0.00344404 0.0357786 -0.0691282 0.0378288 0.0465292 0.00535672 0.0610712 0.0041889 -0.0159695 0.0470248 -0.00576334 -0.0536096 -0.0681606 0.0330785 -0.0455144 0.0717353 0.01558 -0.0110995 -0.0111986 0.0222279 -0.0252177 -0.0103118 0.0438924 0.0341115 -0.0553085 -0.0553117 0.0481457 -0.0222958 -0.0129041 0.0200907 0.0567879 -0.0727229 -0.0202945 -0.00372828 -0.0258849 0.0817826 -0.119873 -0.00585168 0.00410673 0.0358944 -0.0016022 0.0395851 0.0257963 0.0423478 0.0191146 -0.0615523 -0.0118891 0.0438561 0.00813753 0.0490666 -0.0648762 -1.31135e-08 0.085139 -0.0613255 0.00293221 -0.0022764 -0.0085511 0.0158048 -0.080317 0.00675965 -0.013794 0.015857 0.0227918 0.0521703 0.0151762 0.0422474 0.0259102 -0.0472203 -0.0490228 -0.0133137 
0.0033711 0.0440844 -0.0903427 0.0440017 -0.0182679 0.0130034 0.0229458 0.00426531 -0.0213828 0.0760957 0.045364 0.0635367 0.0328951 0.0635172 -0.0515487 0.0257204 -0.0843814 -0.0293494 0.0484707 0.0288854 -0.0666442 0.00361252 -0.0132014 -0.0146651 -0.0566963 -0.0143876 -0.0168855 0.0128348 -0.0275001 -0.00356255 -0.0821031 -0.0766623 -0.0118402 0.0385588 -0.00278496 0.104344 0.0234341 -0.0598427 0.0483119 -0.00854453 -0.0668312 0.0183582 0.0972029 0.0190171 0.0534545 0.00417919
took 78,670µs wall time
ballooned to 24,576kb in size
needed 2,658,928us cpu (81% kernel)
caused 9,054 page faults (89% memcpy)
1,074 context switches (97% consensual)
performed 21,904 read and 0 write i/o operations

I hope this helps. Reach out to me with anything you need. You're also welcome to direct message me on Discord (see the link in the README) if you don't hear back on anything in three business days.

metaskills commented 2 months ago

As you probably figured out, that can be disabled using --log-disable when using a terminal.

This was confusing to me because I am starting my server process using this command:

/opt/gemma-2-9b-it.Q2_K.llamafile \
  --nobrowser \
  --log-disable \
  --host 127.0.0.1 \
  --port 8080

And I can definitely see the warmup in the logs, and the slowdown.

(Screenshot: server startup logs, 2024-07-04 at 4:53 PM)

jart commented 2 months ago

83 seconds to load a 3.6gb file. Do you have a 5400 rpm disk connected over gigabit ethernet? You're going to pay that cost no matter what you do. If it doesn't happen in the warmup, then it's going to happen to the first client you encounter over the network. The fact that you're using AWS Lambda means that the OS file system won't be able to amortize this cost.

I've reverted the change I just made because after seeing more details, I'm reasonably certain that removing warmup will be detrimental to the service you're building. If you can help me understand better what you'd like to see happen and how that's actionable on llamafile, then please let me know!

metaskills commented 2 months ago

83 seconds to load a 3.6gb file. Do you have a 5400 rpm disk connected over gigabit ethernet?

lol (✖╭╮✖)

FWIW, it varies wildly. I've seen 20s too. It might be low ulimits on memory. I'm just guessing.

then it's going to happen to the first client you encounter over the network.

Which is actually desirable. If the server can be ready quicker, I can use provisioned concurrency. But because the server takes more than 10s to boot, I'm blocked from that. Does that help?

jart commented 2 months ago

I don't know what provisioned concurrency is. But I'd assume that, with warmup removed, you would have some other system send the warmup request automatically, and then you'd block any user traffic from hitting the server until that request completed. Otherwise, if the first user request takes 85 seconds, your tail latency is going to be on the moon.
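
In shell terms, that pattern is roughly this (a sketch; it assumes the llama.cpp-style /health and /completion endpoints, and the model path is a placeholder):

/opt/model.llamafile --server --nobrowser --port 8080 &
until curl -sf http://127.0.0.1:8080/health >/dev/null; do sleep 0.2; done    # wait for the listener
curl -s http://127.0.0.1:8080/completion -d '{"prompt": "", "n_predict": 1}' >/dev/null   # manual warmup request
# ...only then let user traffic through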

metaskills commented 2 months ago

The Lambda runtime and invoke model are odd; see the image below. The goal for web services (anything in the INIT phase) is to start in 10s or less. In that runtime we do not have the tools common in K8s that could do another warmup before traffic hits the service. So the first user request will take the perf hit and, yes, tail latency will be high. However, I've seen those outliers disappear with enough traffic.

I'm not trying to build an inference architecture on Lambda, yet. I'd just like to showcase Llamafile on Lambda and get folks thinking about the portability of this project. I think I can get it working if the warmup happened outside of the server start process. That way the HTTP proxy (another Lambda Extension) can connect to it during the INIT phase in under 10s. My hope is that the change / feature request makes sense for everyone, not just for the wonky things I'm doing. Hope that helps?

(Image: diagram of the AWS Lambda runtime / invoke model)

jart commented 2 months ago

You have the opportunity to be the first person to productionize the brand new llamafile server v.2.0 that I'm working on. So far it has an /embedding endpoint. Embedding models are tiny. Here's a good one you can use that's 22mb:

all-MiniLM-L6-v2.Q6_K.gguf.zip

You can then say:

make -j32 o//llamafile/server/main
o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf &
curl http://127.0.0.1:8080/embedding?prompt=hello+world

And you get:

{
  "add_special": true,
  "parse_special": false,
  "tokens_provided": 4,
  "tokens_used": 4,
  "embedding": [-0.04493919, 0.017154299, -0.017763417, 0.07448399, -0.06355986, -0.06065534, 0.13253167, 0.081204176, 0.008555814, 0.03046815, 0.04551083, 0.10860967, -0.016242307, 0.004101579, -0.041045796, 0.0113923745, -0.04387224, 0.037455086, -0.06516868, -0.030158972, -0.014199414, 0.044375688, 0.019663189, -0.014969509, -0.049472645, -0.03768263, 0.050107747, -0.009327268, -0.053667273, -0.01327585, 0.043741602, 0.053064425, 0.0190503, -0.0016131222, -0.116672374, 0.060765408, -0.039683133, -0.061072234, 0.066110305, 0.057212673, 0.010475249, -0.035049397, -0.080343835, -0.09146147, 0.09561678, -0.019614108, 0.012206479, 0.028192922, 0.05992163, 0.029720828, -0.0771741, -0.06288391, -0.020098347, -0.010743567, 0.0459911, 0.025859432, 0.039434757, -0.05264431, 0.010904848, -0.028012881, 0.02374525, 0.062041078, -0.07571215, 0.13744053, 0.021624766, -0.08648439, -0.002497034, -0.07037027, -0.10528944, -0.07108713, 0.011088633, -0.011481084, -0.028293252, -0.007890491, -0.005932442, -0.012344286, -0.0109595535, -0.0032645876, -0.0023649456, 0.018673459, -0.08769426, -0.002446007, 0.01252563, 0.0018258023, -0.029998625, -0.03557183, 0.08785064, -0.032543737, -0.042549286, 0.021124562, -0.11301933, 0.034332976, -0.004828803, 0.016387476, 0.012188256, 0.0058172783, 0.03591154, 0.048713926, -0.08914432, -0.1097462, 0.04073567, 0.01997938, 0.074975684, 0.062538765, 0.017494302, -0.018181488, 0.03946002, 0.040420637, -0.008207352, -0.018139018, 0.013168835, -0.0324692, -0.008483407, 0.06435668, 0.053918466, 0.033186734, -0.048995394, -0.024699224, -0.045104366, -0.028299782, -0.025086576, 0.0053605135, -0.0030724953, 0.007501989, 0.00075993023, -0.025861925, 0.013164538, -8.42789e-33, -0.03751949, -0.00060935406, 0.01818463, 0.060445886, 0.057249576, -0.00454838, -0.087184176, -0.024801785, 0.036835834, 0.018379813, 0.038673427, -0.04266556, -0.12507263, 0.101727076, 0.037376035, 0.08824377, -0.018737393, -0.015729554, 0.09533349, 0.075878516, -0.062962264, 0.03401959, -0.012127425, 0.113746956, -0.039824422, -0.0041125747, 0.017515982, -0.013790555, 0.07989204, 0.025105074, -0.07371306, -0.011674851, 0.022027943, 0.06509151, -0.009197994, -0.008043244, -0.023458103, -0.08415743, -0.03963684, 0.013594596, -0.05823141, 0.0067144446, -0.009583412, 0.006321735, 0.024771461, 0.023584427, 0.044459235, -0.0110342465, 0.02349242, 0.0010847435, 0.059136067, 0.0012264845, 0.03378655, 0.049129665, -0.03489447, -0.014035115, 0.038949408, -0.10317868, -0.013410221, 0.012336223, -0.026134353, 0.061911713, -0.08584544, -0.024133861, -0.1194289, -0.029716695, 0.053951215, 0.09612765, 0.1082146, -0.0017689393, -0.10728463, 0.002305323, 0.052542552, -0.09490232, -0.017203419, -0.064931914, -0.041796625, 0.015041223, 0.09748269, -0.031595428, 0.001543152, -0.056436412, -0.024242735, -0.04094636, 0.12323862, 0.026265696, 0.03147266, -0.027309164, 0.046299897, -0.104778655, -0.033836756, 0.08807632, -0.053457998, 0.027746594, -0.11549485, 6.4928996e-33, 0.0657978, 0.040416345, -0.042737156, 0.01740134, 0.06335029, -0.018121772, -0.028165279, 0.07730383, -0.08981612, -0.033933606, -0.012839413, -0.018129889, 0.14099182, 0.023489056, 0.052748263, 0.043177053, 0.10888425, 0.01922531, 0.007142953, -0.007967073, -0.024551447, -0.03320833, -0.023929968, 0.032950286, -0.07368368, 0.019020025, 0.07963569, -0.02202817, -0.05426177, 0.008600079, 0.034788147, -0.008968444, -0.048602484, -0.12428896, -0.0013823424, 0.028156962, 0.003410018, 0.052016187, -0.012831952, -0.028521812, 0.017593833, -0.043004576, -0.02403608, 
0.1564648, -0.031477723, -0.030975487, 0.0029324866, 0.026550235, 0.005813947, -0.07989965, -0.07071022, 0.034464963, -0.037894756, 0.086976506, -0.048087597, 0.062040742, 0.04788673, 0.004470726, 0.05594446, -0.015565888, 0.008612556, 0.023244102, 0.007845828, -0.014502636, 0.045052946, 0.0045669237, -0.05205902, -0.10034152, -0.060900856, -0.014170271, 0.033324067, -0.028532451, 0.044034753, 0.07135723, -0.06429541, -0.005717306, -0.10879912, 0.014676974, 0.036444712, -0.10165959, 0.008652384, -0.03574301, 0.011172098, 0.036123447, -0.0013997691, 0.007782806, 0.09904637, 0.049254775, 0.017006103, -0.06283049, -0.04298161, -0.022788929, 0.09389584, 0.016311547, -0.00571772, -9.76152e-9, -0.03910692, -0.03195565, -0.028496021, 0.04153002, 0.06418562, 0.052059688, -0.038764857, -0.02461702, -0.03176211, 0.019020135, 0.04780212, 0.09311188, -0.004919143, 0.026526093, 0.08771144, 0.03910133, 0.019983923, 0.013921658, -0.03573494, -0.026665179, -0.0009486654, -0.024113627, 0.038966633, -0.020944824, 0.010889862, -0.016214907, -0.02407945, 0.14222126, -0.015377086, -0.00013050997, 0.05477597, 0.05412229, -0.030746434, -0.027177932, -0.021219095, 0.08363759, 0.03324958, -0.008774529, 0.01513348, 0.029486988, -0.0334652, -0.004751453, 0.024872666, -0.037752517, -0.03377062, -0.017468425, -0.010234554, -0.02362223, 0.0735807, 0.0053404793, -0.04306921, 0.011759448, 0.00729336, 0.086781636, 0.07198262, 0.07773528, 0.029469218, -0.020530857, -0.009965061, 0.059326723, 0.023873243, -0.012661998, 0.0770656, -0.030757705]
}
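
If you want to pull just the vector out of that in a script, something like this should work (assuming you have jq; it's not part of llamafile):

curl -s 'http://127.0.0.1:8080/embedding?prompt=hello+world' | jq '.embedding'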

Embeddings are even better than LLMs. Serving them is a perfect fit for AWS Lambda.

jart commented 2 months ago

Oh, there's also a tokenization endpoint:

jtunn@gothbox:~$ curl http://127.0.0.1:8080/tokenize?prompt=hello+world
{
  "add_special": true,
  "parse_special": false,
  "tokens": [
    "[CLS]",
    " hello",
    " world",
    "[SEP]"
  ]
}

metaskills commented 2 months ago

o//llamafile/server/main -m /weights/all-MiniLM-L6-v2.Q6_K.gguf &

As an aside, is there a preferred way to run a --server: with the .llamafile itself, or with the llamafile binary? I'm guessing the latter lets me use unreleased features with existing .llamafiles? Also, thanks so much for all your help. I've run a few large open source projects and we users can be a chore. Appreciate ya.

jart commented 2 months ago

o//llamafile/server/main has to be built from source. In the future, it'll be called llamafile --server, but right now it's a separate binary that's independent of our releases. The current llamafile --server came from a folder named "examples" in llama.cpp. I'm designing this new server so that it'll do a better job of supporting production environments such as yours. Why be a customer of OpenAI when you can build your own OpenAI? That's the dream, and your support, simply using it and reporting bugs, would help the project get there.

metaskills commented 2 months ago

BTW, if I were to build llamafile locally with the commit where warmup was disabled and it worked perfectly, would you consider reopening this issue or providing a --warmup=false option?

metaskills commented 2 months ago

@jart The no warmup option worked very well. Could we please reopen this issue? Here is what I did.

Custom Llamafile with Docker

Build and install a llamafile executable at commit 21a30bed, the commit where warmup was disabled. This is the Docker image that gets deployed to Lambda. I'm sure this could have been done differently, but I just wanted to quickly test whether no warmup worked well.

FROM node:20 as llamafile-builder
RUN mkdir /opt/local && \
    cd /tmp && \
    git clone https://github.com/Mozilla-Ocho/llamafile.git && \
    cd llamafile && \
    git checkout 21a30bed && \
    make && \
    make install PREFIX=/opt/local

FROM node:20
# ...
COPY --from=llamafile-builder /opt/local /usr/local
# ...
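
Building the image is then a standard docker build (the tag is just a placeholder):

docker build -t llamafile-lambda .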

Lambda Runtime Init

Nothing special here. We launch the llamafile via the CMD or ENTRYPOINT, much like any other workload. The key difference for Lambda is that our liveness/INIT phase needs to be < 10s total. For the patched llamafile I moved from using the .llamafile as an all-in-one executable...

/opt/gemma-2-9b-it.Q2_K.llamafile --server ...

... to using the llamafile executable. I even tested this with Gemma 2's .gguf file from bartowski's Hugging Face page. Same results; it worked great.

llamafile --model /opt/gemma-2-9b-it.Q2_K.llamafile --server ...
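
For reference, the container entrypoint ends up being a small wrapper script along these lines (a sketch; the script is hypothetical and the flags mirror the server command shown below):

#!/bin/sh
# hypothetical entrypoint (run.sh) for the patched, no-warmup build
exec llamafile \
  --model /opt/gemma-2-9b-it.Q2_K.llamafile \
  --server \
  --nobrowser \
  --log-disable \
  --host 127.0.0.1 \
  --port 8080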

Final Reasoning

I believe that llamafile should have distinct "liveness" and "readiness" states for the server. These states are normal for containerized workloads, and the names are pulled directly from K8s here. It is my hope that, even though my work is around Lambda, these distinct behaviors are not myopic to one compute platform and are generally beneficial. Ideally the server command would then look like this:

/opt/gemma-2-9b-it.Q2_K.llamafile \
  --server \
  --warmup false \
  --nobrowser \
  --log-disable \
  --host 127.0.0.1 \
  --port 8080

metaskills commented 2 months ago

Here is a suggestion on how to get there. These steps could be done all at once or as needed (a sketch of how they'd fit together follows the list).

1. Allow a new --warmup false server option. Keep the default as true.
2. Expose a /warmup server endpoint. Optionally, warmup could be part of the health check.
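
In curl terms (the --warmup syntax is the one proposed above and the /warmup endpoint doesn't exist yet, so treat both as illustrations, not real options):

llamafile --model /opt/model.gguf --server --warmup false --port 8080 &   # 1) boot without warming
curl -s -X POST http://127.0.0.1:8080/warmup                              # 2) warm explicitly once the listener is up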

jart commented 2 months ago

Sure I can do that. I'll just add it as a flag, as you suggested, since I think doing the warmup is good in general.

jart commented 2 months ago

Thanks for your patience. I've added the warmup flag. Let me know if there are any issues with it. As for the warmup endpoint, could you just try that by sending a normal request? Contributions welcome if you want /warmup, since I'm not 100% certain it's needed.

jart commented 2 months ago

Also, this change just got shipped in our 0.8.12 release. Enjoy!

metaskills commented 2 months ago

Thank you @jart. Seems to be working as expected. I updated my blog post with those notes as well. Appreciate it.