InternLM / lmdeploy

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
https://lmdeploy.readthedocs.io/en/latest/
Apache License 2.0

[Bug] v0.5.0 crashes with CUDA OOM error while v0.4.2 does not (in exactly the same scenario - 30 concurrent requests to LLama2 70B) #1943

Open josephrocca opened 1 month ago

josephrocca commented 1 month ago

Describe the bug

The v0.4.2 official docker image could handle many concurrent requests without crashing, but v0.5.0 cannot. It crashes with:

CUDA runtime error: out of memory /opt/lmdeploy/src/turbomind/utils/allocator.h:231 

Right at the moment that requests start failing, this is what the end of the DEBUG logs looks like:

And here are the last 10k lines of the DEBUG logs at the moment the "CUDA runtime error: out of memory" line appears (linked under "Error traceback" below):

Reproduction

I reproduced this on a 1xA40 machine and a 2x4090 machine. Both work fine with openmmlab/lmdeploy:v0.4.2, and both fail with openmmlab/lmdeploy:v0.5.0.

Command:

lmdeploy serve api_server lmdeploy/llama2-chat-70b-4bit --server-port 3000 --tp $(nvidia-smi -L | wc -l) --session-len 8192 --model-format awq --model-name lmdeploy/llama2-chat-70b-4bit --enable-prefix-caching --quant-policy 4 --log-level INFO

Send 30 concurrent requests. Note that in the code below I prepend the loop index i to the prompt so that every request starts with a different prefix, which prevents prefix caching from being used. If you don't do that, it doesn't crash.

for(let i = 0; i < 30; i++) {
  fetch("http://127.0.0.1:3000/v1/completions", {
    "headers": {
      "content-type": "application/json",
    },
    "body": JSON.stringify({"model":"lmdeploy/llama2-chat-70b-4bit","max_tokens":1024,"temperature":0.8,"top_k":40,"top_p":0.9,"repetition_penalty":1,"min_p":0,"stream":true,"stop":[],"include_stop_str_in_output":true,"prompt":i+'THE SONNETS\n\n                    1\n\nFrom fairest creatures we desire increase,\nThat thereby beauty’s rose might never die,\nBut as the riper should by time decease,\nHis tender heir might bear his memory:\nBut thou contracted to thine own bright eyes,\nFeed’st thy light’s flame with self-substantial fuel,\nMaking a famine where abundance lies,\nThyself thy foe, to thy sweet self too cruel:\nThou that art now the world’s fresh ornament,\nAnd only herald to the gaudy spring,\nWithin thine own bud buriest thy content,\nAnd, tender churl, mak’st waste in niggarding:\n  Pity the world, or else this glutton be,\n  To eat the world’s due, by the grave and thee.\n\n\n                    2\n\nWhen forty winters shall besiege thy brow,\nAnd dig deep trenches in thy beauty’s field,\nThy youth’s proud livery so gazed on now,\nWill be a tattered weed of small worth held:\nThen being asked, where all thy beauty lies,\nWhere all the treasure of thy lusty days;\nTo say, within thine own deep sunken eyes,\nWere an all-eating shame, and thriftless praise.\nHow much more praise deserv’d thy beauty’s use,\nIf thou couldst answer ‘This fair child of mine\nShall sum my count, and make my old excuse,’\nProving his beauty by succession thine.\n  This were to be new made when thou art old,\n  And see thy blood warm when thou feel’st it cold.\n\n\n                    3\n\nLook in thy glass and tell the face thou viewest,\nNow is the time that face should form another,\nWhose fresh repair if now thou not renewest,\nThou dost beguile the world, unbless some mother.\nFor where is she so fair whose uneared womb\nDisdains the tillage of thy husbandry?\nOr who is he so fond will be the tomb\nOf his self-love to stop posterity?\nThou art thy mother’s glass and she in 
thee\nCalls back the lovely April of her prime,\nSo thou through windows of thine age shalt see,\nDespite of wrinkles this thy golden time.\n  But if thou live remembered not to be,\n  Die single and thine image dies with thee.\n\n\n                    4\n\nUnthrifty loveliness why dost thou spend,\nUpon thyself thy beauty’s legacy?\nNature’s bequest gives nothing but doth lend,\nAnd being frank she lends to those are free:\nThen beauteous niggard why dost thou abuse,\nThe bounteous largess given thee to give?\nProfitless usurer why dost thou use\nSo great a sum of sums yet canst not live?\nFor having traffic with thyself alone,\nThou of thyself thy sweet self dost deceive,\nThen how when nature calls thee to be gone,\nWhat acceptable audit canst thou leave?\n  Thy unused beauty must be tombed with thee,\n  Which used lives th’ executor to be.\n\n\n                    5\n\nThose hours that with gentle work did frame\nThe lovely gaze where every eye doth dwell\nWill play the tyrants to the very same,\nAnd that unfair which fairly doth excel:\nFor never-resting time leads summer on\nTo hideous winter and confounds him there,\nSap checked with frost and lusty leaves quite gone,\nBeauty o’er-snowed and bareness every where:\nThen were not summer’s distillation left\nA liquid prisoner pent in walls of glass,\nBeauty’s effect with beauty were bereft,\nNor it nor no remembrance what it was.\n  But flowers distilled though they with winter meet,\n  Leese but their show, their substance still lives sweet.\n\n\n                    6\n\nThen let not winter’s ragged hand deface,\nIn thee thy summer ere thou be distilled:\nMake sweet some vial; treasure thou some place,\nWith beauty’s treasure ere it be self-killed:\nThat use is not forbidden usury,\nWhich happies those that pay the willing loan;\nThat’s for thyself to breed another thee,\nOr ten times happier be it ten for one,\nTen times thyself were happier than thou art,\nIf ten of thine ten times refigured thee:\nThen what 
could death do if thou shouldst depart,\nLeaving thee living in posterity?\n  Be not self-willed for thou art much too fair,\n  To be death’s conquest and make worms thine heir.\n\n\n                    7\n\nLo in the orient when the gracious light\nLifts up his burning head, each under eye\nDoth homage to his new-appearing sight,\nServing with looks his sacred majesty,\nAnd having climbed the steep-up heavenly hill,\nResembling strong youth in his middle age,\nYet mortal looks adore his beauty still,\nAttending on his golden pilgrimage:\nBut when from highmost pitch with weary car,\nLike feeble age he reeleth from the day,\nThe eyes (fore duteous) now converted are\nFrom his low tract and look another way:\n  So thou, thyself out-going in thy noon:\n  Unlooked on diest unless thou get a son.\n\n\n                    8\n\nMusic to hear, why hear’st thou music sadly?\nSweets with sweets war not, joy delights in joy:\nWhy lov’st thou that which thou receiv’st not gladly,\nOr else receiv’st with pleasure thine annoy?\nIf the true concord of well-tuned sounds,\nBy unions married do offend thine ear,\nThey do but sweetly chide thee, who confounds\nIn singleness the parts that thou shouldst bear:\nMark how one string sweet husband to another,\nStrikes each in each by mutual ordering;\nResembling sire, and child, and happy mother,\nWho all in one, one pleasing note do sing:\n  Whose speechless song being many, seeming one,\n  Sings this to thee, ‘Thou single wilt prove none’.\n\n\n                    9\n\nIs it for fear to wet a widow’s eye,\nThat thou consum’st thyself in single life?\nAh, if thou issueless shalt hap to die,\nThe world will wail thee like a makeless wife,\nThe world will be thy widow and still weep,\nThat thou no form of thee hast left behind,\nWhen every private widow well may keep,\nBy children’s eyes, her husband’s shape in mind:\nLook what an unthrift in the world doth spend\nShifts but his place, for still the world enjoys it;\nBut beauty’s waste 
hath in the world an end,\nAnd kept unused the user so destroys it:\n  No love toward others in that bosom sits\n  That on himself such murd’rous shame commits.\n\n\n                    10\n\nFor shame deny that thou bear’st love to any\nWho for thyself art so unprovident.\nGrant if thou wilt, thou art beloved of many,\nBut that thou none lov’st is most evident:\nFor thou art so possessed with murd’rous hate,\nThat ’gainst thyself thou stick’st not to conspire,\nSeeking that beauteous roof to ruinate\nWhich to repair should be thy chief desire:\nO change thy thought, that I may change my mind,\nShall hate be fairer lodged than gentle love?\nBe as thy presence is gracious and kind,\nOr to thyself at least kind-hearted prove,\n  Make thee another self for love of me,\n  That beauty still may live in thine or thee.\n\n\n                    11\n\nAs fast as thou shalt wane so fast thou grow’st,\nIn one of thine, from that which thou departest,\nAnd that fresh blood which youngly thou bestow’st,\nThou mayst call thine, when thou from youth convertest,\nHerein lives wisdom, beauty, and increase,\nWithout this folly, age, and cold decay,\nIf all were minded so, the times should cease,\nAnd threescore year would make the world away:\nLet those whom nature hath not made for store,\nHarsh, featureless, and rude, barrenly perish:\nLook whom she best endowed, she gave thee more;\nWhich bounteous gift thou shouldst in bounty cherish:\n  She carved thee for her seal, and meant thereby,\n  Thou shouldst print more, not let that copy die.\n\n\n                    12\n\nWhen I do count the clock that tells the time,\nAnd see the brave day sunk in hideous night,\nWhen I behold the violet past prime,\nAnd sable curls all silvered o’er with white:\nWhen lofty trees I see barren of leaves,\nWhich erst from heat did canopy the herd\nAnd summer’s green all girded up in sheaves\nBorne on the bier with white and bristly beard:\nThen of thy beauty do I question make\nThat thou among the 
wastes of time must go,\nSince sweets and beauties do themselves forsake,\nAnd die as fast as they see others grow,\n  And nothing ’gainst Time’s scythe can make defence\n  Save breed to brave him, when he takes thee hence.\n\n\n                    13\n\nO that you were your self, but love you are\nNo longer yours, than you yourself here live,\nAgainst this coming end you should prepare,\nAnd your sweet semblance to some other give.\nSo should that beauty which you hold in lease\nFind no determination, then you were\nYourself again after yourself’s decease,\nWhen your sweet issue your sweet form should bear.\nWho lets so fair a house fall to decay,\nWhich husbandry in honour might uphold,\nAgainst the stormy gusts of winter’s day\nAnd barren rage of death’s eternal cold?\n  O none but unthrifts, dear my love you know,\n  You had a father, let your son say so.\n\n\n                    14\n\nNot from the stars do I my judgement pluck,\nAnd yet methinks I have astronomy,\nBut not to tell of good, or evil luck,\nOf plagues, of dearths, or seasons’ quality,\nNor can I fortune to brief minutes tell;\nPointing to each his thunder, rain and wind,\nOr say with princes if it shall go well\nBy oft predict that I in heaven find.\nBut from thine eyes my knowledge I derive,\nAnd constant stars in them I read such art\nAs truth and beauty shall together thrive\nIf from thyself, to store thou wouldst convert:\n  Or else of thee this I prognosticate,\n  Thy end is truth’s and beauty’s doom and date.\n\n\n                    15\n\nWhen I consider everything that grows\nHolds in perfection but a little moment.\nThat this huge stage presenteth nought but shows\nWhereon the stars in secret influence comment.\nWhen I perceive that men as plants increase,\nCheered and checked even by the self-same sky:\nVaunt in their'}),
    "method": "POST",
  }).then(r => r.text()).then(text => console.log(text));
}
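For anyone who wants to count how many of the concurrent requests actually fail, the loop above can be restructured roughly as follows. This is only a sketch, not part of the original repro: the names buildBodies, sendAll and fetchImpl are hypothetical helpers, the endpoint is passed in by the caller, and the request parameters are copied from the loop above.

```javascript
// Build one /v1/completions payload per request. Prefixing the index makes
// every prompt start differently, so no two requests share a cacheable prefix.
function buildBodies(count, basePrompt, model) {
  return Array.from({ length: count }, (_, i) => ({
    model,
    max_tokens: 1024,
    temperature: 0.8,
    top_k: 40,
    top_p: 0.9,
    repetition_penalty: 1,
    min_p: 0,
    stream: true,
    stop: [],
    include_stop_str_in_output: true,
    prompt: `${i}${basePrompt}`,
  }));
}

// Fire all requests at once and tally successes and failures.
// fetchImpl is injectable (hypothetical parameter) so the function can be
// exercised without a running server; it defaults to the global fetch.
async function sendAll(endpoint, bodies, fetchImpl = fetch) {
  const results = await Promise.allSettled(
    bodies.map(async (body) => {
      const res = await fetchImpl(endpoint, {
        method: "POST",
        headers: { "content-type": "application/json" },
        body: JSON.stringify(body),
      });
      if (!res.ok) throw new Error(`HTTP ${res.status}`);
      return res.text(); // collects the whole streamed response as text
    })
  );
  const failed = results.filter((r) => r.status === "rejected").length;
  return { ok: results.length - failed, failed };
}
```

To use it, call sendAll with the same URL as in the loop above and the output of buildBodies(30, prompt, "lmdeploy/llama2-chat-70b-4bit"), then log the returned { ok, failed } counts. Promise.allSettled is used so that one rejected request does not hide the outcome of the others.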

Environment

(NVIDIA A40 Runpod machine with official Docker image: openmmlab/lmdeploy:v0.5.0)

sys.platform: linux
Python: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 2147483648
GPU 0: NVIDIA A40
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
GCC: x86_64-linux-gnu-gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
PyTorch: 2.1.0+cu118
PyTorch compiling details: PyTorch built with:
  - GCC 9.3
  - C++ Version: 201703
  - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - LAPACK is enabled (usually provided by MKL)
  - NNPACK is enabled
  - CPU capability usage: AVX512
  - CUDA Runtime 11.8
  - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_90,code=sm_90
  - CuDNN 8.7
  - Magma 2.6.1
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field -Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, 

TorchVision: 0.16.0+cu118
LMDeploy: 0.5.0+4cb3854
transformers: 4.42.3
gradio: 3.50.2
fastapi: 0.111.0
pydantic: 2.7.4
triton: 2.1.0

Error traceback

Last 10k lines: https://gist.github.com/josephrocca/3686c80f508a939dcf14c598b55db2b3

Last 300 lines:

2024-07-07T16:38:55.922607124Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922611022Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922614869Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922618794Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922622652Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922626546Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922630446Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922634306Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922638849Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922642704Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922646734Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922650683Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922654523Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922658426Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922662439Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922666296Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922670436Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922674332Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922678182Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922682059Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922685903Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922689843Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922693689Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922697533Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922701532Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922705366Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922709323Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922713166Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922720683Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922724696Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922728626Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922732573Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922736772Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922740673Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922745009Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922748953Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922752819Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922757989Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922761941Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922765769Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922769781Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922773666Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922778678Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922783763Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922789951Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922796183Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922802645Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922808825Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922815643Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922822188Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922830662Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922837295Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922843282Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922848771Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922854032Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922859532Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922864735Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922869891Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922875115Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922880488Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922886668Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922892211Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922897162Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922902562Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922908068Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922913415Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922923222Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922928745Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922934252Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922939802Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922945357Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922950955Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922956562Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922962195Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.922967420Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.922972862Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.922978392Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.922985177Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923240310Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.923252356Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923260709Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923269450Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923275530Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923282443Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.923288755Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923295965Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923301738Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923307693Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923313400Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.923319315Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923325098Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923330550Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923336349Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923341569Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.923347818Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923353885Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923360615Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923366599Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923372938Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.923379298Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923385848Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923392095Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923398142Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923404525Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923410705Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923423929Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923430412Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923436987Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923443587Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923454432Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923459029Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923463284Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923467342Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923471287Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923475227Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923479257Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923483209Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923487391Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923491289Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923495661Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923499849Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923503849Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923507661Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923511581Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923515507Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923519418Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923523311Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923527174Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923531421Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923535368Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923539251Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923543217Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923547047Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923551078Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923555001Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923559368Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923563528Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923567537Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923571454Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923575678Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923579558Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923583484Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923590424Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923594497Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923598758Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923602671Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923606618Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923610458Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923614304Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923618334Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923622161Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923626028Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923630003Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923633846Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923637798Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923641638Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923645558Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923649406Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923653253Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923657690Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923661840Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923665758Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923669586Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923673418Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923677380Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923681227Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923685097Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923688940Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923692877Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923696800Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923700646Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923704566Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923708400Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923712216Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923716176Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923719990Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923724050Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923728067Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923735173Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923739296Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923743200Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923747360Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923751203Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923756330Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923760473Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923764360Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923768266Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923772117Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923775967Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923779907Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923783763Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923787650Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923791637Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923795477Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923799590Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923803420Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923807447Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923811737Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923815565Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923819502Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923823472Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923827399Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923831325Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923835172Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923839117Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923842939Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923846852Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923850686Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923855486Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923859612Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923863426Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923867406Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923871239Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923875245Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923879192Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923883156Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923895139Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923899319Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923903179Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923907226Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923911065Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923915022Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key repetition_penalty
2024-07-07T16:38:55.923918842Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: repetition_penalty
2024-07-07T16:38:55.923922756Z [TM][DEBUG] T* turbomind::Tensor::getPtr() const [with T = std::byte] start
2024-07-07T16:38:55.923926762Z [TM][DEBUG] getPtr with type x, but data type is: f4
2024-07-07T16:38:55.923930569Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: min_length
2024-07-07T16:38:55.923934689Z [TM][DEBUG] void turbomind::DynamicDecodeLayer<T>::setup(size_t, size_t, turbomind::TensorMap*) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.923939229Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: beam_search_diversity_rate
2024-07-07T16:38:55.923943206Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923947229Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key temperature
2024-07-07T16:38:55.923951189Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: temperature
2024-07-07T16:38:55.923955684Z [TM][DEBUG] void turbomind::TopKSamplingLayer<T>::setup(size_t, size_t, turbomind::TensorMap*) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.923960054Z [TM][DEBUG] void turbomind::BaseSamplingLayer<T>::setup(size_t, size_t, turbomind::TensorMap*) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.923964204Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: runtime_top_k
2024-07-07T16:38:55.923973724Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key runtime_top_k
2024-07-07T16:38:55.923977726Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: runtime_top_k
2024-07-07T16:38:55.923981706Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: runtime_top_p
2024-07-07T16:38:55.923985559Z [TM][DEBUG] turbomind::Tensor& turbomind::TensorMap::at(const string&) for key runtime_top_p
2024-07-07T16:38:55.923989446Z [TM][DEBUG] bool turbomind::TensorMap::isExist(const string&) const for key: runtime_top_p
2024-07-07T16:38:55.923993479Z [TM][DEBUG] void turbomind::TopKSamplingLayer<T>::allocateBuffer(size_t, turbomind::Tensor, turbomind::Tensor) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.923998964Z [TM][DEBUG] void turbomind::BaseSamplingLayer<T>::allocateBuffer(size_t, turbomind::Tensor, turbomind::Tensor) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.924003096Z [TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.924007118Z [TM][DEBUG] ReMalloc the buffer 0x4ead64200 to release unused memory to memory pools.
2024-07-07T16:38:55.924011106Z [TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool) const
2024-07-07T16:38:55.924015061Z [TM][DEBUG] Free buffer 0x4ead64200
2024-07-07T16:38:55.924019321Z [TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
2024-07-07T16:38:55.924023318Z [TM][DEBUG] malloc buffer 0x4ead64200 with size 96
2024-07-07T16:38:55.924027188Z [TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.924031378Z [TM][DEBUG] ReMalloc the buffer 0x4ead64400 to release unused memory to memory pools.
2024-07-07T16:38:55.924038738Z [TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool) const
2024-07-07T16:38:55.924044275Z [TM][DEBUG] Free buffer 0x4ead64400
2024-07-07T16:38:55.924048344Z [TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
2024-07-07T16:38:55.924053248Z [TM][DEBUG] malloc buffer 0x4ead64400 with size 96
2024-07-07T16:38:55.924057131Z [TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = int; size_t = long unsigned int]
2024-07-07T16:38:55.924061144Z [TM][DEBUG] ReMalloc the buffer 0x4ead64600 to release unused memory to memory pools.
2024-07-07T16:38:55.924065038Z [TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool) const
2024-07-07T16:38:55.924069155Z [TM][DEBUG] Free buffer 0x4ead64600
2024-07-07T16:38:55.924073048Z [TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
2024-07-07T16:38:55.924077061Z [TM][DEBUG] malloc buffer 0x4ead64600 with size 96
2024-07-07T16:38:55.924080871Z [TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = float; size_t = long unsigned int]
2024-07-07T16:38:55.924084975Z [TM][DEBUG] ReMalloc the buffer 0x4ec68ca00 to release unused memory to memory pools.
2024-07-07T16:38:55.924088798Z [TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool) const
2024-07-07T16:38:55.924092741Z [TM][DEBUG] Free buffer 0x4ec68ca00
2024-07-07T16:38:55.924096721Z [TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
2024-07-07T16:38:55.924100795Z [TM][DEBUG] malloc buffer 0x4ec687400 with size 3072000
2024-07-07T16:38:55.924104574Z [TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = bool; size_t = long unsigned int]
2024-07-07T16:38:55.924108671Z [TM][DEBUG] Reuse original buffer 0x4ead83c00 with size 32 and do nothing for reMalloc.
2024-07-07T16:38:55.924112488Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924116465Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924120365Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924124178Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924128011Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924132331Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924136168Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924140631Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924144505Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924148333Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924152145Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924155955Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924159763Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924163705Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924167658Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924174915Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924178923Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924182817Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924186717Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924190667Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924194517Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924198474Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924202343Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924206183Z [TM][DEBUG] T turbomind::Tensor::getVal(size_t) const [with T = unsigned int; size_t = long unsigned int] start
2024-07-07T16:38:55.924210030Z [TM][DEBUG] void* turbomind::IAllocator::reMalloc(T*, size_t, bool, bool) [with T = void; size_t = long unsigned int]
2024-07-07T16:38:55.924214807Z [TM][DEBUG] ReMalloc the buffer 0x4ec306600 since it is too small.
2024-07-07T16:38:55.924218730Z [TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool) const
2024-07-07T16:38:55.924222684Z [TM][DEBUG] Free buffer 0x4ec306600
2024-07-07T16:38:55.924226783Z [TM][DEBUG] virtual void* turbomind::Allocator<turbomind::AllocatorType::CUDA>::malloc(size_t, bool, bool)
2024-07-07T16:38:55.924230970Z terminate called after throwing an instance of 'std::runtime_error'
2024-07-07T16:38:55.924234893Z   what():  [TM][ERROR] CUDA runtime error: out of memory /opt/lmdeploy/src/turbomind/utils/allocator.h:231 
2024-07-07T16:38:55.924238923Z 
2024-07-07T16:38:55.924243283Z [TM][DEBUG] virtual void turbomind::Allocator<turbomind::AllocatorType::CUDA>::free(void**, bool) const
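
For anyone else digging through a dump like this, here is a minimal sketch of a helper that summarises the allocator lines. `summarizeAllocatorLog` is a hypothetical name, not part of lmdeploy, and it only relies on the message formats visible in the log above. In this excerpt the failing call follows the "since it is too small" reMalloc of buffer 0x4ec306600, and no "malloc buffer ... with size" line is printed for it, so the requested size cannot be read from the log.

```javascript
// Hypothetical helper (not part of lmdeploy): summarise allocator activity
// in a TurboMind DEBUG log, using only the message formats seen above.
function summarizeAllocatorLog(logText) {
  const mallocRe = /malloc buffer (0x[0-9a-f]+) with size (\d+)/;
  const growRe = /ReMalloc the buffer (0x[0-9a-f]+) since it is too small/;
  const summary = {
    mallocCount: 0,       // successful mallocs that logged a size
    largestMalloc: 0,     // largest logged size, in bytes
    lastGrowBuffer: null, // last buffer that was reallocated to grow
    oomSeen: false,       // whether the CUDA OOM message appears
  };
  for (const line of logText.split("\n")) {
    const m = mallocRe.exec(line);
    if (m) {
      summary.mallocCount += 1;
      summary.largestMalloc = Math.max(summary.largestMalloc, Number(m[2]));
    }
    const g = growRe.exec(line);
    if (g) summary.lastGrowBuffer = g[1];
    if (line.includes("CUDA runtime error: out of memory")) summary.oomSeen = true;
  }
  return summary;
}
```

Feeding it the saved log text (for example, read with `fs.readFileSync` in Node) gives a quick view of which buffer was being grown right before the crash.
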
zhyncs commented 1 month ago

Hi. I don't currently have a 40 GB A100 on hand. Can I check whether this issue also reproduces on an 80 GB card instead?

coolhok commented 1 month ago

I encountered the same problem when prefill n_token > 2048. After making the following modifications, it works normally for me.

src/turbomind/utils/allocator.h

image

zhyncs commented 1 month ago

cc @irexyc

coolhok commented 1 month ago

I forgot to mention that I am not running on an NVIDIA GPU. I'm not able to disclose the specific hardware. I hope this information is helpful.

josephrocca commented 1 month ago

@zhyncs I just tested this, and yes, surprisingly [1], the bug is reproducible on an A100 80G too. I used the exact same commands as I mentioned above, and hit the same OOM error.

[1] I guessed that it would only be reproducible on ~48GB GPUs like A40, L40, etc. - due to 70B Llama2 being a "tight fit". But it seems it's not related to the fraction of the GPU's VRAM taken by the model params.
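
To put rough numbers on the "tight fit" guess, here is a back-of-the-envelope sketch. It assumes 70e9 parameters stored at 4 bits each and ignores quantization scales, zero points, any layers kept in fp16, and the KV cache, so treat the results as ballpark figures only. The 48 and 80 values are nominal card sizes, treated here as GiB.

```javascript
// Rough estimate of quantized weight size in GiB.
// Assumption: every parameter is stored at `bitsPerWeight` bits; real AWQ
// checkpoints also carry scales/zero points, so this is a lower bound.
function approxWeightGiB(numParams, bitsPerWeight) {
  const bytes = (numParams * bitsPerWeight) / 8;
  return bytes / 2 ** 30;
}

const weightsGiB = approxWeightGiB(70e9, 4); // about 32.6 GiB

// Share of nominal VRAM taken by the weights alone.
for (const vramGiB of [48, 80]) {
  const share = weightsGiB / vramGiB; // roughly 0.68 for 48, 0.41 for 80
  console.log(`${vramGiB} GiB card: weights take ~${(share * 100).toFixed(0)}%`);
}
```

So the weights alone take very different shares of the two cards, which fits the observation that the crash does not track that fraction.
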

zhyncs commented 1 month ago

Ok. I'll take a look today.

zhyncs commented 1 month ago

Hi @josephrocca, could you try this build? https://github.com/zhyncs/lmdeploy-build/releases/tag/aa07f92

josephrocca commented 1 month ago

Hi, for some reason I got an immediate OOM without any requests being served. But I'm not sure if I did something wrong when installing the whl - I'm not a Python dev, and had to talk to Claude/ChatGPT about changing the filenames so that pip would accept it as a valid whl file.

Ideally there would be a Docker tag that I could test, since then I can just paste it in Runpod for several different machines within a few seconds and easily test them all. I think you didn't appreciate how inexperienced I am here - apologies :sweat_smile: I am a humble web developer.

zhyncs commented 1 month ago

I'm sorry for any inconvenience. If it's convenient, could you please let me know your CUDA version and Python version? Thanks.

josephrocca commented 1 month ago

I used openmmlab/lmdeploy:v0.5.0 (which has Python version 3.8 IIRC) on a 2x3090 Runpod machine and I think nvidia-smi said CUDA version 12.4. But I downloaded the 11.8 nightly whl from the page you linked because IIRC this is the one used in openmmlab/lmdeploy:v0.5.0, which should work due to forwards-compatibility? And Claude AI told me to use pip with --force-reinstall to install the nightly whl over the original.

If it works fine for you, then I likely made some sort of mistake during install, so this issue can likely be safely closed, and I'll re-open if needed when testing the next released version.

zhyncs commented 1 month ago

pip3 install https://github.com/zhyncs/lmdeploy-build/releases/download/49208aa/lmdeploy-0.5.0+cu121+49208aa-cp38-cp38-manylinux2014_x86_64.whl --force-reinstall --no-deps

Could you try this? Thanks. To rule out environment issues, consider starting from a fresh Docker container.

josephrocca commented 1 month ago

pip doesn't like that URL due to the +cu121+49208aa part but I removed that like I did previously (following Claude AI's advice). I tried using openmmlab/lmdeploy:v0.5.0 first, but that failed, saying something like "turbomind was not installed correctly, falling back to pytorch backend". Then I tried nvcr.io/nvidia/tritonserver:22.12-py3 since it is the base image used in the docker file in this repo. It errored with No module named 'transformers'. So then I tried installing some stuff that the Dockerfile installs:

rm /etc/apt/sources.list.d/cuda*.list && apt-get update && apt-get install -y --no-install-recommends \
    rapidjson-dev libgoogle-glog-dev gdb python3.8-venv \
    && rm -rf /var/lib/apt/lists/* && cd /opt && python3 -m venv py38
python3 -m pip install --no-cache-dir --upgrade pip setuptools==69.5.1 &&\
python3 -m pip install --no-cache-dir torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu118 &&\
python3 -m pip install --no-cache-dir cmake packaging wheel

But I still got No module named 'transformers'. So I tried removing --no-deps from the command you gave (maybe I should have done this at the start), and then it installed and ran correctly. But then when I ran it and did the 30 concurrent requests as mentioned in my original post, I got the same error:

what():  [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/allocator.h:231

Again, I'm not sure if this is because I did something wrong during install.

Honestly I think I am not the person best suited to trying to set this up because I don't really understand it - it takes me a long time (more than an hour to do this) and I end up mostly confused :sweat_smile: I'm not in a hurry to upgrade from 0.4.2 so I will wait for the next release, and give feedback on that. Please feel free to close this issue if you do not personally observe the problem after trying the steps that I have reported in the original post.
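
On pip rejecting the `+cu121+49208aa` part of the wheel name: one possible explanation (a guess, since pip's actual message isn't quoted in this thread) is that the version string `0.5.0+cu121+49208aa` contains two `+` separators. PEP 440 allows a single `+` followed by a local label made of alphanumeric segments separated by periods. The sketch below is a deliberately simplified check, not pip's real parser: it ignores pre, post, and dev release markers, and both function names are made up for illustration.

```javascript
// Simplified PEP 440-style check: dotted release numbers, optionally followed
// by ONE "+local" label made of alphanumeric segments separated by periods.
// Assumption: pre/post/dev release markers are out of scope for this sketch.
const SIMPLE_VERSION_RE = /^\d+(\.\d+)*(\+[a-z0-9]+(\.[a-z0-9]+)*)?$/i;

// Extract the version field from a wheel filename or URL.
// Wheel names follow: name-version(-build)-python-abi-platform.whl
function wheelVersion(filenameOrUrl) {
  const filename = filenameOrUrl.split("/").pop();
  const parts = filename.replace(/\.whl$/, "").split("-");
  return parts[1];
}

function isSimpleValidVersion(version) {
  return SIMPLE_VERSION_RE.test(version);
}
```

With this check, the version taken from `lmdeploy-0.5.0+cu121+49208aa-cp38-cp38-manylinux2014_x86_64.whl` is flagged as invalid, while a form such as `0.5.0+cu121.49208aa` passes.
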

zhyncs commented 1 month ago

Thank you for trying this and for the detailed response. I'll try to reproduce it again.
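
For anyone retrying the reproduction, here is a minimal sketch of the load pattern from the original post: 30 requests sent at once, each with a different leading character so the prompts share no common prefix. `buildRequests` and `sendAll` are illustrative names, the sampling parameters are a subset of those in the original post, and the default URL assumes plain HTTP on the `--server-port 3000` from the serve command.

```javascript
// Build `count` completion requests whose prompts differ from the first
// character onward, so prefix caching has nothing to reuse between them.
function buildRequests(basePrompt, count = 30) {
  return Array.from({ length: count }, (_, i) => ({
    model: "lmdeploy/llama2-chat-70b-4bit",
    max_tokens: 1024,
    temperature: 0.8,
    top_k: 40,
    top_p: 0.9,
    stream: true,
    prompt: i + basePrompt,
  }));
}

// Fire all requests concurrently and resolve to their HTTP status codes.
// Assumption: the server is reachable at `url`; adjust host and port as needed.
async function sendAll(requests, url = "http://127.0.0.1:3000/v1/completions") {
  const run = async (body) => {
    const res = await fetch(url, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    });
    await res.text(); // drain the streamed body so generation runs to completion
    return res.status;
  };
  return Promise.all(requests.map(run));
}
```

Usage would look like `sendAll(buildRequests(longPrompt)).then(console.log)`, where `longPrompt` is the sonnet text from the original post.
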