
Service with sklearn model fails on my EKS cluster #2371

Closed · amelki closed this 2 years ago

amelki commented 2 years ago

I have created a simple service:

import bentoml
import numpy as np
from bentoml.io import NumpyNdarray

model_runner = bentoml.sklearn.load_runner("mymodel:latest")
svc = bentoml.Service("myservice", runners=[model_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
def classify(input_series: np.ndarray) -> np.ndarray:
    return model_runner.run(input_series)

When I run it on my laptop (MacBook Pro M1), using

bentoml serve ./service.py:svc --reload

everything works fine when I invoke the generated classify API.
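For reference, a minimal client call against the locally served API might look like the following (a sketch: 127.0.0.1:3000 assumes the default serve address, and the 4-feature payload is hypothetical; the actual shape depends on the model):

import requests

# POST a JSON-encoded array to the generated /classify endpoint;
# adjust host, port, and payload to match your own deployment.
response = requests.post(
    "http://127.0.0.1:3000/classify",
    json=[[5.1, 3.5, 1.4, 0.2]],
)
print(response.json())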

Now when I push this service to my Yatai server as a bento and deploy it to my K8s cluster (EKS), I get the following error when I invoke the API:

(screenshot of the error omitted)

Looking at the code, the problem lies in https://github.com/bentoml/BentoML/blob/119b103e2417291b18127d64d38f092893c8de4f/bentoml/_internal/frameworks/sklearn.py#L163: in my case, _num_threads returns 0. Digging a bit further, resource_quota.cpu is computed here: https://github.com/bentoml/BentoML/blob/119b103e2417291b18127d64d38f092893c8de4f/bentoml/_internal/runner/utils.py#L208. Here are the values I get on the pod running the API:

source                                        value
file /sys/fs/cgroup/cpu/cpu.cfs_quota_us      -1
file /sys/fs/cgroup/cpu/cpu.cfs_period_us     100000
file /sys/fs/cgroup/cpu/cpu.shares            2
call to os.cpu_count()                        2

Given those values, query_cgroup_cpu_count() returns 0.001953125, which once rounded ends up as 0, meaning n_jobs will always be 0. So the call will always fail on my pods.
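For illustration, here is a minimal sketch of that computation (one plausible reading consistent with the numbers above, not BentoML's exact code):

import os

def approx_cgroup_cpu_count(quota_us=-1, period_us=100_000, shares=2):
    # Collect candidate CPU limits from the cgroup v1 files and take the
    # most restrictive one; 1024 cpu.shares conventionally equals 1 CPU.
    limits = []
    if quota_us > 0:
        limits.append(quota_us / period_us)
    if shares > 0:
        limits.append(shares / 1024)          # 2 / 1024 -> 0.001953125
    limits.append(float(os.cpu_count() or 1))
    return min(limits)

print(approx_cgroup_cpu_count())         # 0.001953125
print(round(approx_cgroup_cpu_count()))  # 0 -> n_jobs == 0, which sklearn rejects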

amelki commented 2 years ago

Update: in my case, the problem seems to lie in cpu.shares. If it is lower than or equal to 512 and the quota is -1, then n_jobs will always be 0.
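(The 512 boundary presumably comes down to rounding: 512/1024 is exactly 0.5, and Python's round() uses banker's rounding, so it still lands on 0. A quick check:)

print(round(512 / 1024))  # 0 (0.5 rounds to the nearest even integer)
print(round(513 / 1024))  # 1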

parano commented 2 years ago

Hi @amelki - did you set a resource limit/request for this pod? If so, could you share the config?

amelki commented 2 years ago

Hi @parano, I didn't set any request/limit at first, no. But note that even if I set a request/limit manually on a deployment, it would only change cpu.shares to, say, 512; there is still a bug in BentoML and n_jobs will still be 0 - @aarnphm is aware of the issue and told me he is working on a fix :) I just saw I can customize resources for all new pods here: https://github.com/bentoml/yatai-chart/blob/9dfea715a7297d4bcdd2cdc353d9b0a9c130af37/values.yaml#L77. Will give it a try, thanks!

amelki commented 2 years ago

@parano FYI, I tried several things:

1/ I added the following resources to the Deployment resource created in my yatai namespace:

resources:
  limits:
    cpu: 2000m
    memory: 2048Mi
  requests:
    cpu: 1000m
    memory: 1024Mi

Then, of course, the values in the cpu quota/period/shares files look healthier:

source                                        value
file /sys/fs/cgroup/cpu/cpu.cfs_quota_us      100000
file /sys/fs/cgroup/cpu/cpu.cfs_period_us     100000
file /sys/fs/cgroup/cpu/cpu.shares            512

but as I said, since cpu.shares is still below 1024, the served API still does not work because n_jobs is still 0, which just confirms the bug in the BentoML code.

2/ I added the resources to my values file so that they are applied at Yatai install time.

I was expecting to see these resources in the generated Deployment resource, but that is not the case.

3/ I also specified the resources at deployment time, within the Yatai console (more precisely, I left the defaults as they were). Interestingly, I don't see any resources in the generated Deployment resource either.

(screenshot omitted)

I can report issues 2/ and 3/ in a separate issue in the Yatai repo if you think it's more appropriate.

aarnphm commented 2 years ago

#2372 just got merged, which should address this issue.

amelki commented 2 years ago

@aarnphm I have been able to properly test the fix with version 1.0.0a6.post13+gd77e009c. Your fix works: I don't see the n_jobs = 0 error anymore. Unfortunately, I stumbled upon a new issue further down the stack:

(screenshot omitted; the error is AttributeError: 'bool' object has no attribute 'get', full trace below)

Does this ring a bell on your side? As a reminder, my service works perfectly when I serve it on my laptop.

amelki commented 2 years ago

OK @aarnphm @parano I have some more information:

  1. The new error ('bool' object has no attribute 'get') does not occur at prediction time, but at pod startup time! It seems to be a problem when initializing the runner.

  2. I tried @aarnphm's fix on top of v1.0.0-a6 (see https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372) and the good news is that my service does work now!

So it means some code introduced on the main branch breaks sklearn runners... Shall I open a new issue and close this one?

timliubentoml commented 2 years ago

@amelki Are you sure you're using the right branch? The error in point 1 is one we've seen a couple of times recently and thought we had dealt with: https://github.com/bentoml/BentoML/pull/2369

Perhaps the branch that you're deploying to your pod is the latest release, which does not contain this fix (I don't think we've released it yet). Would that make sense?

timliubentoml commented 2 years ago

Or have you walked through the steps to deploy your local fixed branch through yatai?

amelki commented 2 years ago

@timliubentoml thanks for getting back to me. I'm 99% positive that I'm testing the correct version (main); I tried 3 times. If I check the version on the pod I get: bentoml, version 1.0.0a6.post14+gc6a50e6b. Here is how I build my bento:

git clone https://github.com/bentoml/BentoML.git
cd BentoML
python -m venv .bentoml-main
source .bentoml-main/bin/activate
pip install -e .                      # install BentoML from source (main)
export BENTOML_BUNDLE_LOCAL_BUILD=True
export SETUPTOOLS_USE_DISTUTILS=stdlib
pip install -U setuptools
pip install sklearn
cd path/to/mybento
bentoml build                         # build the bento from the project's bentofile
bentoml push mybento:myid             # push it to Yatai

Here is the complete stack trace:

                             File "/opt/conda/lib/python3.9/site-packages/starle
                           tte/routing.py", line 624, in lifespan               
                               async with self.lifespan_context(app):           
                             File "/opt/conda/lib/python3.9/site-packages/starle
                           tte/routing.py", line 521, in __aenter__             
                               await self._router.startup()                     
                             File "/opt/conda/lib/python3.9/site-packages/starle
                           tte/routing.py", line 603, in startup                
                               handler()                                        
                             File "/opt/conda/lib/python3.9/site-packages/bentom
                           l/_internal/runner/local.py", line 16, in setup      
                               self._runner._setup()  # type:                   
                           ignore[reportPrivateUsage]                           
                             File "/opt/conda/lib/python3.9/site-packages/bentom
                           l/_internal/frameworks/sklearn.py", line 170, in     
                           _setup                                               
                               self._model = load(self._tag,                    
                           model_store=self.model_store)                        
                             File "/opt/conda/lib/python3.9/site-packages/simple
                           _di/__init__.py", line 139, in _                     
                               return func(*_inject_args(bind.args),            
                           **_inject_kwargs(bind.kwargs))                       
                             File "/opt/conda/lib/python3.9/site-packages/bentom
                           l/_internal/frameworks/sklearn.py", line 68, in load 
                               model = model_store.get(tag)                     
                           AttributeError: 'bool' object has no attribute 'get' 

If I build a bento using https://github.com/bentoml/BentoML/releases/tag/v1.0.0-a6 or https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372, I don't have the problem; my pods start correctly.

So I would say there might be a regression in one of these commits: https://github.com/bentoml/BentoML/compare/v1.0.0-a6...main

amelki commented 2 years ago

@timliubentoml I found the commit that is causing the issue: https://github.com/bentoml/BentoML/commit/f30d5290e8efb0e242727e47640e7619b13607c7. I tested the commit just before (e403eee9a9d436e92ce52dc49986cf30e9ea43dc), and startup is OK. Starting from this commit (f30d5290e8efb0e242727e47640e7619b13607c7), startup fails.

timliubentoml commented 2 years ago

Oh, awesome, I was about to respond. @larme I think we've identified the commit that is causing this issue. Could you take a look at a fix?

amelki commented 2 years ago

Also, not sure if it's related, but I find this line suspicious: https://github.com/bentoml/BentoML/blob/d77e009cf9f70fe7cd95c620cced5c309487de1e/bentoml/_internal/frameworks/sklearn.py#L86. model_store does not seem to be used... shouldn't it be passed to bentoml.models.create?
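(To illustrate the failure mode only, here is a minimal hypothetical sketch, not BentoML's actual code, of how a boolean bound to a store-like parameter produces exactly this AttributeError:)

class DummyModelStore:
    # Stand-in for a model store object that exposes .get(tag).
    def get(self, tag):
        return f"model for {tag}"

def load(tag, model_store=DummyModelStore()):
    # The callee assumes model_store behaves like a store...
    return model_store.get(tag)

print(load("mymodel:latest"))                 # works
try:
    # ...but if a flag gets bound to that parameter instead:
    load("mymodel:latest", model_store=True)
except AttributeError as err:
    print(err)                                # 'bool' object has no attribute 'get'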

timliubentoml commented 2 years ago

One of our developers thinks we've identified the issue. Please stand by for a commit and release. I'll get back to you with an ETA.

Thanks for the help in identifying this issue!!!

timliubentoml commented 2 years ago

Hi @amelki! We just issued the a7 release to PyPI last night. Could you try upgrading to the latest release? It should fix this issue.

amelki commented 2 years ago

@timliubentoml @parano I could finally test my model on BentoML 1.0.0a7 with Yatai 0.2.1 on my EKS cluster and it is working just fine! Many thanks to you and the team!

aarnphm commented 2 years ago

Great to hear :) Let me know if you run into any other trouble.