Update: in my case, the problem seems to lie in `cpu.shares`. If it is lower than or equal to 512 and the quota is -1, then `n_jobs` will always be 0.
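For illustration, here is a minimal sketch of that arithmetic. It assumes the CPU count falls back to `cpu.shares / 1024` when `cpu.cfs_quota_us` is -1, which is my reading of the behaviour rather than BentoML's exact code:

```python
# Sketch only: assumed fallback of the cgroup CPU count when no CFS quota is set.
def assumed_cpu_count(quota_us: int, period_us: int, shares: int) -> float:
    if quota_us > 0:
        return quota_us / period_us   # quota-based count
    return shares / 1024              # assumed fallback based on cpu.shares

for shares in (2, 256, 512, 1024):
    cpus = assumed_cpu_count(quota_us=-1, period_us=100_000, shares=shares)
    print(f"shares={shares}: cpus={cpus} -> n_jobs={round(cpus)}")
# Any shares value <= 512 ends up as 0 after rounding, which is where n_jobs = 0 comes from.
```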
Hi @amelki - did you set a resource limit/request for this pod? If so, could you share the config?
Hi @parano, I didn't set any request/limit at first, no. But note that even if I set a request/limit manually on a deployment, it would change `cpu.shares` to, say, 512, but there is still a bug in BentoML and `n_jobs` will still be 0 - @aarnphm is aware of the issue and told me he is working on a fix :)
I just saw I can customize resources for all new pods here: https://github.com/bentoml/yatai-chart/blob/9dfea715a7297d4bcdd2cdc353d9b0a9c130af37/values.yaml#L77. Will give it a try, thanks!
@parano FYI, I tried several things:
1/ I added the following resources to the Deployment resource created in my `yatai` namespace:

    resources:
      limits:
        cpu: 2000m
        memory: 2048Mi
      requests:
        cpu: 1000m
        memory: 1024Mi
Then, of course, the values of the cpu quota/period files have a better shape:

| source | value |
|---|---|
| file `/sys/fs/cgroup/cpu/cpu.cfs_quota_us` | 100000 |
| file `/sys/fs/cgroup/cpu/cpu.cfs_period_us` | 100000 |
| file `/sys/fs/cgroup/cpu/cpu.shares` | 512 |

But as I told you, since `cpu.shares` is still below 1024, the serve API still does not work, because `n_jobs` is still 0 => it just confirms the bug in the BentoML code. (A small snippet to dump these values from inside the pod is shown after this list.)
2/ I added the resources to my values file so that they are applied at Yatai install time.
I expected to see these resources in the generated Deployment resource, but that's not the case.
3/ I also specified the resources at deployment time, within the Yatai console (more precisely, I left things as is). Interestingly, I don't see any resources in the generated Deployment resource either.
I can report issues 2/ and 3/ in a separate issue in the Yatai repo if you think it's more appropriate.
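Snippet referenced in 1/ above: a quick way to dump the cgroup CPU values from inside a pod. It just reads the same files listed in the table, nothing BentoML-specific:

```python
# Print the cgroup v1 CPU settings and the host-visible CPU count from inside a pod.
import os

for path in (
    "/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
    "/sys/fs/cgroup/cpu/cpu.cfs_period_us",
    "/sys/fs/cgroup/cpu/cpu.shares",
):
    try:
        with open(path) as f:
            print(f"{path} = {f.read().strip()}")
    except FileNotFoundError:
        print(f"{path} not found (cgroup v2 node?)")

print(f"os.cpu_count() = {os.cpu_count()}")
```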
@aarnphm I have been able to properly test the fix with version `1.0.0a6.post13+gd77e009c`. Your fix is working, since I don't see the `n_jobs = 0` error anymore.
Unfortunately, I stumbled upon a new issue further in the stack:
Does this ring a bell on your side? As a reminder, my service works perfectly when I serve it on my laptop.
OK @aarnphm @parano I have some more information:
the new error (`'bool' object has no attribute 'get'`) does not occur at prediction time, but at pod startup time! It seems to be a problem when initializing the runner.
I tried @aarnphm's fix on top of v1.0.0-a6 (see https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372) and the good news is that my service does work now!
So it means that some code introduced in the main branch breaks sklearn runners... Shall I open a new issue and close this one?
@amelki Are you sure you're using the right branch? #1 has been an error we've seen a couple of times recently, but we thought we had dealt with it: https://github.com/bentoml/BentoML/pull/2369
Perhaps the branch that you're deploying to your pod is the latest release, and does not contain this fix, which I don't think we've released yet. Would that make sense?
Or have you walked through the steps to deploy your local fixed branch through yatai?
@timliubentoml thanks for getting back to me. I'm 99% positive that I'm testing the correct version (main). I tried 3 times.
If I check the version on the pod, I get: `bentoml, version 1.0.0a6.post14+gc6a50e6b`
Here is how I build my bento:
    git clone https://github.com/bentoml/BentoML.git
    cd BentoML
    python -m venv .bentoml-main
    source .bentoml-main/bin/activate
    pip install -e .
    export BENTOML_BUNDLE_LOCAL_BUILD=True
    export SETUPTOOLS_USE_DISTUTILS=stdlib
    pip install -U setuptools
    pip install sklearn
    cd path/to/mybento
    bentoml build
    bentoml push mybento:myid
Here is the complete stack trace:
File "/opt/conda/lib/python3.9/site-packages/starle
tte/routing.py", line 624, in lifespan
async with self.lifespan_context(app):
File "/opt/conda/lib/python3.9/site-packages/starle
tte/routing.py", line 521, in __aenter__
await self._router.startup()
File "/opt/conda/lib/python3.9/site-packages/starle
tte/routing.py", line 603, in startup
handler()
File "/opt/conda/lib/python3.9/site-packages/bentom
l/_internal/runner/local.py", line 16, in setup
self._runner._setup() # type:
ignore[reportPrivateUsage]
File "/opt/conda/lib/python3.9/site-packages/bentom
l/_internal/frameworks/sklearn.py", line 170, in
_setup
self._model = load(self._tag,
model_store=self.model_store)
File "/opt/conda/lib/python3.9/site-packages/simple
_di/__init__.py", line 139, in _
return func(*_inject_args(bind.args),
**_inject_kwargs(bind.kwargs))
File "/opt/conda/lib/python3.9/site-packages/bentom
l/_internal/frameworks/sklearn.py", line 68, in load
model = model_store.get(tag)
AttributeError: 'bool' object has no attribute 'get'
If I build a bento using https://github.com/bentoml/BentoML/releases/tag/v1.0.0-a6 or https://github.com/amelki/BentoML/tree/test/v1.0.0-a6-with-fix-for-2372, I don't have the problem: my pods start correctly.
So I would say there might be a regression in one of these commits: https://github.com/bentoml/BentoML/compare/v1.0.0-a6...main
@timliubentoml I found the commit that is causing the issue: https://github.com/bentoml/BentoML/commit/f30d5290e8efb0e242727e47640e7619b13607c7. I tested the commit just before (`e403eee9a9d436e92ce52dc49986cf30e9ea43dc`), and startup is OK.
Starting from this commit (`f30d5290e8efb0e242727e47640e7619b13607c7`), startup is broken.
Oh, awesome, I was about to respond. @larme I think we've identified the commit that is breaking this. Could you take a look at a fix?
Also, not sure if it's related, but I find this line suspicious: https://github.com/bentoml/BentoML/blob/d77e009cf9f70fe7cd95c620cced5c309487de1e/bentoml/_internal/frameworks/sklearn.py#L86.
`model_store` does not seem to be used... shouldn't it be passed to `bentoml.models.create`?
One of our developers thinks we've identified the issue. Please stand by for a commit and release; we'll get back to you with an ETA.
Thanks for the help in identifying this issue!!!
Hi @amelki! We just pushed the a7 release to PyPI last night. Could you try upgrading to the latest release? It should fix this issue.
@timliubentoml @parano I could finally test my model on BentoML 1.0.0a7 with Yatai 0.2.1 on my EKS cluster and it is working just fine! Many thanks to you and the team!
Great to hear :) Let me know if you run into any other trouble.
I have created a simple service:
When I run it on my laptop (MacBook Pro M1), using
everything works fine when I invoke the generated `classify` API. Now when I push this service to my Yatai server as a bento and deploy it to my K8s cluster (EKS), I get the following error when I invoke the API:
Looking at the code, the problem lies in https://github.com/bentoml/BentoML/blob/119b103e2417291b18127d64d38f092893c8de4f/bentoml/_internal/frameworks/sklearn.py#L163. In my case, `_num_threads` returns 0. Digging a bit further, `resource_quota.cpu` is computed here: https://github.com/bentoml/BentoML/blob/119b103e2417291b18127d64d38f092893c8de4f/bentoml/_internal/runner/utils.py#L208. Here are the values I get on the pod running the API:

- `/sys/fs/cgroup/cpu/cpu.cfs_quota_us`
- `/sys/fs/cgroup/cpu/cpu.cfs_period_us`
- `/sys/fs/cgroup/cpu/cpu.shares`
- `os.cpu_count()`

Given those values, `query_cgroup_cpu_count()` will return `0.001953125`, which once rounded ends up as 0, meaning `n_jobs` will always be 0. So the call will always fail on my pods.
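For what it's worth, 0.001953125 is exactly 2/1024, which suggests `cpu.shares` was 2 (the value Kubernetes assigns to a pod with no CPU request) and the count fell back to shares/1024. Here is a minimal sketch of how that value collapses to zero, and how clamping to at least one worker would avoid the failure (my own illustration, not necessarily BentoML's actual fix):

```python
import math

shares = 2                      # assumed cpu.shares for a pod with no CPU request
cpus = shares / 1024
print(cpus)                     # 0.001953125, matching the value reported above
print(round(cpus))              # 0 -> n_jobs = 0, so the serve call fails
print(max(1, math.ceil(cpus)))  # clamping would always leave at least one job
```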