det-lab / jupyterhub-deploy-kubernetes-jetstream

CDMS JupyterHub deployment on XSEDE Jetstream

insufficient cpu error #69

Closed: pibion closed this issue 2 years ago

pibion commented 2 years ago

@zonca, @rahmanole is trying to spin up a full node but is getting the following error:

2022-04-28T22:23:20.360543Z [Warning] 0/2 nodes are available: 1 Insufficient memory, 2 Insufficient cpu.

Nobody else is using an instance right now.
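
For reference, when the scheduler reports Insufficient cpu / Insufficient memory, the node-level picture can be checked with something like the following (a rough sketch; run with kubectl access to the cluster):

  # per-node CPU/memory requests vs. allocatable capacity
  kubectl describe nodes | grep -A 10 "Allocated resources"

  # which pods are holding those requests, and on which node they landed
  kubectl get pods --all-namespaces -o wide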

pibion commented 2 years ago

More info: requesting a "default" instance produced the same error. This may be because another student had logged on.

pibion commented 2 years ago

When @rahmanole tries to spawn a "tiny" instance while the other student is running a "default", he gets the same error.

pibion commented 2 years ago

And another error:

2022-04-28T22:35:54.562735Z [Warning] 0/2 nodes are available: 1 Insufficient cpu, 1 Insufficient memory, 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
zonca commented 2 years ago

If you log in with your account, you can check in the admin panel whether there are any leftover instances running.
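
(A quick equivalent from the command line, assuming the hub lives in the jhub namespace shown further down, would be to list the user pods directly; servers show up as jupyter-<username>:)

  kubectl get pods -n jhub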


zonca commented 2 years ago

I must have configured the limits improperly.
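
A sketch of how to double-check what the spawner actually requested (pod name and jhub namespace taken from the kubectl output later in this thread; adjust as needed):

  # print the CPU/memory requests and limits that ended up on a user pod
  kubectl get pod jupyter-pibion -n jhub -o jsonpath='{.spec.containers[*].resources}'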


pibion commented 2 years ago

@zonca one server is running; when @rahmanole first tried to spawn, there were no servers running.

zonca commented 2 years ago

@pibion I had several leftover pods from my tests on monitoring the backups and from the backup system itself, see #64. I tested just now, and both 1 full node and (1 default + 1 tiny) work fine.

So I'll check again tomorrow and try to understand why those completed pods still occupy CPU slots.
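
As a rough sketch of the cleanup involved (Completed pods have status.phase=Succeeded; the namespace below is a placeholder):

  # list pods that have finished but are still present
  kubectl get pods --all-namespaces --field-selector=status.phase=Succeeded

  # delete the completed pods in a given namespace
  kubectl delete pods -n <namespace> --field-selector=status.phase=Succeeded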

Anyway, do you plan more usage? Would you like to have another node deployed?

With the new Kubernetes version I cannot have users on the master node, so they can only use the 1 medium worker node.
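
For reference, the taint mentioned in the scheduler warning can be listed with something like:

  # show which node carries the node-role.kubernetes.io/master taint
  kubectl describe nodes | grep -E "^Name:|Taints:"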

pibion commented 2 years ago

I don't think we need to deploy another node; it's very rare that two people are working at the same time, and when they are we have always been fine with 1 default + 1 tiny.

It would be awesome to have the autoscaling handled automatically, though. Was there something we were waiting on with Magnum?

zonca commented 2 years ago

Ok, currently everything seems to be working fine. We can even have 2 default sessions running together:

  Namespace                   Name                                           CPU Requests  CPU Limits  Memory Requests    Memory Limits      Age
  jhub                        jupyter-pibion                                 3 (37%)       0 (0%)      12884901888 (41%)  12884901888 (41%)  67m
  jhub                        jupyter-zonca                                  3 (37%)       0 (0%)      12884901888 (41%)  12884901888 (41%)  27s

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                6475m (81%)        300m (3%)
  memory             26012604416 (83%)  26626319616 (85%)

Let's keep the issue open and monitor for 1 or 2 weeks.

zonca commented 2 years ago

Ok, tested again today; it seems to be working fine.