2i2c-org / infrastructure

Infrastructure for configuring and deploying our community JupyterHubs.
https://infrastructure.2i2c.org
BSD 3-Clause "New" or "Revised" License

Need help with CLI access to SMCE clusters: nasa-veda, nasa-esdis, nasa-ghg #4466

Closed sgibson91 closed 1 month ago

sgibson91 commented 1 month ago

I was due to regenerate my passwords and access keys for these accounts, which I have done. However, when I run deployer exec aws with the new access keys, and then try to list the nodes in the cluster with k get nodes, I get the following error:

E0722 10:00:47.333335    8945 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E0722 10:00:47.715078    8945 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E0722 10:00:48.098226    8945 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E0722 10:00:48.485989    8945 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
E0722 10:00:48.872520    8945 memcache.go:265] couldn't get current server API group list: the server has asked for the client to provide credentials
error: You must be logged in to the server (the server has asked for the client to provide credentials)

No amount of regenerating access keys seems to fix this, and I don't know what has changed on the system, so I can't debug it.

Affected accounts

sgibson91 commented 1 month ago

I tried reinstalling the deployer with pip install -e . and got a different error:

deployer exec aws $CLUSTER_NAME $MFA_ARN $TOPT_CODE

Traceback (most recent call last):
  File "/Users/sgibson/miniconda3/envs/infrastructure/bin/deployer", line 5, in <module>
    from deployer.__main__ import main
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/__main__.py", line 9, in <module>
    import deployer.commands.generate.billing.cost_table  # noqa: F401
  File "/Users/sgibson/source/github/2i2c-org/infrastructure/deployer/commands/generate/billing/cost_table.py", line 3, in <module>
    import pandas as pd
  File "/Users/sgibson/miniconda3/envs/infrastructure/lib/python3.10/site-packages/pandas/__init__.py", line 22, in <module>
    from pandas.compat import is_numpy_dev as _is_numpy_dev  # pyright: ignore # noqa:F401
  File "/Users/sgibson/miniconda3/envs/infrastructure/lib/python3.10/site-packages/pandas/compat/__init__.py", line 25, in <module>
    from pandas.compat.numpy import (
  File "/Users/sgibson/miniconda3/envs/infrastructure/lib/python3.10/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
    from pandas.util.version import Version
  File "/Users/sgibson/miniconda3/envs/infrastructure/lib/python3.10/site-packages/pandas/util/__init__.py", line 2, in <module>
    from pandas.util._decorators import (  # noqa:F401
  File "/Users/sgibson/miniconda3/envs/infrastructure/lib/python3.10/site-packages/pandas/util/_decorators.py", line 14, in <module>
    from pandas._libs.properties import cache_readonly
  File "/Users/sgibson/miniconda3/envs/infrastructure/lib/python3.10/site-packages/pandas/_libs/__init__.py", line 13, in <module>
    from pandas._libs.interval import Interval
  File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

sgibson91 commented 1 month ago

I am now having this issue with the GHG cluster as well, since updating my password and access keys there.

yuvipanda commented 1 month ago

@sgibson91 EKS uses a configmap called aws-auth in the kube-system namespace to manage individual user auth, in addition to the IAM system.

My suggestion is to get access with use-cluster-credentials and look at that configmap, and see if your user credential information matches.

sgibson91 commented 1 month ago

I checked the configmap out using describe. It only really tells me the user ARN, user name and groups, which all match. It doesn't tell me anything about the access keys or whether they are valid. I'm wondering if the problem is in the deployer?
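The configmap check described above can be sketched in Python. This is a minimal sketch, not part of the deployer: the function names are hypothetical, and it assumes kubectl is already pointed at the cluster (for example after use-cluster-credentials) and that the configmap stores its user mappings under the mapUsers key with userarn fields. As noted in the comment, it can only confirm that an IAM user ARN is mapped; it says nothing about whether a given access key is valid.

```python
import json
import re
import subprocess


def fetch_aws_auth_map_users() -> str:
    """Return the raw mapUsers string from the aws-auth ConfigMap.

    Assumes kubectl already has working credentials for the cluster.
    """
    result = subprocess.run(
        ["kubectl", "get", "configmap", "aws-auth",
         "-n", "kube-system", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout).get("data", {}).get("mapUsers", "")


def mapped_user_arns(map_users: str) -> list[str]:
    """Extract every userarn value from a mapUsers YAML string.

    A regex is used instead of a YAML parser to keep the sketch
    dependency-free; it expects unquoted 'userarn: <arn>' lines.
    """
    pattern = r"^[ \t]*-?[ \t]*userarn:[ \t]*(\S+)[ \t]*$"
    return re.findall(pattern, map_users, flags=re.MULTILINE)


def is_user_mapped(map_users: str, user_arn: str) -> bool:
    """Report whether the given IAM user ARN appears in mapUsers."""
    return user_arn in mapped_user_arns(map_users)
```

Calling is_user_mapped(fetch_aws_auth_map_users(), "<your IAM user ARN>") gives a yes/no answer for one user without reading the describe output by eye.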

sgibson91 commented 1 month ago

I resolved the numpy error in https://github.com/2i2c-org/infrastructure/issues/4466#issuecomment-2243397863 by upgrading the Python version from 3.10 to 3.12 and running pip install -e . again, but that did not resolve the original reason why I opened this issue.
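A "numpy.dtype size changed" ValueError generally points to a compiled extension (pandas here) that was built against a different numpy binary interface than the one installed. A quick way to see which versions an environment actually holds, without importing pandas (the import is what fails), is to read the package metadata. This is a hedged sketch; the function name is hypothetical and it only reports versions, it does not decide whether they are compatible.

```python
from importlib import metadata


def installed_versions(packages=("numpy", "pandas")) -> dict:
    """Map each package name to its installed version, or None if absent.

    Reads distribution metadata only, so it works even when importing
    the package itself raises an error.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in the environment before and after a reinstall shows whether the numpy and pandas versions actually changed.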

sgibson91 commented 1 month ago

I'm actually wondering now if running k get nodes is the wrong test to see if deployer exec aws worked. We use that command to do things like running eksctl and terraform, not to access the cluster...

yuvipanda commented 1 month ago

I use aws sts get-caller-identity to check for AWS credentials. I don't think kubectl works with that by default without an aws eks <something> command to get kubernetes credentials.
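The credential check mentioned above can be wrapped in a small script so it runs right after deployer exec aws. This is a minimal sketch with hypothetical function names; it assumes the aws CLI is on PATH in the shell that deployer exec aws opens, and it relies on the Arn field that aws sts get-caller-identity returns in its JSON output. It verifies AWS credentials only, not Kubernetes access.

```python
import json
import subprocess


def parse_caller_identity(raw_json: str) -> str:
    """Return the Arn field from 'aws sts get-caller-identity' JSON output."""
    identity = json.loads(raw_json)
    return identity["Arn"]


def current_caller_arn() -> str:
    """Ask AWS which identity the current credentials belong to.

    Raises subprocess.CalledProcessError if the credentials are
    missing, expired, or otherwise rejected.
    """
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_caller_identity(result.stdout)
```

If current_caller_arn() returns the expected IAM user ARN, the new access keys are working at the AWS level, and any remaining kubectl failure lies in the Kubernetes credentials step instead.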