HDFGroup / hsds

Cloud-native, service based access to HDF data
https://www.hdfgroup.org/solutions/hdf-kita/
Apache License 2.0

Nodes unable to get K8s info #105

Closed: bilalshaikh42 closed this issue 2 years ago

bilalshaikh42 commented 2 years ago

Hello, I am unsure what change on our end caused this problem to start appearing, but the DN and SN nodes seem to be unable to query the K8s API to connect to each other. Here is the error we are getting:

INFO> k8s_update_dn_info
DEBUG> getting pods for namespace: dev
ERROR> Unexpected MaxRetryError exception in doHealthCheck: HTTPConnectionPool(host='localhost', port=80): Max retries exceeded with url: /api/v1/namespaces/dev/pods (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3c35c3d250>: Failed to establish a new connection: [Errno 111] Connection refused'))

We have the deployment scoped to a particular namespace using a Role/RoleBinding instead of a ClusterRole/ClusterRoleBinding, but this was working just fine before. Not sure if this could be caused by any recent updates.
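
For reference, here is a minimal sketch of what I understand the nodes to be attempting with the kubernetes Python client (not the exact hsds code; "dev" is just our namespace). The error above suggests the request is going to localhost:80 rather than the real API server, i.e. the in-cluster configuration is not being applied:

```python
# Sketch only: approximates the namespaced pod lookup the SN/DN nodes perform.
from kubernetes import client, config

config.load_incluster_config()  # should point the client at the in-cluster API server
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(namespace="dev")
for pod in pods.items:
    print(pod.metadata.name, pod.status.pod_ip)
```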

jreadey commented 2 years ago

Not sure - did you update your Kubernetes version? This got reported a few days ago in the HDF Forum:

"dn and sn nodes were resolving to localhost. It might be related to this kubernetes-client issue 3 (https://github.com/kubernetes-client/python/issues/1284). We unsuccessfully tried pinning against a couple of different kubernetes-client versions as described in that link. What did work was an explicit call to k8s_client.Configuration().get_default_copy() in util/k8sclient.py"

I'm looking at dropping the Kubernetes package and just making an http request to the kubernetes server as a simpler approach.

bilalshaikh42 commented 2 years ago

That does sound like the same issue. GKE might have gotten a security update recently.

Another cause might be that the hsds version was not pinned earlier, so the latest pull might have picked up a newer image. I can try pinning the deployment to some of the recent releases and see if that makes a difference, unless that change to the k8sclient.py file can be made and released directly from this repo.

jreadey commented 2 years ago

I've replaced the kubernetes package code in master with HTTP GETs to the Kubernetes API endpoint. Please give it a try and see if that works. You'll need to make a change to the deployment yaml to set the head_port to null. See: https://github.com/HDFGroup/hsds/blob/master/admin/kubernetes/k8s_deployment_aws.yml.
The idea is that if a head container is present, the deployment will work with each pod functioning independently (useful if, say, you only need read functionality). If the head_port is null, the SN will gather all the pod IPs and dispatch DN requests through all pods. In that case you should see the node count that hsinfo reports go up as you scale the number of pods.
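
In case it's useful, here's roughly the shape of the request the nodes make now (a simplified, synchronous sketch using only the standard library; the actual code in master is async and structured differently):

```python
# Simplified sketch: list pods via the Kubernetes API using the pod's
# service-account credentials, without the kubernetes Python package.
import json
import os
import ssl
import urllib.request

SA_DIR = "/var/run/secrets/kubernetes.io/serviceaccount"

def get_pod_ips(namespace):
    host = os.environ["KUBERNETES_SERVICE_HOST"]
    port = os.environ.get("KUBERNETES_SERVICE_PORT", "443")
    with open(f"{SA_DIR}/token") as f:
        token = f.read().strip()
    url = f"https://{host}:{port}/api/v1/namespaces/{namespace}/pods"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    ctx = ssl.create_default_context(cafile=f"{SA_DIR}/ca.crt")
    with urllib.request.urlopen(req, context=ctx) as rsp:
        items = json.load(rsp)["items"]
    return [p["status"]["podIP"] for p in items if p.get("status", {}).get("podIP")]
```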

bilalshaikh42 commented 2 years ago

I'll give it a try. The latest release I see on Docker Hub is v0.7.0beta6, but that was last pushed a few days ago. Are the changes available as a Docker image?

jreadey commented 2 years ago

I just put out an image with the tag v0.7.0beta7.

bilalshaikh42 commented 2 years ago

The service is working, and the pods are able to get the IPs successfully. I am getting the following warning, though:

WARN> expected to find f:status key but got: dict_keys(['f:metadata', 'f:spec'])

jreadey commented 2 years ago

Great! You can ignore the warning. I pushed out an updated image (same tag) with the logging cleaned up.
It was quite a chore spelunking through the k8s metadata json, so I had a ton of log output initially.

bilalshaikh42 commented 2 years ago

Got it. Thank you very much!