artifacthub / hub

Find, install and publish Cloud Native packages
https://artifacthub.io
Apache License 2.0

[Helm chart] Add liveness/readiness #1503

Closed PPACI closed 3 years ago

PPACI commented 3 years ago

Is your feature request related to a problem? Please describe. My private instance of artifacthub froze. The web page was responsive (possibly served from cache), but no API calls were working at all. After I killed the pod, it went back to normal.

Describe the solution you'd like A liveness/readiness probe defined on the hub container. I think that checking https://artifacthub.io/docs/api/#/Stats/getArtifactHubStats could be a very good check. In that case, letting the user select the probe timeout and frequency in the Helm values could be useful.

Describe alternatives you've considered Define a probe that just checks that port 8000 is open. This is lighter but doesn't guarantee that the app is working.
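For illustration, the two options could look roughly like the probe definitions below on the hub container. This is only a sketch, not the chart's actual template: the `/api/v1/stats` path is assumed to be the route behind the linked stats API operation, port 8000 comes from the alternative described above, and every timing value is a placeholder.

```yaml
# Option 1 (assumed): HTTP check against the stats API endpoint.
# Confirms the API answers, but runs a real query on every probe.
livenessProbe:
  httpGet:
    path: /api/v1/stats   # assumed route for getArtifactHubStats
    port: 8000
  periodSeconds: 30       # placeholder frequency
  timeoutSeconds: 5       # placeholder timeout
  failureThreshold: 3     # placeholder; failures tolerated before restart

# Option 2 (alternative): only verify that port 8000 accepts connections.
# Lighter, but says nothing about whether API calls succeed.
readinessProbe:
  tcpSocket:
    port: 8000
  periodSeconds: 10       # placeholder frequency
```

Either handler (`httpGet` or `tcpSocket`) can be used with either probe type; the pairing shown here is arbitrary.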

-- I can take care of it through a pull request if it can help.

tegioz commented 3 years ago

Hi @PPACI

My private instance of artifacthub froze. The web page was responsive (possibly served from cache), but no API calls were working at all. After I killed the pod, it went back to normal.

Do you by any chance have the logs from that pod? It'd be great to take a look at them in case they give us a clue about what was going on. We've never seen something like that in artifacthub.io, but if there is a bug causing this, the sooner we catch it the better 😉 Please let us know if you can reproduce this again. Setting the log level to debug would be helpful.

A liveness/readiness probe defined on the hub container. I think that checking https://artifacthub.io/docs/api/#/Stats/getArtifactHubStats could be a very good check. In that case, letting the user select the probe timeout and frequency in the Helm values could be useful.

Define a probe that just checks that port 8000 is open. This is lighter but doesn't guarantee that the app is working.

I'd rather not poll that endpoint often, as the query it runs can be expensive, especially in large deployments. Regarding the port check, please note that the port is something that can be configured in the hub.yaml file, although it hasn't been exposed to the chart yet.

What do you think about allowing the full readinessProbe and livenessProbe blocks to be set from the chart, like we do for resources, without setting any default value for now?
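As a rough sketch of that proposal, the chart values might look like the following. The `hub.deploy` key path is an assumption modelled on how `resources` is exposed, and the override file name is hypothetical; the real key names are whatever the chart ends up using.

```yaml
# values.yaml (hypothetical key layout)
hub:
  deploy:
    resources: {}
    # Empty by default, so no probe is configured unless the user sets one.
    readinessProbe: {}
    livenessProbe: {}
---
# my-values.yaml (hypothetical user override)
hub:
  deploy:
    livenessProbe:
      tcpSocket:
        port: 8000        # must match the port configured for the hub
      periodSeconds: 20   # placeholder frequency
      timeoutSeconds: 3   # placeholder timeout
```

Passing the whole block through keeps the chart agnostic about which handler and timings a deployment needs, and an empty default leaves existing installations unchanged after an upgrade.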

-- I can take care of it through a pull request if it can help.

That sounds great, thanks!

PPACI commented 3 years ago

I think exposing the whole block with sane defaults would be a good compromise. I'll write a PR and attach it to this issue.

Here is the log up to the point where I killed the pod (newest entries first):

2021-08-19 10:08:36
{"level":"info","cmd":"hub","time":"2021-08-19T08:08:36Z","message":"hub server stopped"}
2021-08-19 10:08:36
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":27042.893682,"time":"2021-08-19T08:08:36Z","time":"2021-08-19T08:08:36Z","message":"/api/v1/users/profile"}
2021-08-19 10:08:36
{"level":"error","cmd":"hub","handlers":"user","error":"context canceled","method":"RequireLogin","time":"2021-08-19T08:08:36Z","message":"checkSession failed"}
2021-08-19 10:08:36
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":27042.622174,"time":"2021-08-19T08:08:36Z","time":"2021-08-19T08:08:36Z","message":"/api/v1/packages/stats"}
2021-08-19 10:08:36
{"level":"error","cmd":"hub","handlers":"pkg","error":"context canceled","method":"GetStats","time":"2021-08-19T08:08:36Z"}
2021-08-19 10:08:36
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":27041.686145,"time":"2021-08-19T08:08:36Z","time":"2021-08-19T08:08:36Z","message":"/api/v1/packages/random"}
2021-08-19 10:08:36
{"level":"error","cmd":"hub","handlers":"pkg","error":"context canceled","method":"GetRandom","time":"2021-08-19T08:08:36Z"}
2021-08-19 10:08:23
{"level":"info","cmd":"hub","time":"2021-08-19T08:08:23Z","message":"hub server shutting down.."}
2021-08-19 10:08:09
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":1719,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.176705,"time":"2021-08-19T08:08:09Z","time":"2021-08-19T08:08:09Z","message":"/static/media/logo_v2.png"}
2021-08-19 10:08:09
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":2269,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.239808,"time":"2021-08-19T08:08:09Z","time":"2021-08-19T08:08:09Z","message":"/"}
2021-08-19 10:07:49
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":1810.227695,"time":"2021-08-19T08:07:49Z","time":"2021-08-19T08:07:49Z","message":"/api/v1/users/profile"}
2021-08-19 10:07:49
{"level":"error","cmd":"hub","handlers":"user","error":"context canceled","method":"RequireLogin","time":"2021-08-19T08:07:49Z","message":"checkSession failed"}
2021-08-19 10:07:49
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":1812.390363,"time":"2021-08-19T08:07:49Z","time":"2021-08-19T08:07:49Z","message":"/api/v1/packages/stats"}
2021-08-19 10:07:49
{"level":"error","cmd":"hub","handlers":"pkg","error":"context canceled","method":"GetStats","time":"2021-08-19T08:07:49Z"}
2021-08-19 10:07:49
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":1814.475527,"time":"2021-08-19T08:07:49Z","time":"2021-08-19T08:07:49Z","message":"/api/v1/packages/random"}
2021-08-19 10:07:49
{"level":"error","cmd":"hub","handlers":"pkg","error":"context canceled","method":"GetRandom","time":"2021-08-19T08:07:49Z"}
2021-08-19 10:07:48
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":1719,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.176306,"time":"2021-08-19T08:07:48Z","time":"2021-08-19T08:07:48Z","message":"/static/media/logo_v2.png"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":10969.799508,"time":"2021-08-19T08:07:47Z","time":"2021-08-19T08:07:47Z","message":"/api/v1/packages/stats"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","handlers":"pkg","error":"context canceled","method":"GetStats","time":"2021-08-19T08:07:47Z"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":7390.952132,"time":"2021-08-19T08:07:47Z","time":"2021-08-19T08:07:47Z","message":"/api/v1/stats"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","handlers":"stats","error":"context canceled","method":"Get","time":"2021-08-19T08:07:47Z"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":10969.773507,"time":"2021-08-19T08:07:47Z","time":"2021-08-19T08:07:47Z","message":"/api/v1/users/profile"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","handlers":"user","error":"context canceled","method":"RequireLogin","time":"2021-08-19T08:07:47Z","message":"checkSession failed"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","bytes_in":"","bytes_out":15,"host":"172.19.5.3","method":"GET","port":"","status":500,"took":10970.17832,"time":"2021-08-19T08:07:47Z","time":"2021-08-19T08:07:47Z","message":"/api/v1/packages/random"}
2021-08-19 10:07:47
{"level":"error","cmd":"hub","handlers":"pkg","error":"context canceled","method":"GetRandom","time":"2021-08-19T08:07:47Z"}
2021-08-19 10:07:47
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":2269,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.259008,"time":"2021-08-19T08:07:47Z","time":"2021-08-19T08:07:47Z","message":"/"}
2021-08-19 10:07:37
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":8830,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.161305,"time":"2021-08-19T08:07:37Z","time":"2021-08-19T08:07:37Z","message":"/static/media/logo192.png"}
2021-08-19 10:07:37
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":514,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.216406,"time":"2021-08-19T08:07:37Z","time":"2021-08-19T08:07:37Z","message":"/manifest.json"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":13732,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.205206,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/media/cncf-sandbox-horizontal-color.png"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":14100,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.187006,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/media/tekton-pkg-light.svg"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":13906,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.169405,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/media/tinkerbell-actions-light.svg"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":13466,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.145304,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/media/krew-plugins-light.svg"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":6485,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.237307,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/media/logo/artifacthub-brand-white.svg"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":2950427,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":80.63489,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/js/2.0d28af81.chunk.js"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":731838,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":14.162638,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/js/main.f991ad56.chunk.js"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":275515,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":5.064857,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/css/main.0dabf968.chunk.css"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":14392,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.32641,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/static/css/2.0b5cae37.chunk.css"}
2021-08-19 10:07:36
{"level":"info","cmd":"hub","bytes_in":"","bytes_out":2269,"host":"172.19.5.3","method":"GET","port":"","status":200,"took":0.361212,"time":"2021-08-19T08:07:36Z","time":"2021-08-19T08:07:36Z","message":"/"}

tegioz commented 3 years ago

I think exposing the whole block with sane defaults would be a good compromise.

Cool. Let's leave the default as an empty object for now, if you don't mind. I'd like to make sure we won't break any existing deployments after an update, and there could be something I'm overlooking at the moment 🙂

I'll write a PR and attach it to this issue.

Thanks! Please bump the chart version to 1.1.1-1 (only the chart version, not the app version).

Regarding the logs, is it possible that the hub pod was not able to reach the database instance for some reason (firewall, etc.)? When you shut the pod down, the in-flight requests that got cancelled had been taking a long time (10-20 seconds). What they have in common is that they all need to talk to the database. The requests that did not require talking to the database seem to have been served immediately, though.

PPACI commented 3 years ago

The database was up and running at that time.

But I had restarted the DB a couple of minutes earlier. The scanner ran successfully after the DB restart and before the hub froze.