Stuck in **starting** state

mousetwentytwo commented 11 months ago

The add-on may appear stuck in the starting state. Turning the watchdog off is advised in this case.

It looks like the healthcheck was introduced with port 8888 hardcoded in an HTTP curl call. However, if the HTTP service is enabled it starts on 8889, and by default there is a TCP service on 8888.

Related: Originally posted by @mousetwentytwo in https://github.com/LukasGrebe/ha-addons/issues/60#issuecomment-1637310926

Healthcheck code: https://github.com/LukasGrebe/ha-addons/blob/5dd56311f043f9238f1a3895d40f9365dd0eed21/ebusd/Dockerfile#L21

Not sure about the cause; it may be unrelated to HTTP.

mainmind83 commented 11 months ago

Same problem here, after upgrading to 23.2.1.

ech0-py commented 11 months ago

The same here

Update: it seems that after --interval=5m the container goes into the unhealthy state, and HA then treats it as running (watchdog is off).

Danit2 commented 11 months ago

Same problem here. And with the watchdog on, you get a reboot every 15 minutes.

LukasGrebe commented 11 months ago

Unfortunately I cannot work on the code until about mid-August. That said, two thoughts:

Regarding @mousetwentytwo's suggestion referenced above, would it be a good idea to check whether the daemon is up and running? Maybe by checking for a known result of an ebusctl call?
Feel free to submit a pull request. I'm new to this too and need to read the docs and learn how this works…

Thank you for raising this issue!

cociweb commented 11 months ago

Hello, some words about the current health check. The healthcheck was introduced with #54, as seen here.

Docker containers have no explicit "starting" state. They have 'created' and 'running' states (among others); in our case the container is in the running state:

$ docker inspect -f '{{.State.Status}}' addon_12341234_ebusd
running
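
For reference, Docker reports container health separately from the container state: a container whose image defines a HEALTHCHECK also has a health status of starting, healthy, or unhealthy, which is presumably the "starting" shown in the UI. A minimal sketch for reading it, assuming the same placeholder container name as above; the helper name health_status is only illustrative and not part of the add-on:

```shell
# Print the health status of a container that defines a HEALTHCHECK.
# Possible values are "starting", "healthy", and "unhealthy".
# health_status is a hypothetical helper name, not part of the add-on.
health_status() {
    docker inspect -f '{{.State.Health.Status}}' "$1"
}

# Usage (the container name is a placeholder):
# health_status addon_12341234_ebusd
```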

The problem first appears when the container starts and, after 5 minutes, there is no proper response to the curl command on http://127.0.0.1:8888, as described here: https://github.com/LukasGrebe/ha-addons/blob/5dd56311f043f9238f1a3895d40f9365dd0eed21/ebusd/Dockerfile#L19C1-L21C50

I assume that ebusd is listening on port 8888 and that it accepts only HTTP/0.9 requests (because all others fail).

So, after entering the container with docker exec -it addon_12341234_ebusd /bin/bash, you can easily check the curl command:

$ curl --fail http://127.0.0.1:8888
curl: (1) Received HTTP/0.9 when not allowed

After narrowing the request down to HTTP/0.9, you get a different error and curl hangs:

$ curl --http0.9 --fail-with-body http://127.0.0.1:8888
ERR: command not found

(Additionally, you can eliminate the hang with the '--max-time 1' parameter, but that does not solve the problem.)

Anyway, the ultimate goal should be any non-error (200 OK) response from ebusd via HTTP. I got stuck here: I cannot get any prompt info from the daemon, either on the TCP client (8888) or on the HTTP client (8889), after authentication. So I think this (correct) direction is a dead end; moreover, these two ports are user configurable...

I assume that we are not able to check the health of the ebusd service via HTTP requests. As a workaround, we can check the status/availability of the container by using another service. I would recommend an additional lightweight HTTP service (Lighttpd or nginx) where we can curl/wget a dummy HTTP 200 answer on localhost on another port, or, even simpler, a dummy shell script which always returns 0 (https://docs.docker.com/engine/reference/builder/#healthcheck)...

Additionally, don't forget that the current image contains curl 8.1.2, which has several CVEs, so it should be updated to at least version 8.2.1 as soon as possible.

ech0-py commented 10 months ago

> I cannot get any prompt info from the daemon neither on TCP client (8888) nor on http client(8889) after authentication

For TCP try echo "INFO" | nc localhost 8888

version: ebusd 23.2.p20230716
update check: revision 23.2 available
device: 192.168.88.112:9999
signal: acquired
symbol rate: 23
max symbol rate: 96
min arbitration micros: 2
max arbitration micros: 49
min symbol latency: 5
max symbol latency: 57
scan: finished
... <cropped>...

For HTTP it's curl http://localhost:8889/datatypes

  {"type": "BCD", "isbits": false, "isadjustable": false, "isignored": false, "isreverse": false, "length": 1, "result": "number"},
  {"type": "BCD:2", "isbits": false, "isadjustable": false, "isignored": false, "isreverse": false, "length": 2, "result": "number"}
... <cropped>...

I believe all we need is to change the HEALTHCHECK to curl --fail http://127.0.0.1:8889/datatypes || exit 1 to prove that ebusd is still alive. However, --httpport=8889 is mandatory in that case; it is present by default, but the user can remove it and thereby break the healthcheck.

The other way is to check via TCP, but I'm not sure what should indicate the daemon's health (the "signal" status?).
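
One possible reading of the TCP approach, sketched under the assumption that a "signal: acquired" line in the INFO output (as in the sample above) is a reasonable liveness indicator; whether that is the right criterion remains an open question. The function name signal_acquired is hypothetical:

```shell
# Succeed only if the text on stdin contains a line starting with "signal: acquired".
# Assumption: this line of the INFO output reflects a working eBUS connection.
signal_acquired() {
    grep -q '^signal: acquired'
}

# Usage inside the container:
# echo "INFO" | nc localhost 8888 | signal_acquired || exit 1
```

If nc keeps the connection open, as curl did in the earlier test, a timeout option would probably be needed as well.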

Unfortunately, I'm not familiar with HA add-ons, so I don't know how to test either approach.

cociweb commented 10 months ago

Well, following @ech0-py's suggestion, the healthcheck can be done with nc as well (instead of curl). My proposal based on that suggestion is:

HEALTHCHECK --interval=5m --timeout=3s \
   CMD nc -z localhost 8888 || exit 1

I've not tried it, but it should work. In this case port 8889 is not necessary.
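
Because the TCP port is user configurable, a variant that takes the port as a parameter may be more robust. This is only a sketch: the EBUSD_PORT variable and the check_port name are assumptions and not something the add-on defines, and it relies on the nc build in the image supporting -z.

```shell
# Return success if something accepts TCP connections on the given local port.
# EBUSD_PORT is a hypothetical variable; 8888 is the default TCP port mentioned above.
check_port() {
    nc -z localhost "${1:-${EBUSD_PORT:-8888}}"
}

# Usage as a healthcheck command:
# check_port || exit 1
```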

LukasGrebe commented 9 months ago

@mousetwentytwo could you check if the problems persist post merge of @cociweb's fix?

tjorim commented 9 months ago

It's still there: the fix does not change anything as port 8888 is only enabled when the option to expose the http server is set.

23-09-24 21:16:13 WARNING (MainThread) [supervisor.addons.addon] Timeout while waiting for addon eBUSd to start, took more then 120 seconds

cociweb commented 9 months ago

@tjorim, have you tried restarting the Supervisor? The fix solved it for me, and the add-on has been healthy for hours now. Since the healthcheck runs inside the Docker container, there is no need to expose any ports. My add-on also appears healthy from HA. It's worth restarting Supervisor and HA Core.

If the Supervisor restart does not resolve your problem, maybe your Supervisor is trying to reach a dead or renamed Docker container. In that case, please try reinstalling the add-on; maybe something got messed up on your side. (As mentioned above, 8888 is used for the TCP service by default, while the HTTP service is optional and uses 8889 by default. Since the TCP service always runs and the container netcats its own localhost, no network configuration beyond the defaults is needed.)

Danit2 commented 9 months ago

For me it works. But you must restart your system or the supervisor.
Thanks for the work.

ech0-py commented 9 months ago

Yep, the fix works, but keep in mind that you have to wait 5 minutes until the container is reported as alive, because of HEALTHCHECK --interval=5m; until then you'll see the "starting" status and a spinner in the UI.

LukasGrebe commented 7 months ago

@ech0-py should we reduce the interval to, say, 10s, or close this ticket as resolved?

cociweb commented 7 months ago

Well, I also ran into this 5-minute delay today. In the next PR we can add a function where the first query is issued after the first 90 seconds. (In my opinion, at least 1 minute is required for startup on slower environments, at least after a fresh install...) My recommendation is to keep 5 minutes as the default interval.

LukasGrebe / ha-addons

Stuck in starting state #61