canonical / edgex-snap-testing

4 stars 2 forks source link

Bind address already in use #118

Open farshidtz opened 2 years ago

farshidtz commented 2 years ago

Tests occasionally fail on Github because the bind address with default ports isn’t available. It isn’t clear if there is an error in services or if the ports are used by some other processes. We’ve seen this for 59880 (core data), 59882 (core command).

When this happens, the services restart very quickly and then disable. This is because the default restart delay from systemd is 100ms without any increments, unlike docker. See https://github.com/systemd/systemd/issues/6129

The attached logs shows core command restarting 9 times in 7 seconds before being permanently stopped:

Aug 12 14:02:29 fv-az41-871 edgexfoundry.core-command[5432]: level=ERROR ts=2022-08-12T14:02:29.075056424Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
...
Aug 12 14:02:30 fv-az41-871 systemd[1]: snap.edgexfoundry.core-command.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 14:02:30 fv-az41-871 systemd[1]: snap.edgexfoundry.core-command.service: Failed with result 'exit-code'.
Aug 12 14:02:30 fv-az41-871 systemd[1]: snap.edgexfoundry.core-command.service: Scheduled restart job, restart counter is at 9.
Aug 12 14:02:30 fv-az41-871 systemd[1]: Stopped Service for snap application edgexfoundry.core-command.
Aug 12 14:02:30 fv-az41-871 systemd[1]: snap.edgexfoundry.core-command.service: Start request repeated too quickly.
Aug 12 14:02:30 fv-az41-871 systemd[1]: snap.edgexfoundry.core-command.service: Failed with result 'exit-code'.
Aug 12 14:02:30 fv-az41-871 systemd[1]: Failed to start Service for snap application edgexfoundry.core-command.

Example test result and logs:

Related issue in tests: #79, where test port 33333 is not available.

farshidtz commented 2 years ago

Maybe the ports are temporarily unavailable and we don’t give the chance for them to become available. Adding a restart delay in snap (snapcraft's apps.<app-name>.restart-delay) or to the service (EDGEX_STARTUP_INTERVAL env var) could help.

farshidtz commented 1 year ago

Got the same error locally. The port wasn't used by any process when I checked but it looks like core-command restarted 10 times within 10 seconds before (systemd) giving up.

$ sudo journalctl -n 1000 | grep "59882: bind: address already in use"
Sep 27 18:54:43 farshid-cirrus7 edgexfoundry.core-command[122356]: level=ERROR ts=2022-09-27T16:54:43.442277382Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:44 farshid-cirrus7 edgexfoundry.core-command[122598]: level=ERROR ts=2022-09-27T16:54:44.75672666Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:44 farshid-cirrus7 edgexfoundry.core-command[122646]: level=ERROR ts=2022-09-27T16:54:44.96466177Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:46 farshid-cirrus7 edgexfoundry.core-command[122701]: level=ERROR ts=2022-09-27T16:54:46.217695543Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:47 farshid-cirrus7 edgexfoundry.core-command[122750]: level=ERROR ts=2022-09-27T16:54:47.461987276Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:48 farshid-cirrus7 edgexfoundry.core-command[122799]: level=ERROR ts=2022-09-27T16:54:48.747142862Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:49 farshid-cirrus7 edgexfoundry.core-command[122852]: level=ERROR ts=2022-09-27T16:54:49.951141356Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:51 farshid-cirrus7 edgexfoundry.core-command[122904]: level=ERROR ts=2022-09-27T16:54:51.224830663Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:52 farshid-cirrus7 edgexfoundry.core-command[122952]: level=ERROR ts=2022-09-27T16:54:52.487154361Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:53 farshid-cirrus7 edgexfoundry.core-command[123001]: level=ERROR ts=2022-09-27T16:54:53.748887177Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
MonicaisHer commented 1 year ago

It happens again in GitHub action test. core-data port 59880 is busy. There is no restart attempt:

Oct 04 05:56:26 fv-az47-924 edgexfoundry.core-data[5145]: level=ERROR ts=2022-10-04T05:56:26.372629665Z app=core-data source=httpserver.go:147 msg="Web server failed: listen tcp 127.0.0.1:59880: bind: address already in use"
farshidtz commented 1 year ago

Another solution is to start the failed services manually, while waiting 180 seconds for the them to become available.

MonicaisHer commented 1 year ago

Port 44444 is occasionally occupied as well:

app=device-virtual source=httpserver.go:147 msg="Web server failed: listen tcp 127.0.0.1:44444: bind: address already in use"

logs: https://github.com/canonical/edgex-snap-testing/actions/runs/4946211873/jobs/8843978702?pr=204#step:5:214

MonicaisHer commented 1 year ago

Port 44444 is not available: https://github.com/canonical/edgex-snap-testing/pull/207#issuecomment-1547528977

farshidtz commented 1 year ago

We may be able to find the root cause by adding a check before starting services to:

farshidtz commented 1 year ago

Since the services are now all disabled by default, we have the chance to set EDGEX_STARTUP_INTERVAL before starting them for the first time, to slow down the restart intervals suggested in https://github.com/canonical/edgex-snap-testing/issues/118#issuecomment-1214683958. The downside of doing that is the reliance on a feature which gets tested only later on as part of the same testing suite.