Open farshidtz opened 2 years ago
Maybe the ports are temporarily unavailable and we don’t give the chance for them to become available. Adding a restart delay in snap (snapcraft's apps.<app-name>.restart-delay
) or to the service (EDGEX_STARTUP_INTERVAL
env var) could help.
Got the same error locally. The port wasn't used by any process when I checked but it looks like core-command restarted 10 times within 10 seconds before (systemd) giving up.
$ sudo journalctl -n 1000 | grep "59882: bind: address already in use"
Sep 27 18:54:43 farshid-cirrus7 edgexfoundry.core-command[122356]: level=ERROR ts=2022-09-27T16:54:43.442277382Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:44 farshid-cirrus7 edgexfoundry.core-command[122598]: level=ERROR ts=2022-09-27T16:54:44.75672666Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:44 farshid-cirrus7 edgexfoundry.core-command[122646]: level=ERROR ts=2022-09-27T16:54:44.96466177Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:46 farshid-cirrus7 edgexfoundry.core-command[122701]: level=ERROR ts=2022-09-27T16:54:46.217695543Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:47 farshid-cirrus7 edgexfoundry.core-command[122750]: level=ERROR ts=2022-09-27T16:54:47.461987276Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:48 farshid-cirrus7 edgexfoundry.core-command[122799]: level=ERROR ts=2022-09-27T16:54:48.747142862Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:49 farshid-cirrus7 edgexfoundry.core-command[122852]: level=ERROR ts=2022-09-27T16:54:49.951141356Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:51 farshid-cirrus7 edgexfoundry.core-command[122904]: level=ERROR ts=2022-09-27T16:54:51.224830663Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:52 farshid-cirrus7 edgexfoundry.core-command[122952]: level=ERROR ts=2022-09-27T16:54:52.487154361Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
Sep 27 18:54:53 farshid-cirrus7 edgexfoundry.core-command[123001]: level=ERROR ts=2022-09-27T16:54:53.748887177Z app=core-command source=httpserver.go:146 msg="Web server failed: listen tcp 127.0.0.1:59882: bind: address already in use"
It happens again in GitHub action test. core-data port 59880 is busy. There is no restart attempt:
Oct 04 05:56:26 fv-az47-924 edgexfoundry.core-data[5145]: level=ERROR ts=2022-10-04T05:56:26.372629665Z app=core-data source=httpserver.go:147 msg="Web server failed: listen tcp 127.0.0.1:59880: bind: address already in use"
Another solution is to start the failed services manually, while waiting 180 seconds for the them to become available.
Port 44444 is occasionally occupied as well:
app=device-virtual source=httpserver.go:147 msg="Web server failed: listen tcp 127.0.0.1:44444: bind: address already in use"
Port 44444 is not available: https://github.com/canonical/edgex-snap-testing/pull/207#issuecomment-1547528977
We may be able to find the root cause by adding a check before starting services to:
Since the services are now all disabled by default, we have the chance to set EDGEX_STARTUP_INTERVAL before starting them for the first time, to slow down the restart intervals suggested in https://github.com/canonical/edgex-snap-testing/issues/118#issuecomment-1214683958. The downside of doing that is the reliance on a feature which gets tested only later on as part of the same testing suite.
Tests occasionally fail on Github because the bind address with default ports isn’t available. It isn’t clear if there is an error in services or if the ports are used by some other processes. We’ve seen this for 59880 (core data), 59882 (core command).
When this happens, the services restart very quickly and then disable. This is because the default restart delay from systemd is 100ms without any increments, unlike docker. See https://github.com/systemd/systemd/issues/6129
The attached logs shows core command restarting 9 times in 7 seconds before being permanently stopped:
Example test result and logs:
Related issue in tests: #79, where test port 33333 is not available.