Open gelbal opened 3 years ago
[gelbal] This issue has attached support thread https://jel.ly.fish/6a034c75-3e9b-4846-9392-9103c94eaf3e
I took a look at the linked KB entry and edited it to be slightly more clear.
-- Start context dump --
To summarize from that KB entry, this error occurs with an incorrectly specified user docker-compose.yml which contains both network_mode: host and port mappings. A container with the host network can access all host ports, and a container with port mapping(s) (used with network_mode: bridge, for example) might specify host ports to map to container-internal ports, which results in the "port is already allocated" conflict.
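As a rough illustration of this first case, here is a minimal sketch that flags a compose file containing both host networking and port mappings. It assumes the file has already been parsed into a dictionary (for example by a YAML parser). The function name is hypothetical and is not part of any balena module. Because the KB summary does not say whether the two settings must be on the same service, the sketch reports both lists and leaves the decision to the caller.

```python
from typing import Any, Dict, List, Tuple


def host_mode_and_port_mappings(compose: Dict[str, Any]) -> Tuple[List[str], List[str]]:
    """Return (services using network_mode: host, services declaring ports).

    Hypothetical helper: `compose` is an already-parsed docker-compose.yml.
    """
    services = compose.get("services") or {}
    host_mode = sorted(
        name for name, svc in services.items()
        if (svc or {}).get("network_mode") == "host"
    )
    mapped = sorted(
        name for name, svc in services.items()
        if (svc or {}).get("ports")
    )
    return host_mode, mapped


# Illustrative input only; service names are made up.
example = {
    "services": {
        "gateway": {"network_mode": "host"},
        "web": {"ports": ["80:80"]},
        "worker": {},
    }
}

host_mode, mapped = host_mode_and_port_mappings(example)
# The file is suspicious when both lists are non-empty.
suspicious = bool(host_mode) and bool(mapped)
print(host_mode, mapped, suspicious)
```

A check like this can only warn: whether the ports actually collide depends on what the host-mode container binds at runtime.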
There is also a case where the error occurs without network_mode: host, but there is not enough information in the KB thread to determine the cause. I've added a suggestion in the KB for support agents to ask the user whether there are any mapped port conflicts in their docker-compose.yml, but without a more detailed description of a situation where this occurs, it would be counterproductive to discuss this second case here. (However, a possible cause could be the mapped port conflicts I mentioned.)
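For the mapped port conflicts mentioned as a possible cause, a support agent or a validator could look for the same host port published by more than one service. The following is a sketch under stated assumptions, not an existing tool: the helper names are hypothetical, the compose file is assumed to be parsed into a dictionary already, port ranges and bracketed IPv6 addresses are not handled, and the bind address is ignored, so two mappings on different host IPs would be reported even though they may not collide.

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Tuple


def host_port(entry: Any) -> Optional[Tuple[str, str]]:
    """Return (published host port, protocol) for one `ports` entry, or None.

    Handles the short string forms "HOST:CONTAINER" and "IP:HOST:CONTAINER"
    (with an optional "/proto" suffix) and the long mapping form with a
    "published" key. Ranges are treated as opaque strings (a simplification).
    """
    if isinstance(entry, dict):
        published = entry.get("published")
        if published is None:
            return None
        return str(published), str(entry.get("protocol", "tcp"))
    spec, _, proto = str(entry).partition("/")
    parts = spec.split(":")
    if len(parts) < 2:
        return None  # container port only; no fixed host port is requested
    return parts[-2], proto or "tcp"


def find_duplicate_host_ports(compose: Dict[str, Any]) -> Dict[Tuple[str, str], List[str]]:
    """Map each (host port, protocol) published more than once to its services."""
    users: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, svc in (compose.get("services") or {}).items():
        for entry in (svc or {}).get("ports") or []:
            key = host_port(entry)
            if key is not None:
                users[key].append(name)
    return {key: names for key, names in users.items() if len(names) > 1}


# Illustrative input only; service names are made up.
example = {
    "services": {
        "api": {"ports": ["9000:9000"]},
        "proxy": {"ports": ["80:80", "0.0.0.0:9000:8080/tcp"]},
    }
}
duplicates = find_duplicate_host_ports(example)
print(duplicates)
```

This only catches conflicts that are visible in the file itself; it would not explain cases where the compose file has no overlapping mappings.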
In the ticket linked above, the user is experiencing the first error case, where both host mode and port mappings are present.
-- End context dump --
An invalid user docker-compose.yml is not the Supervisor's responsibility to fix and recover from, but I think better error communication could reduce the user's friction. I don't believe the Supervisor is the best place for this communication, because it's better UX for the user to receive their validation error before waiting for push, build, deploy, download, and update on their device.
The best place in my opinion is the CLI (or balenaAPI), wherever the user is pushing an app from. We have a module resin-compose-parse which could be extended to validate if needed.
This issue is relatively low in severity though, since it's user-originated and user-fixable.
[cywang117] This issue has attached support thread https://jel.ly.fish/d67cfe53-204a-4371-82e6-773df31158f4
[ab77] This issue has attached support thread https://jel.ly.fish/3f2b1d4e-8cb1-4ee4-98e5-2fe813b9437d
[cywang117] This issue has attached support thread https://jel.ly.fish/a46057e3-6864-43cd-8792-6a4320f43bcc
I'm seeing this happen randomly - given a deploy to ~100 devices, usually 1 or 2 of them get stuck like this. I wouldn't call this a docker-compose issue. Host mode network is not used at all. There are no conflicting mappings.
Jan 18 10:44:10 5888c56 resin-supervisor[2412]: [error] Scheduling another update attempt in 256000ms due to failure: Error: Failed to apply state transition steps. (HTTP code 500) server error - driver failed programming external connectivity on endpoint redacted_4448611_2042345 (c3400672f286f23c30242e064a7cf8f144373c1105a9febb01b20f8f9da9bd41): Bind for 0.0.0.0:9000 failed: port is already allocated Steps:["start","start","start","start","start","start","start","start"]
For the specific device on which this failed, the application doesn't actually listen on port 9000 (and no one tries to connect to it).
Sometimes this also happens to another container that listens on port 80 (which receives balena-tunneled connections sometimes). That's the only other port mapping.
Sadly, I can't provide more details. This is more likely to happen on production devices (because of how many there are), and since the application is down while it's in this state, we only have a few minutes to investigate. So far, the only way we know to get it running again is to reboot.
[gantonayde] This issue has attached support thread https://jel.ly.fish/580d81b2-398d-4f92-a616-bad814e07b91
As a balena user, I would like my device not to get stuck in a download error loop when a port is already allocated.
While on support, I saw the following logs in a device supervisor repeating:
This "port is already allocated" part led me to this knowledge base entry: https://jel.ly.fish/4dfdf053-e226-4ec2-8143-fbf54af81666
I'm opening an issue here in case my fellow supervisor mates can think of an improvement. Could the supervisor be proactive and prevent the device from getting into such a state?