balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

Device gets stuck in download loop with error `port is already allocated` #1634

Open gelbal opened 3 years ago

gelbal commented 3 years ago

As a balena user, I would like my device not to get stuck in a download error loop when a port is already allocated.


While on support, I saw the following logs repeating in a device's supervisor:

Mar 29 11:58:01 resin-supervisor[2814]: [error]   Updating failed, but there's another update scheduled immediately:  Error: Failed to apply state transition steps. (HTTP code 500) server error - driver failed programming external connectivity on endpoint localFeedForwarder_3408910_1734280 (edf02eb838e0e0f98c4a470051eb8a6c302ad2bfd1fc9c98a62c1ff5e76d19d4): Bind for 0.0.0.0:65505 failed: port is already allocated  Steps:["kill","start","fetch","start","fetch"]
Mar 29 11:58:01 resin-supervisor[2814]: [error]         at fn (/usr/src/app/dist/app.js:6:8488)
Mar 29 11:58:01 resin-supervisor[2814]: [error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)
Mar 29 11:58:01 resin-supervisor[2814]: [error]   Device state apply error Error: Failed to apply state transition steps. (HTTP code 500) server error - driver failed programming external connectivity on endpoint localFeedForwarder_3408910_1734280 (edf02eb838e0e0f98c4a470051eb8a6c302ad2bfd1fc9c98a62c1ff5e76d19d4): Bind for 0.0.0.0:65505 failed: port is already allocated  Steps:["kill","start","fetch","start","fetch"]
Mar 29 11:58:01 resin-supervisor[2814]: [error]         at fn (/usr/src/app/dist/app.js:6:8488)
Mar 29 11:58:01 resin-supervisor[2814]: [error]       at processTicksAndRejections (internal/process/task_queues.js:97:5)

The `port is already allocated` part led me to this knowledge base entry: https://jel.ly.fish/4dfdf053-e226-4ec2-8143-fbf54af81666


I'm opening an issue here in case my fellow supervisor mates can think of an improvement. Could the supervisor be proactive and prevent the device from getting into such a state?

jellyfish-bot commented 3 years ago

[gelbal] This issue has attached support thread https://jel.ly.fish/6a034c75-3e9b-4846-9392-9103c94eaf3e

cywang117 commented 3 years ago

I took a look at the linked KB entry and edited it to be slightly more clear.

-- Start context dump --

To summarize that KB entry: this error occurs when a user's docker-compose.yml is incorrectly specified and contains both `network_mode: host` and port mappings. A container on the host network can access all host ports, while a container with port mappings (used with `network_mode: bridge`, for example) may map host ports to container-internal ports, which results in the `port is already allocated` conflict.
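The first case could be caught before deploy with a check along these lines. This is a minimal sketch, assuming the docker-compose.yml has already been parsed into a plain dict; `services_with_host_mode_and_ports` is a hypothetical helper name, not an existing Supervisor, CLI, or resin-compose-parse function.

```python
from typing import List


def services_with_host_mode_and_ports(composition: dict) -> List[str]:
    """Return names of services that set network_mode: host and also declare ports.

    `composition` is assumed to be the already-parsed content of a
    docker-compose.yml (hypothetical input shape for this sketch).
    """
    offenders = []
    for name, service in (composition.get("services") or {}).items():
        service = service or {}
        # Flag the combination described in the KB entry.
        if service.get("network_mode") == "host" and service.get("ports"):
            offenders.append(name)
    return sorted(offenders)
```

An empty result means no service combines host networking with port mappings; any returned name is a candidate for the misconfiguration described above.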

There is also a case where the error occurs without `network_mode: host`, but there is not enough information in the KB thread to determine the cause. I've added a suggestion to the KB for support agents to ask the user whether there are any mapped port conflicts in their docker-compose.yml, but without a more detailed report of this situation, it would be counterproductive to discuss this second case here. (However, a possible cause could be the mapped port conflicts I mentioned.)
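The question about mapped port conflicts could be answered mechanically. Below is a minimal sketch, assuming the compose file is already parsed into a dict and that ports use the short string syntax with single ports (`HOST:CONTAINER`, `IP:HOST:CONTAINER`, optional `/proto`). Port ranges, IPv6 bind addresses, and the long mapping syntax are deliberately not handled, and the bind IP is ignored. `host_port_key` and `find_host_port_conflicts` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def host_port_key(mapping: str) -> Optional[Tuple[str, str]]:
    """Return (host_port, protocol) for a short-syntax mapping, or None."""
    spec, _, proto = mapping.partition("/")
    parts = spec.split(":")
    if len(parts) < 2:
        return None  # container port only: no fixed host port requested
    # HOST:CONTAINER -> parts[0]; IP:HOST:CONTAINER -> parts[1]
    return parts[-2], proto or "tcp"


def find_host_port_conflicts(composition: dict) -> Dict[Tuple[str, str], List[str]]:
    """Map each (host_port, protocol) claimed more than once to the claiming services."""
    users = defaultdict(list)
    for name, service in (composition.get("services") or {}).items():
        for mapping in (service or {}).get("ports") or []:
            if not isinstance(mapping, (str, int)):
                continue  # long syntax is out of scope for this sketch
            key = host_port_key(str(mapping))
            if key is not None:
                users[key].append(name)
    return {key: names for key, names in users.items() if len(names) > 1}
```

An empty dict means no two short-syntax mappings ask for the same host port and protocol. It does not rule out conflicts through ranges or long-syntax entries.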

In the ticket linked above, the user is experiencing the first error case, where both host mode and port mappings are present.

-- End context dump --

It is not the Supervisor's responsibility to fix or recover from an invalid user docker-compose.yml, but I think better error communication could reduce friction for the user. I don't believe the Supervisor is the best place for this communication, because it's better UX for the user to receive the validation error before waiting for push, build, deploy, download, and update on their device.

The best place, in my opinion, is the CLI (or balenaAPI), wherever the user is pushing the app from. We have a module, resin-compose-parse, which could be extended to perform this validation if needed.

This issue is relatively low in severity though, since it's user-originated and user-fixable.

jellyfish-bot commented 3 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/d67cfe53-204a-4371-82e6-773df31158f4

jellyfish-bot commented 2 years ago

[ab77] This issue has attached support thread https://jel.ly.fish/3f2b1d4e-8cb1-4ee4-98e5-2fe813b9437d

jellyfish-bot commented 2 years ago

[cywang117] This issue has attached support thread https://jel.ly.fish/a46057e3-6864-43cd-8792-6a4320f43bcc

dequis commented 2 years ago

I'm seeing this happen randomly - given a deploy to ~100 devices, usually 1 or 2 of them get stuck like this. I wouldn't call this a docker-compose issue. Host mode network is not used at all. There are no conflicting mappings.

Jan 18 10:44:10 5888c56 resin-supervisor[2412]: [error] Scheduling another update attempt in 256000ms due to failure: Error: Failed to apply state transition steps. (HTTP code 500) server error - driver failed programming external connectivity on endpoint redacted_4448611_2042345 (c3400672f286f23c30242e064a7cf8f144373c1105a9febb01b20f8f9da9bd41): Bind for 0.0.0.0:9000 failed: port is already allocated Steps:["start","start","start","start","start","start","start","start"]

For the specific device on which this failed, the application doesn't actually listen on port 9000 (and no one tries to connect to it).

Sometimes this also happens to another container that listens on port 80 (which receives balena-tunneled connections sometimes). That's the only other port mapping.

Sadly, I can't provide more details. This is more likely to happen on production devices (because of their number), and since the application is down while in this state, we only have a few minutes to investigate. So far, the only way we know to get it running again is to reboot.
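One check that might fit in a short investigation window is whether anything actually holds the port at the OS level. This is a minimal sketch, assuming Python is available in the host network namespace of the affected device (which is not guaranteed); `host_port_is_free` is a hypothetical helper. If it returns True while the engine still reports `port is already allocated`, that may hint at an allocation held inside the container engine rather than by a listening process, but that interpretation is an assumption and is not confirmed anywhere in this thread.

```python
import socket


def host_port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP socket can be bound to host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # No SO_REUSEADDR: an existing listener makes this bind fail.
            sock.bind((host, port))
        except OSError:
            return False
    return True
```

For the log above, the call would be `host_port_is_free(9000)`. The socket is closed again as soon as the function returns.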

jellyfish-bot commented 2 years ago

[gantonayde] This issue has attached support thread https://jel.ly.fish/580d81b2-398d-4f92-a616-bad814e07b91