balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io
Other
148 stars 63 forks source link

Supervisor sometimes fails to start up container with mount error #1131

Open CameronDiver opened 4 years ago

CameronDiver commented 4 years ago

The error itself is:

61   │ [error]   Updating failed, but there's another update scheduled immediately:  Error: (HTTP code 500) server error - mount already exists with name 
62   │ [error]         at /usr/src/app/dist/app.js:596:111352
63   │ [error]       at /usr/src/app/dist/app.js:596:111315
64   │ [error]       at m.buildPayload (/usr/src/app/dist/app.js:596:111325)
65   │ [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:596:110825)
66   │ [error]       at IncomingMessage.emit (events.js:194:15)
67   │ [error]       at endReadableNT (_stream_readable.js:1125:12)
68   │ [error]       at process._tickCallback (internal/process/next_tick.js:63:19)

It's unclear why this happens, and seems to be something to do with mounting on top of existing files, but this should be supported. Definitely more investigation needed.

balena-ci commented 4 years ago

[majorz] This issue has attached support thread https://jel.ly.fish/#/support-thread~6c39ae5b-859b-4a74-aafa-73bf4998e520

adamshapiro0 commented 3 years ago

@CameronDiver was this ever resolved? Unfortunately there's no context here or link to a forum post or whatever, so I don't know what the original circumstances were, but we just hit this error on a customer device, though in our case it it was during a restart-service, not a software update.

We have a control application in one service that starts the main application via a restart-service request shortly after the device boots -- there's a 30 second wait to get around race conditions we've seen previously where the supervisor isn't running yet after the container has started. In this case, the device did the following:

Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service exit {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331>
Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service stop {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331>
Apr 30 00:44:38 4c02e8e resin-supervisor[2642]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765>
Apr 30 00:44:38 4c02e8e 926b40eabc4a[1074]: [event]   Event: Service install {"service":{"appId":1623312,"serviceId":494587,"serviceName":"nautilus","releaseId":1765331}}
Apr 30 00:44:39 4c02e8e balenad[1074]: time="2021-04-30T00:44:39.057786603Z" level=error msg="Handler for POST /containers/create returned error: mount already exists wi>
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]   Error on POST /v2/applications/1623312/restart-service:  Error: (HTTP code 500) server error - mount already exists>
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]         at /usr/src/app/dist/app.js:14:1391724
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]   Error on POST /v2/applications/1623312/restart-service:  Error: (HTTP code 500) server error - mount already ex>
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]         at /usr/src/app/dist/app.js:14:1391724
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at /usr/src/app/dist/app.js:14:1391656
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:14:1391676)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:14:1390929)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at /usr/src/app/dist/app.js:14:1391656
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at Modem.buildPayload (/usr/src/app/dist/app.js:14:1391676)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:14:1390929)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at IncomingMessage.emit (events.js:322:22)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at endReadableNT (_stream_readable.js:1187:12)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at IncomingMessage.emit (events.js:322:22)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at endReadableNT (_stream_readable.js:1187:12)
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [error]       at processTicksAndRejections (internal/process/task_queues.js:84:21)
Apr 30 00:44:39 4c02e8e 926b40eabc4a[1074]: [api]     POST /v2/applications/1623312/restart-service 503 - 1486.971 ms
Apr 30 00:44:39 4c02e8e resin-supervisor[2642]: [api]     POST /v2/applications/1623312/restart-service 503 - 1486.971 ms

After that, the supervisor was completely unresponsive. From the customer's perspective, it looked like the device was totally unusable since it could not start the software. The only way to resolve the issue was to reboot the device entirely. There's no way to remotely restart the supervisor from the dashboard.

We have seen this at least one other time on a different device in the past,.

adamshapiro0 commented 3 years ago

For reference, this device is using OS 2.50.1+rev1, and supervisor 11.4.10. This is a somewhat old version and we would like to update as soon as possible, but currently cannot due to an ongoing issue (https://forums.balena.io/t/uart-failure-after-upgrading-to-2-71-3-rev1-from-2-50-1-rev1-on-variscite-imx8m/275528/10).

cywang117 commented 3 years ago

Hey @adamshapiro0 , thanks for opening an issue. We will look into this and ping you with any further questions. In the meantime, you can refer to this section of balena's Device Debugging masterclass for how to restart the Supervisor from the dashboard. In short, the method is to run systemctl restart resin-supervisor in a Host OS terminal instance from the device's summary page.

adamshapiro0 commented 3 years ago

Hi @cywang117, thanks for the suggestion. We'll try that for sure.

For reference, I've been updating the following forum post extensively the last few days as the customer has encountered a number of unexpected 503 and 404 errors from the supervisor as he drives across the country: https://forums.balena.io/t/unexpected-supervisor-crash-container-restart-on-network-change/197748/195. There are a lot of details in there about the circumstances that seem to cause this problem. We've been wrestling with unexpected supervisor issues on startup for a very long time now.

We believe many of the supervisor issues relate back to possible handling of network connectivity changes at boot. Through the customer experimenting a lot, we have found that the error seems to happen most under the following circumstances:

  1. Wifi network (cell phone tether) is activated
  2. Boot device
  3. Device detects wifi network and network manager changes the network configuration some time after boot, and seemingly after the supervisor has begun to startup
  4. Issue a restart-service request to the supervisor 30 seconds after boot -- encounter 503 or 404 error most of the time

If we leave the wifi network offline during boot, wait a few minutes, and then turn it on after the restart-service request has been made and our application is running, the supervisor seems to work ok. The customer is using this workaround for now, but this is not really a sustainable solution obviously.