Closed · pcarranzav closed this issue 1 year ago
$ journalctl -u resin-supervisor
# continuously prints the following every 30s
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at /usr/src/app/dist/app.js:573:91245
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at /usr/src/app/dist/app.js:573:91208
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at m.buildPayload (/usr/src/app/dist/app.js:573:91218)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:573:90718)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at emitNone (events.js:91:20)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at IncomingMessage.emit (events.js:185:7)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at endReadableNT (_stream_readable.js:974:12)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at _combinedTickCallback (internal/process/next_tick.js:80:11)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: at process._tickCallback (internal/process/next_tick.js:104:9)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: reason: undefined,
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: statusCode: 400,
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: json: { message: '2 matches found based on name: network 1006040_default is ambiguous' } }
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: [2019-01-23T00:07:04.078Z] Apply error Error: (HTTP code 400) unexpected - 2 matches found based on name: network 1006040_default is ambiguous
Jan 23 00:07:34 ebe09d3 resin-supervisor[930]: [2019-01-23T00:07:34.073Z] Applying target state
Jan 23 00:07:34 ebe09d3 resin-supervisor[930]: [2019-01-23T00:07:34.377Z] Scheduling another update attempt due to failure: 30000 { Error: (HTTP code 400) unexpected - 2 matches found based on name: network 1006040_default is ambiguous
root@<hostos>:~# balena network ls
NETWORK ID     NAME              DRIVER    SCOPE
520c9bffb1c5   1006040_default   bridge    local
b45e3cde3086   1006040_default   bridge    local
1d839a7b5d01   bridge            bridge    local
030457c2e70e   host              host      local
7260ec4dc409   none              null      local
83c9ca308ad5   supervisor0       bridge    local
The following "solves" the problem:
root@ebe09d3:~# balena network rm b45e3cde3086
b45e3cde3086
I added the CheckDuplicate flag to the network create call to try to combat this, but it doesn't seem to have made much of a difference, and every now and again another device gets into this state. I think we'll have to be more aggressive with our checking before creating a network.
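For reference, the Engine API's network-create endpoint accepts a CheckDuplicate field in its payload. The sketch below is illustrative only (the function name and shape are mine, not the supervisor's code); it just shows a payload with the flag set:

```typescript
// Hypothetical helper building a network-create payload for the Engine API
// (POST /networks/create). Names here are illustrative, not supervisor code.
interface NetworkCreateOptions {
  Name: string;
  Driver: string;
  CheckDuplicate: boolean;
  Labels: Record<string, string>;
}

function buildNetworkCreateOptions(appId: number): NetworkCreateOptions {
  return {
    Name: `${appId}_default`,
    Driver: 'bridge',
    // Ask the engine to reject the request if a network with this name
    // already exists. The engine's check is not atomic, which is why
    // concurrent create calls can still both succeed.
    CheckDuplicate: true,
    Labels: { 'io.resin.supervised': 'true' },
  };
}

console.log(buildNetworkCreateOptions(1006040).Name); // "1006040_default"
```

As the comment notes, CheckDuplicate is only a best-effort check on the engine side, which is consistent with it "not making much of a difference" under concurrency.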
A customer reported that they were running some ip link commands on a device that experienced this issue, but it's not clear whether this was related (see linked Front conversations).
Can we make the supervisor itself delete duplicate networks when the condition causes an error to be thrown?
[garethtdavies] This issue has attached support thread https://jel.ly.fish/#/3ec7cc64-fb74-453c-a936-bac599e60dab
[gelbal] This issue has attached support thread https://jel.ly.fish/#/272d824c-cb28-40b9-b254-d6e528af4691
[tmigone] This issue has attached support thread https://jel.ly.fish/#/dc8d1a3d-aa7f-4136-b9ed-743fdd324f73
[pipex] This issue has attached support thread https://jel.ly.fish/2ed86df2-cccc-422c-992b-6b191a721a5b
[majorz] This issue has attached support thread https://jel.ly.fish/29ab2c55-6d80-4d3e-a4c2-8b3f63beeb64
[20k-ultra] This issue has attached support thread https://jel.ly.fish/15771763-9491-4a94-ac02-4d05ee47eb58
This is still happening with the newest Supervisor at the time of writing, v14.0.9
[cywang117] This has attached https://jel.ly.fish/caa3f919-7e7d-4ce2-8f85-235dadd6129a
Update on this: it seems this is due to the engine having issues handling concurrent calls to create a network. There has been an open issue for this since 2016: https://github.com/moby/moby/issues/20648
The issue can easily be reproduced as follows:
root@9cafbb5:~# balena network create abc & balena network create abc &
[1] 1106
[2] 1107
root@9cafbb5:~# bbe73fd0d4d98f2bfb0eda3f2bdc52679ad94229fd87fa8f4cbdc8a20aa6741e
c228aa5ffebf21c4da42282c942228d8108f4817fbd6fd3e438f3f1615f015ed
[1]- Done balena network create abc
[2]+ Done balena network create abc
root@9cafbb5:~# balena network ls
NETWORK ID     NAME      DRIVER    SCOPE
bbe73fd0d4d9   abc       bridge    local
c228aa5ffebf   abc       bridge    local
Creating a duplicate network sequentially, on the other hand, is not possible:
root@9cafbb5:~# balena network create abc
83319bee57454ca03d83094088e9659453faef2a686bdd0bc3c56eba1e06f0e7
root@9cafbb5:~# balena network create abc
Error response from daemon: network with name abc already exists
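The engine behavior above looks like a classic check-then-create race. This toy simulation (pure TypeScript, no Docker involved; all names are mine) shows how two concurrent creators can both pass the existence check before either has actually created the network:

```typescript
// Toy illustration of a non-atomic check-then-create race.
const networks: string[] = [];

async function createNetwork(name: string): Promise<void> {
  const exists = networks.includes(name); // 1. check
  await new Promise<void>((r) => setTimeout(r, 10)); // 2. window for the race
  if (!exists) networks.push(name); // 3. create
}

async function main(): Promise<void> {
  // Both calls run the check before either runs the create,
  // so both see "no such network" and both create it.
  await Promise.all([createNetwork('abc'), createNetwork('abc')]);
  console.log(networks); // [ 'abc', 'abc' ]
}
main();
```

A sequential caller never hits the window between check and create, which matches the shell transcript: parallel creates both succeed, sequential creates fail as expected.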
I'm still not sure how this can happen on the supervisor. It is possible that the state engine is creating multiple createNetwork steps under some conditions (most likely a race); if these steps are run in parallel, you may end up with more than one network with the same name.
I don't see a simple way the supervisor could create two createNetwork steps, though. generateStepsForComponent will generate a single step for each component (i.e. volume/network), meaning that if there is only one network defined, just one step will be generated. The function is only called once per cycle.
Two possible hypotheses:
The state engine generates a createNetwork step and the supervisor executes it, but while executing it the engine takes more time than usual for some unknown reason. The supervisor retries generating steps and again creates a new createNetwork step. When the engine requests succeed, the network has been created twice. This is unlikely, since network creation is awaited by the supervisor (unlike the kill step, for instance): https://github.com/balena-os/balena-supervisor/blob/d11d4fba916fb39c398b87b82bfc4e2d9e16e10f/src/compose/composition-steps.ts#L267
[pipex] This has attached https://jel.ly.fish/0813975f-c9af-45d7-a791-1f67b721c385
[cywang117] This has attached https://jel.ly.fish/eb798067-7fbd-47f4-9c6e-636d40e506c9
I just got an idea which could at least help us reduce some of these errors. This error will also appear when inspecting networks here: https://github.com/balena-os/balena-supervisor/blob/cfd18a7620ebabc8c0bb636db9fcecf660a7eeec/src/compose/network-manager.ts#L14-L23
This happens because we are inspecting the network by Name instead of by Id. While we would still have the issue if a container tried to attach to the network by name, this will at least prevent issues happening with the supervisor network.
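To make the idea concrete, here is a minimal sketch (my own illustrative code, not the network-manager's actual implementation): rather than asking the engine to resolve a possibly-ambiguous name, list networks and filter client-side, so duplicates become detectable instead of producing a 400:

```typescript
// Illustrative duplicate-safe lookup: filter a network listing by name
// client-side instead of asking the engine to resolve the name.
interface NetworkSummary {
  Id: string;
  Name: string;
}

function findByName(networks: NetworkSummary[], name: string): NetworkSummary[] {
  return networks.filter((n) => n.Name === name);
}

// Listing shaped like the one from the affected device above.
const networks: NetworkSummary[] = [
  { Id: '520c9bffb1c5', Name: '1006040_default' },
  { Id: 'b45e3cde3086', Name: '1006040_default' },
];

const matches = findByName(networks, '1006040_default');
if (matches.length > 1) {
  // Duplicate detected: the caller can keep one copy and remove the
  // rest by Id, which is never ambiguous on the remove endpoint.
  console.log(`duplicate network: ${matches.length} matches`);
}
```

Any subsequent inspect/remove call can then use the Id, avoiding the "2 matches found based on name" error entirely.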
@pipex This sounds like a good idea!
As a test to see if querying network by ID would work, I ran the following on my dev machine:
❯ docker network create test & docker network create test
❯ docker network ls
NETWORK ID NAME DRIVER SCOPE
66195a914d19 bridge bridge local
4d827e5a1394 host host local
8709b3a2b1ab none null local
14c242df09f9 test bridge local
0d0f83c33452 test bridge local
❯ docker run --rm -it --network=14c242df09f9 alpine sh
docker: Error response from daemon: network test is ambiguous (2 matches found on name).
❯ docker run --rm -it --network=0d0f83c33452 alpine sh
docker: Error response from daemon: network test is ambiguous (2 matches found on name).
Unfortunately, it looks like the Engine still tries to query by network name under the hood, even if you provide a unique network ID to a container...
The other proactive move would be for the Supervisor to periodically check for duplicate networks and, if found, (1) kill any containers associated with the network, (2) remove both instances of the network, (3) add create steps for all killed containers and allow the state funnel to calculate and perform the network creation again. This doesn't get at the root cause of the two network create calls occurring, but it's a preventative measure for a behavior that's very hard to reproduce using the Supervisor.
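The detection-and-removal part of that plan could look something like the following sketch (step names and shapes are hypothetical, not the supervisor's composition-step types): group networks by name and, for any name with more than one match, emit removals by Id followed by a single re-create:

```typescript
// Hedged sketch of the proposed recovery plan; all names hypothetical.
interface Net {
  Id: string;
  Name: string;
}
type Step =
  | { action: 'removeNetwork'; id: string }
  | { action: 'createNetwork'; name: string };

function duplicateCleanupSteps(networks: Net[]): Step[] {
  // Group networks by name to find duplicates.
  const byName = new Map<string, Net[]>();
  for (const n of networks) {
    const group = byName.get(n.Name) ?? [];
    group.push(n);
    byName.set(n.Name, group);
  }
  const steps: Step[] = [];
  byName.forEach((group, name) => {
    if (group.length > 1) {
      // Remove every copy by Id (never ambiguous), then create once.
      for (const n of group) {
        steps.push({ action: 'removeNetwork', id: n.Id });
      }
      steps.push({ action: 'createNetwork', name });
    }
  });
  return steps;
}
```

Removing all copies and re-creating (rather than guessing which copy to keep) sidesteps the question of which duplicate the containers were attached to.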
Unfortunately, it looks like the Engine still tries to query by network name under the hood, even if you provide a unique network ID to a container...
Oh no, this is bad news.
The other proactive move would be for the Supervisor to periodically check for duplicate networks, and if found, (1) kill containers associated with the network
Sure, but this is really adding a new execution path to the state engine which is always a bit risky.
However, this hopefully means that there should not be any containers pointing to the duplicate network. Assuming that's the case, when getting the 400 on the inspect we could just remove the duplicates, throw an error, and let the engine calculate the new path, which should now involve recreating the network.
Another measure we should take is, in the application manager, before returning the next steps, to scan the list for any duplicate steps and remove them. This assumes the issue comes from a duplicate step, but it's a sanity check before executing the steps.
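That sanity check is a small pure function. A minimal sketch, assuming a hypothetical Step shape (not the supervisor's actual step type): keep the first occurrence of each step and drop exact repeats, so two identical createNetwork steps collapse into one.

```typescript
// Illustrative step de-duplication, keeping first occurrences in order.
interface Step {
  action: string;
  target: string;
}

function dedupeSteps(steps: Step[]): Step[] {
  const seen = new Set<string>();
  return steps.filter((s) => {
    const key = `${s.action}:${s.target}`;
    if (seen.has(key)) return false; // drop exact duplicate
    seen.add(key);
    return true;
  });
}

const steps: Step[] = [
  { action: 'createNetwork', target: '1006040_default' },
  { action: 'createNetwork', target: '1006040_default' },
];
console.log(dedupeSteps(steps).length); // 1
```

Since the filter preserves the first occurrence and the original ordering, it would not change behavior for a step list that is already duplicate-free.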
Seen once, on a device that was updated from version 7.1.11 to 7.1.14. I'm not sure whether anything particular about this device or its application might have triggered this situation.
Basically, there were two networks with the same name (<appId>_default) and this made balena complain with a server error - 500, because a command referencing the network by its name was ambiguous. Both networks had the io.resin.supervised label. I still don't see how the supervisor would create an additional network with the same name if the network already exists - unless there's some weird race condition or something like that. Needs investigation, so if this happens again please record supervisor logs and what precise actions got the device into this state.
The workaround was to delete one of the conflicting networks with balena network rm <network ID>.
Front conversations