Two networks with the same name can make the supervisor unable to apply the target state (network is ambiguous)

pcarranzav commented 6 years ago

Seen once, in a device that was updated from version 7.1.11 to 7.1.14. I'm not sure if anything particular about this device or its application might have triggered this particular situation.

Basically, there were two networks with the same name (<appId>_default) and this made balena complain with a server error - 500 because a command referencing the network by its name was ambiguous. Both networks had the io.resin.supervised label. I still don't see how the supervisor would create an additional network with the same name if the network already exists - unless there's some weird race condition or something like that. Needs investigation, so if this happens again please record supervisor logs and what precise actions got the device in this state.

Workaround was to delete one of the conflicting networks with balena network rm <network ID>.

Front conversations

pdcastro commented 5 years ago

$ journalctl -u resin-supervisor
# continuously prints the following every 30s
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at /usr/src/app/dist/app.js:573:91245
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at /usr/src/app/dist/app.js:573:91208
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at m.buildPayload (/usr/src/app/dist/app.js:573:91218)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at IncomingMessage.<anonymous> (/usr/src/app/dist/app.js:573:90718)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at emitNone (events.js:91:20)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at IncomingMessage.emit (events.js:185:7)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at endReadableNT (_stream_readable.js:974:12)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at _combinedTickCallback (internal/process/next_tick.js:80:11)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:     at process._tickCallback (internal/process/next_tick.js:104:9)
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:   reason: undefined,
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:   statusCode: 400,
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]:   json: { message: '2 matches found based on name: network 1006040_default is ambiguous' } }
Jan 23 00:07:04 ebe09d3 resin-supervisor[930]: [2019-01-23T00:07:04.078Z] Apply error Error: (HTTP code 400) unexpected - 2 matches found based on name: network 1006040_default is ambiguous
Jan 23 00:07:34 ebe09d3 resin-supervisor[930]: [2019-01-23T00:07:34.073Z] Applying target state
Jan 23 00:07:34 ebe09d3 resin-supervisor[930]: [2019-01-23T00:07:34.377Z] Scheduling another update attempt due to failure:  30000 { Error: (HTTP code 400) unexpected - 2 matches found based on name: network 1006040_default is ambiguous

root@<hostos>:~# balena network ls
NETWORK ID          NAME                DRIVER              SCOPE
520c9bffb1c5        1006040_default     bridge              local
b45e3cde3086        1006040_default     bridge              local
1d839a7b5d01        bridge              bridge              local
030457c2e70e        host                host                local
7260ec4dc409        none                null                local
83c9ca308ad5        supervisor0         bridge              local

The following "solves" the problem:

root@ebe09d3:~# balena network rm b45e3cde3086
b45e3cde3086

CameronDiver commented 5 years ago

I added the CheckDuplicate flag to the network create call to try and combat this, but it seems to not have made much of a difference, and every now and again another device gets into this state. I think we'll have to be more aggressive with our checking before creating a network.

pdcastro commented 5 years ago

A customer reported that they were running some ip link commands on a device that experienced this issue, but it's not clear whether this was related (see linked Front conversations).

pdcastro commented 5 years ago

Can we make the supervisor itself delete duplicate networks when the condition causes an error to be thrown?

balena-ci commented 4 years ago

[garethtdavies] This issue has attached support thread https://jel.ly.fish/#/3ec7cc64-fb74-453c-a936-bac599e60dab

balena-ci commented 4 years ago

[gelbal] This issue has attached support thread https://jel.ly.fish/#/272d824c-cb28-40b9-b254-d6e528af4691

jellyfish-bot commented 4 years ago

[tmigone] This issue has attached support thread https://jel.ly.fish/#/dc8d1a3d-aa7f-4136-b9ed-743fdd324f73

jellyfish-bot commented 4 years ago

[pipex] This issue has attached support thread https://jel.ly.fish/2ed86df2-cccc-422c-992b-6b191a721a5b

jellyfish-bot commented 3 years ago

[majorz] This issue has attached support thread https://jel.ly.fish/29ab2c55-6d80-4d3e-a4c2-8b3f63beeb64

jellyfish-bot commented 3 years ago

[20k-ultra] This issue has attached support thread https://jel.ly.fish/15771763-9491-4a94-ac02-4d05ee47eb58

cywang117 commented 2 years ago

This is still going on with newest Supervisor at time of writing, v14.0.9

jellyfish-bot commented 2 years ago

[cywang117] This has attached https://jel.ly.fish/caa3f919-7e7d-4ce2-8f85-235dadd6129a

pipex commented 2 years ago

Update on this, it seems that this is due to the engine having issues for handling concurrent calls to create the network. There is an open issue for this since 2016 https://github.com/moby/moby/issues/20648

An easy replication can be done as follows

root@9cafbb5:~# balena network create abc & balena network create abc & 
[1] 1106
[2] 1107
root@9cafbb5:~# bbe73fd0d4d98f2bfb0eda3f2bdc52679ad94229fd87fa8f4cbdc8a20aa6741e
c228aa5ffebf21c4da42282c942228d8108f4817fbd6fd3e438f3f1615f015ed

[1]-  Done                    balena network create abc
[2]+  Done                    balena network create abc
root@9cafbb5:~# balena network ls
NETWORK ID          NAME                                       DRIVER              SCOPE
bbe73fd0d4d9        abc                                        bridge              local
c228aa5ffebf        abc                                        bridge              local

While creating networks sequentially is not possible

root@9cafbb5:~# balena network create abc
83319bee57454ca03d83094088e9659453faef2a686bdd0bc3c56eba1e06f0e7
root@9cafbb5:~# balena network create abc
Error response from daemon: network with name abc already exists

I'm still not sure how this can happen on the supervisor, it is possible that the state engine is creating multiple createNetwork steps under some conditions (most likely a race), if these steps are ran in parallel, then you may end up with more than one network with the same name.

pipex commented 2 years ago

I don't see a simple way the supervisor can create to createNetwork steps. The generateStepsForComponent will generate a single step for each component (i.e. volume/networks). Meaning if there is only one network defined, just one step will be generated. The function is only called once per cycle

https://github.com/balena-os/balena-supervisor/blob/d11d4fba916fb39c398b87b82bfc4e2d9e16e10f/src/compose/app.ts#L437

Two possible hypotheses

The supervisor generates the createNetwork step, the supervisor executes it, but while executing it, the engine takes some more time than usual due to some unknown reason. The supervisor re-tries generating steps and again it creates a new createNetwork step. When the engine request succeed, the network has been created twice. This is unlikely, since the network creation is awaited by the supervisor (unlike the kill step, for instance) https://github.com/balena-os/balena-supervisor/blob/d11d4fba916fb39c398b87b82bfc4e2d9e16e10f/src/compose/composition-steps.ts#L267
Something gets corrupted in the engine. When looking if the network exists, the engine answers with "network does not exist", the supervisor creates the network again (possible since the engine thinks the network does not exist), but when trying to remove, some of the old network data is still there and the engine replies with network is ambiguous

jellyfish-bot commented 2 years ago

[pipex] This has attached https://jel.ly.fish/0813975f-c9af-45d7-a791-1f67b721c385

jellyfish-bot commented 1 year ago

[cywang117] This has attached https://jel.ly.fish/eb798067-7fbd-47f4-9c6e-636d40e506c9

pipex commented 1 year ago

I just got an idea which could at least help us reduce some of these errors. This error will also appear when inspecting networks here https://github.com/balena-os/balena-supervisor/blob/cfd18a7620ebabc8c0bb636db9fcecf660a7eeec/src/compose/network-manager.ts#L14-L23

This happens because we are inspecting the network by Name instead of by Id. While we still have the issue if an container tries to attach the network by name, this will at least prevent issues happening with the supervisor network.

cywang117 commented 1 year ago

@pipex This sounds like a good idea!

cywang117 commented 1 year ago

As a test to see if querying network by ID would work, I ran the following on my dev machine:

❯ docker network create test & docker network create test

❯ docker network ls
NETWORK ID     NAME      DRIVER    SCOPE
66195a914d19   bridge    bridge    local
4d827e5a1394   host      host      local
8709b3a2b1ab   none      null      local
14c242df09f9   test      bridge    local
0d0f83c33452   test      bridge    local

❯ docker run --rm -it --network=14c242df09f9 alpine sh
docker: Error response from daemon: network test is ambiguous (2 matches found on name).

❯ docker run --rm -it --network=0d0f83c33452 alpine sh
docker: Error response from daemon: network test is ambiguous (2 matches found on name).

Unfortunately, it looks like the Engine still tries to query by network name under the hood, even if you provide a unique network ID to a container...

The other proactive move would be for the Supervisor to periodically check for duplicate networks, and if found, (1) kill containers associated with the network if any, (2) remove both instances of the network, (3) add create steps for all containers killed and allow the state funnel to calculate and perform the network creation again. This doesn't get at the root cause of two network create calls occurring, but it's a preventative measure for a behavior that's very hard to reproduce using the Supervisor.

pipex commented 1 year ago

Unfortunately, it looks like the Engine still tries to query by network name under the hood, even if you provide a unique network ID to a container...

Oh no, this is bad news.

The other proactive move would be for the Supervisor to periodically check for duplicate networks, and if found, (1) kill containers associated with the network

Sure, but this is really adding a new execution path to the state engine which is always a bit risky.

However,

We know the error will be thrown on inspect by name
And we know the error will happen when trying to add the duplicate network.
We also suspect that this can only happen if the networks are created at the same time

This hopefully means that there should not be any containers pointing to the duplicate network. Assuming that's the case, when getting the 400 on the inspect we could just remove the duplicates, throw an error and let the engine calculate the new path, which should now involve recreating the network.

pipex commented 1 year ago

Another measure we should take is, on the application manager, before returning the next steps, scan the list for any duplicate steps and remove them. This assumes the issue comes from a duplicate step but is a sanity check before executing the steps.

balena-os / balena-supervisor

Two networks with the same name can make the supervisor unable to apply the target state (network is ambiguous) #590