Service discovery not working on non 10.0.0.0/24 networks

function61 / promswarmconnect

Bridges Docker Swarm services to Prometheus without any changes to Prometheus

https://function61.com/

Apache License 2.0


Service discovery not working on non 10.0.0.0/24 networks #6

Closed ainsey11 closed 5 years ago

ainsey11 commented 5 years ago

Heya!

I've just done the following:

docker swarm init --default-addr-pool=172.31.0.0/16

Then I joined 24 nodes to the cluster and deployed my services.

promswarmconnect is on a manager node and starts up fine. If I curl the https://promswarmconnect/v1/discover URL from within a container on the same network, I get the error:

node not found for task bdvnlnu25rtnxsv3w0cwpsv3q

I'm running the latest version from Docker Hub. The container outputs no errors, only:

2019/04/17 21:43:20 runHttpServer [INFO] Started vdev

Any ideas?

Cheers! Ainsey

ainsey11 commented 5 years ago

To add: this was working fine on the default Docker address pool. However, we were having IP conflicts because our external network is on the same range, hence the change of pool.

ainsey11 commented 5 years ago

To add to the strangeness: I tagged another node with the "core" label, went and made a coffee, and it seems to have sprung into life. Very odd.

May have been user error on this one! Though I'd be interested if you could explain what conditions cause the error. I tried to read the Go files, but I'm afraid my Go experience is limited.

joonas-fi commented 5 years ago

Thanks for reporting - any bugs we can squash will make it better for me/others as well.. :)

That error message comes from this line: https://github.com/function61/promswarmconnect/blob/6504965b219a8a5bfe776bfd924ce2372056b673/cmd/promswarmconnect/dockerdiscovery.go#L78

If you scroll up from that line, the code does the following, in this order:

fetch tasks
fetch services
fetch nodes

I don't think there even could be a race condition for a task existing before a node has joined the Swarm, because nodes are fetched after tasks are fetched.

Did you quote the error message verbatim? Because the error message is missing the node ID. It should say "node [node id] not found for task [task id]".

If your error message is really missing the node ID, there might be a case where a task exists before it is assigned to a node, which our code should take into account. This is just a hunch - I will research more.
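To illustrate the hunch, here is a minimal Go sketch, not the actual promswarmconnect code. The `Task` and `Node` types, the `lookupNode` helper, and the exact format string are assumptions made up for this example. It shows how a strict lookup would produce an error with an empty node ID when a task has not been assigned to a node yet.

```go
package main

import "fmt"

// Task and Node are simplified, hypothetical stand-ins for the Docker API types.
type Task struct {
	ID     string
	NodeID string // assumed to stay empty until the task is assigned to a node
}

type Node struct {
	ID       string
	Hostname string
}

// lookupNode is a hypothetical strict lookup: every miss is treated as an error.
func lookupNode(task Task, nodes map[string]Node) (Node, error) {
	node, ok := nodes[task.NodeID]
	if !ok {
		// The format string is an assumption modeled on the message quoted above.
		return Node{}, fmt.Errorf("node %s not found for task %s", task.NodeID, task.ID)
	}
	return node, nil
}

func main() {
	nodes := map[string]Node{"n1": {ID: "n1", Hostname: "worker-1"}}

	// A task that exists but has no node yet (NodeID left empty).
	pending := Task{ID: "bdvnlnu25rtnxsv3w0cwpsv3q"}

	if _, err := lookupNode(pending, nodes); err != nil {
		// %q makes the empty node ID visible as a doubled space.
		fmt.Printf("%q\n", err)
	}
}
```

With an empty `NodeID`, the message contains nothing where the node ID would be, which is consistent with the error as it was reported.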

joonas-fi commented 5 years ago

I did some research:

The docs don't say whether the NodeID can be empty
Nor is there any help in the type definition
The Docker source code implies it can be empty

Also, the task state docs hint (ASSIGNED = Docker assigned the task to nodes.) that a task can exist before it is assigned to run on a specific node.

Knowing this new info, a fix would be to just skip tasks that are not yet assigned to a node, since our context is discovering running containers.
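A rough Go sketch of that idea follows. It is not the actual change that was made. The `Task` type and the `assignedTasks` helper are hypothetical names for this example, and it assumes an unassigned task is recognizable by an empty `NodeID`.

```go
package main

import "fmt"

// Task is a simplified, hypothetical stand-in for the Docker API task type.
type Task struct {
	ID     string
	NodeID string // assumed empty while the task has no node yet
}

// assignedTasks returns only the tasks that already have a node,
// so that later node lookups never see an empty node ID.
func assignedTasks(tasks []Task) []Task {
	assigned := make([]Task, 0, len(tasks))
	for _, t := range tasks {
		if t.NodeID == "" {
			continue // not assigned to a node yet, so there is no running container to discover
		}
		assigned = append(assigned, t)
	}
	return assigned
}

func main() {
	tasks := []Task{
		{ID: "t1", NodeID: "n1"},
		{ID: "t2"}, // still waiting for a node
		{ID: "t3", NodeID: "n2"},
	}
	for _, t := range assignedTasks(tasks) {
		fmt.Println(t.ID, t.NodeID)
	}
}
```

Skipping instead of returning an error means one pending task no longer fails the whole discovery response. The task should show up on a later request once it has a node.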

joonas-fi commented 5 years ago

Version 20190418_0906_e8c58cac is now available and should fix this: https://hub.docker.com/r/fn61/promswarmconnect/tags

I deployed this version in my production, and servers did not catch on fire 🎉

joonas-fi commented 5 years ago

So, my hypothesis is that you hit promswarmconnect's API at exactly the "wrong" time, when Docker was thinking "hmm, which node should this task run on?", and there was a bug in promswarmconnect's assumption that tasks are already properly assigned when they are born.

Did you encounter the error just after you requested Docker to schedule new tasks, or just after you introduced new nodes to the Swarm?

After you encountered the error, did you try again to see if the same error shows up? If you did, how many times did you try again and for how long do you think the time window for errors was, since you mentioned that after you had coffee, it started working?

Just asking out of curiosity, since if my hypothesis is correct then this issue is fixed anyway.. :)

ainsey11 commented 5 years ago

Apologies for the delayed response!

In response to the node ID being missing: yep, that line is exactly as it was printed in the logs. The error involved a node that was new to the swarm. However, that node did have some routing problems, and once I resolved those everything seemed to spring into life. So I suspect it's a mix of bad timing and the node not being able to fully communicate with the swarm and promswarmconnect's API.

hope that helps and thanks for your prompt assistance!

Ainsey

joonas-fi commented 5 years ago

Yeah, could be a mix of those problems. Thanks for reporting this! My change should fix this situation, now that unassigned tasks are skipped. Feel free to reopen if the same issue persists with the new version!