💁 Feedback request on agent support for swarm and remote hosts

amir20 commented 4 weeks ago

Why is agent needed?

Dozzle doesn't support swarm mode natively. Remote hosts are supported by setting up connections with TLS or https://github.com/Tecnativa/docker-socket-proxy. However, there have been many problems with this approach. Here are some I have found:

1. Security - docker-socket-proxy is insecure. Setting up TLS is the preferred route, but doing so is very complicated. It requires generating certs, updating Docker settings, and listening on a TCP port.
2. Ease of use - The majority of bugs I get are related to not being able to set up socket-proxy. Dozzle works as designed, but people find that the combination of the two can break.
3. Stability - Dozzle uses the native Docker API, which handles connectivity. When a connection fails, Dozzle doesn't try to reconnect. A lot of folks have asked for automatic reconnection.

Setting up an agent can solve many problems.

How will the agent be implemented?

Currently, I am working on an agent using gRPC. It will use TLS by default. A compose file for swarm might look like:


services:
  dozzle:
    image: amir20/dozzle
    ports:
      - "8080:8080"
    environment:
      DOZZLE_MODE: swarm
    deploy:
      mode: global
      endpoint_mode: dnsrr

In this swarm mode, I want minimal setup. Dozzle will be smart enough to create its own mesh using endpoint_mode: dnsrr.

For remote hosts, it could look like this:

services:
  dozzle:
    image: amir20/dozzle
    ports:
      - "8080:8080"
    environment:
      DOZZLE_REMOTE_AGENTS: host1,host2

Where the agents are set up remotely like this:

services:
  agent:
    image: amir20/dozzle
    ports:
      - "7007:7007"
    environment:
      DOZZLE_MODE: agent

In this mode, the agent will expose its own API over port 7007. There will then be one installation of the Dozzle UI, which will connect to all agents.

Deprecation of remote connections

I think with the implementation of agents, it would be safe to remove remote connections. However, I am not sure if people still care to use them. I can't imagine a use case where a remote connection with socket-proxy is preferred over Dozzle's own agent.

Questions off the top of my head

How do the sample compose files look? Are they easy enough to use?
How much do folks care about security?
Should the use of socket-proxy be removed in the future?
What are some examples of use cases for keeping remote connections once agents are supported?

Feedback welcomed!

githubbiswb commented 3 weeks ago

I think an agent is a good idea. This is also how the Portainer project handles multiple nodes.

I also can understand wanting to steer away from docker-socket-proxy especially for reason 2.

Security does matter a lot, but I have also found that it is easily solved by creating a Docker network in swarm that isn't exposed, with Dozzle and the docker-socket-proxy containers communicating across it. This way TLS doesn't cause any problems, and we don't need to expose any Docker socket to the network.
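
As a rough sketch of that setup, the proxy publishes no ports and Dozzle reaches it over a private overlay network. The network name dozzle-internal and the service name socket-proxy are made up for illustration, and a single proxy instance is assumed; a multi-node swarm would need one proxy per node and a way to address each of them.

```yaml
# Illustrative only: names are hypothetical and a single proxy instance is assumed.
services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      CONTAINERS: 1                # enable the /containers endpoints
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    networks:
      - dozzle-internal            # no "ports:" entry, so 2375 is never published
  dozzle:
    image: amir20/dozzle
    environment:
      DOZZLE_REMOTE_HOST: tcp://socket-proxy:2375
    ports:
      - "8080:8080"
    networks:
      - dozzle-internal

networks:
  dozzle-internal:
    driver: overlay
```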

I am not sure why you would want to use the dnsrr feature. Is this to be able to get to the proxy hosts? But then wouldn't dnsrr mask each host and just reply with a random one from the pool? This is of course based on whether I understand dnsrr, and I can't say that I really do; plus, I haven't seen it used in any other project I have come across. Anecdotal evidence of 1, no doubt.

Using the internal Docker network, you just have to point at the hostnames of the containers using nodehostname-agentservicename:agentlisteningport, and that port doesn't even have to be exposed on the network. Containers still listen on their internal ports even if you don't give those ports external access.

This is how my reverse proxy is able to connect with Dozzle in my setup. Dozzle has no ports exposed, and the Dozzle container and my reverse proxy share an internal network that the reverse proxy reaches Dozzle on. Only my reverse proxy has an externally exposed network that my web browser can get to.
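
Sketched as a compose file, that layout could look roughly like the following. The network name, the choice of Traefik as the proxy image, and the published port are assumptions for illustration, and the proxy's own routing configuration is left out entirely.

```yaml
# Illustrative only: network name, proxy image, and published port are assumptions.
services:
  dozzle:
    image: amir20/dozzle
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      - internal                   # no "ports:" entry, so Dozzle is not published
  reverse-proxy:
    image: traefik                 # any reverse proxy would do; routing config omitted
    ports:
      - "443:443"                  # the only port reachable from outside
    networks:
      - internal                   # reaches Dozzle at dozzle:8080 over this network

networks:
  internal:
    driver: overlay
```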

Let me know if my thoughts are too rambling here; happy to clarify.

amir20 commented 3 weeks ago

Hey @githubbiswb thanks for this! Here are the answers:

> I also can understand wanting to steer away from docker-socket-proxy especially for reason 2.

+1

> Security does matter a lot, but I have also found that it is easily solved by creating a Docker network in swarm that isn't exposed, with Dozzle and the docker-socket-proxy containers communicating across it. This way TLS doesn't cause any problems, and we don't need to expose any Docker socket to the network.

That's correct. However, that Docker network approach is only valid for swarm, and I have noticed a lot of people don't have swarm set up. They want to use Dozzle across network boundaries. I am also thinking about the future: k8s or even any host setup could work with agents. Since I plan to use mTLS, only the agents can talk to each other.

You are still right that for swarm mode with an overlay network, it won't matter.

> I am not sure why you would want to use the dnsrr feature. Is this to be able to get to the proxy hosts? But then wouldn't dnsrr mask each host and just reply with a random one from the pool? This is of course based on whether I understand dnsrr, and I can't say that I really do; plus, I haven't seen it used in any other project I have come across. Anecdotal evidence of 1, no doubt.

That's right. I need a way to discover all replicas in agent mode so that they can communicate with each other. I guess there are two solutions: 1) use dnsrr, or 2) query the Docker API tasks endpoint directly. With dnsrr, Docker returns the IP addresses of all agents when DNS is queried.

Since I already have access to docker.sock, I think it makes sense to explore just querying the tasks API.

> Using the internal Docker network, you just have to point at the hostnames of the containers using nodehostname-agentservicename:agentlisteningport, and that port doesn't even have to be exposed on the network. Containers still listen on their internal ports even if you don't give those ports external access.

That's right. But if I do it right, you don't even need to map the hostname. Ideally, in swarm mode, I can automatically discover all nodes.

I am trying to make the deployment steps as SIMPLE as possible, because one of the major wins for Dozzle IMO is that it's a one-step deployment. My goal is to only require DOZZLE_MODE: swarm and mode: global.
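
Roughly, the target compose file would shrink to something like this sketch. The docker.sock mount is an assumption based on the note above about already having socket access; everything else is taken from the earlier swarm example with endpoint_mode dropped.

```yaml
# Sketch of the stated goal: only DOZZLE_MODE: swarm and mode: global are required.
services:
  dozzle:
    image: amir20/dozzle
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock   # assumed, for local Docker API access
    ports:
      - "8080:8080"
    environment:
      DOZZLE_MODE: swarm
    deploy:
      mode: global                 # one Dozzle task on every swarm node
```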

> This is how my reverse proxy is able to connect with Dozzle in my setup. Dozzle has no ports exposed, and the Dozzle container and my reverse proxy share an internal network that the reverse proxy reaches Dozzle on. Only my reverse proxy has an externally exposed network that my web browser can get to.

This should still work. The only tricky part is that something like Traefik will still randomly choose a replica, so Dozzle needs to be smart enough to function correctly in a cluster. This won't be easy. :)

Thanks again! It's good to bounce ideas off someone. I get so much usage out of Dozzle, but not many people are advanced enough to provide feedback.

amir20 commented 2 weeks ago

Agent mode can be tested now. Instructions are at https://github.com/amir20/dozzle/pull/3058

Censseo commented 2 weeks ago

Hey! I have tested the new swarm mode a little. This is amazing, thanks for the good work! I still have a few comments about how it works, plus some small tricks that could make life easier:

If I understand correctly, all agents can communicate with each other, and when you use traefik, it connects randomly (through Docker's load balancer) to an instance, which then connects to the other agents and gathers info about all containers. While I understand this design simplifies the "deployment", it has a huge con: there is no master, so we can't "force" an agent to be the "main" endpoint for traefik. This means it can load balance you to a node on the other side of the planet. As I have nodes on 3 continents, this can induce latency and useless workload.
There are little tricks in the UI that could simplify life: at the top of the list, you could add a toggle to minimize the view, because the tree is fully expanded and when you have a lot of stacks it gets tedious to scroll all the way down.
Would it be possible to add an option to disable merging logs by service, so I can see each task of a service?
I think it would be nice to sort the stack tree list by name, because right now it is all jumbled.
On the main view, the RAM info is sometimes wrong for some nodes. I don't know why, but it shows less RAM than the node really has and it doesn't refresh this data, as if it stopped measuring this metric.

Anyway, amazing work, and I am considering using it in production for my infra. I need to test more; the group label feature looks cool and I need more time playing with it. Good job!

amir20 commented 2 weeks ago

Hi @Censseo! Thanks for the feedback. Some questions:

> If I understand correctly, all agents can communicate with each other, and when you use traefik, it connects randomly (through Docker's load balancer) to an instance, which then connects to the other agents and gathers info about all containers. While I understand this design simplifies the "deployment", it has a huge con: there is no master, so we can't "force" an agent to be the "main" endpoint for traefik. This means it can load balance you to a node on the other side of the planet. As I have nodes on 3 continents, this can induce latency and useless workload.

I think this would be up to the deployer to point their proxy to the instance they want. This isn't really a Dozzle issue, as that's just how the overlay network works in Docker.

I am not sure what the alternative would be, though. If I were to force one instance to be the "master", the result would still be similar, because it would have to fetch the list of containers from the other instances that are far away.

Do you think there is a better way?

> There are little tricks in the UI that could simplify life: at the top of the list, you could add a toggle to minimize the view, because the tree is fully expanded and when you have a lot of stacks it gets tedious to scroll all the way down.

I am not doing too many UI updates in this release. The swarm mode UI has been live since v7. But maybe I can come back to adding a minimize-all option. I felt this was never needed because the UI in host mode is sticky: if you collapse a group, it will remain collapsed.

> Would it be possible to add an option to disable merging logs by service, so I can see each task of a service?

I am not sure what this means. In swarm UI, everything is merged. In host view, only one container can be viewed at a time so there is no merging. If merging is not needed then you should navigate to the specific host.

> I think it would be nice to sort the stack tree list by name, because right now it is all jumbled.

Yes, this seems like a good idea.

> On the main view, the RAM info is sometimes wrong for some nodes. I don't know why, but it shows less RAM than the node really has and it doesn't refresh this data, as if it stopped measuring this metric.

I am not sure why this is. I just use Docker's system API to fetch this data, so it might be limited to what Docker sees.

A screenshot would help.

amir20 commented 1 week ago

This is now deployed. Going to close this and accept feedback on a new issue.

Censseo commented 1 week ago

Sorry, I didn't have the time to give you a proper answer. I will continue using it and will write up a proper review of what I meant earlier.
