Gateway Timeout on Docker Swarm worker replicas

statickidz commented 1 month ago

To Reproduce

Create a simple Dokploy Docker Swarm configuration with 1 manager and 1 worker.
Create an app with https://github.com/Dokploy/swarm-test
Set more than 1 replica in the Swarm config
Verify that the deployed replicas are spread across the two instances

Manager

Worker
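For readers reproducing this outside the Dokploy UI, the replica setting from the steps above can be sketched as a plain Docker stack file. This is only an illustration: the service name and image are hypothetical placeholders, and Dokploy itself applies the replica count through its own Swarm settings rather than through a file like this.

```yaml
# Hypothetical stack file; service name and image are placeholders,
# not values taken from this issue.
services:
  swarm-test:
    image: example/swarm-test:latest  # placeholder for an image built from the test repo
    deploy:
      replicas: 2  # more than 1, so Swarm can schedule tasks on both nodes
```

Running `docker service ps <service-name>` on the manager lists each task together with the node it was scheduled on, which is one way to confirm that the replicas landed on both instances.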

Current vs. Expected behavior

I expect all the Docker Swarm containers to work independently, whether the request goes to the manager or to the worker instance. However, when the request goes to the worker instance I get a Gateway Timeout; when it goes to the manager, it works.

Example 1: Request > Swarm decides to go manager > Works
Example 2: Request > Swarm decides to go worker > Gateway Timeout

Provide environment information

Operating System:
  OS: Canonical-Ubuntu-22.04-aarch64-2024.06.26-0
  Arch : arm64
Dokploy version: v0.10.3
VPS Provider: Oracle Cloud
What application/services are you trying to deploy?: Simple Nodejs app

Which area(s) are affected? (Select all that apply)

Application, Docker Compose, Traefik, Docker

Additional context

To rule out a network issue between the instances, I created a rule that opens all ports in the security list. By the way, I'm using this project to boot the instances: https://github.com/statickidz/dokploy-oci-free/

Siumauricio commented 1 month ago

hmm I think something is needed at the traefik level to make it able to route to the worker container.

statickidz commented 4 weeks ago

> hmm I think something is needed at the traefik level to make it able to route to the worker container.

Is this related to my environment, or were you able to make it work before?

Siumauricio commented 3 weeks ago

@statickidz Yes, this worked for me some time ago. However, I haven't tried it since we upgraded Traefik to version 3, so something may have changed.

Siumauricio commented 3 weeks ago

I recently tested it and it is working for me. I used this Docker image:

Screenshot 2024-10-29 at 11 24 18 PM

Screenshot 2024-10-29 at 11 24 33 PM

Screenshot 2024-10-29 at 11 24 39 PM

I don't have any containers running on the Dokploy server.

The worker is running 6 instances: Screenshot 2024-10-29 at 11 25 27 PM

The domain I used: Screenshot 2024-10-29 at 11 26 02 PM

When you open it you will see this: Screenshot 2024-10-29 at 11 26 22 PM

If you reload after a couple of minutes, the information should change, since it is using another private IP and so on, so the load balancing is working fine: Screenshot 2024-10-29 at 11 26 30 PM

statickidz commented 3 weeks ago

@Siumauricio I see! I just created new Dokploy instances (manager and worker) on AWS to check whether it was related to OCI, but I'm getting the same result, which is quite weird. As before, all ports are open and there are no issues joining the Swarm cluster, but when the request is routed to the worker I get the Gateway Timeout. At this point I'm not sure what it could be.

Siumauricio commented 3 weeks ago

Did you do a custom installation, or did you install with the official script?

statickidz commented 3 weeks ago

> Did you do a custom installation, or did you install with the official script?

For the main instance, the official script; for the workers, the commands provided by the "Add Node" button.

https://github.com/statickidz/dokploy-oci-free/blob/main/bin/dokploy-main.sh
https://github.com/statickidz/dokploy-oci-free/blob/main/bin/dokploy-worker.sh

Siumauricio commented 2 weeks ago

Have you checked in the Dokploy dashboard whether the worker is associated in the cluster section?

Siumauricio commented 2 weeks ago

I see you are leaving Docker Swarm on the worker. How did you then link the worker to the manager? Did you follow the steps from the Add Node button manually?

I would recommend you first try the standard way that Dokploy provides, that is, linking the workers manually. If that works, I think the problem is in your infrastructure setup.

binaryYuki commented 2 weeks ago

Is your infrastructure running on Oracle OCI? I encountered the same problem, but it runs normally if executed on the same node where Traefik is located.

statickidz commented 2 weeks ago

> Have you checked in the Dokploy dashboard whether the worker is associated in the cluster section?

Yep, it's displayed correctly

> I see you are leaving Docker Swarm on the worker. How did you then link the worker to the manager? Did you follow the steps from the Add Node button manually?

> I would recommend you first try the standard way that Dokploy provides, that is, linking the workers manually. If that works, I think the problem is in your infrastructure setup.

I get the same result whether I pre-install Docker and leave the swarm beforehand (like in the script) or follow the Dokploy quick steps to install it.

For example, this is the last test on a fresh worker node with the dokploy steps, result is always Gateway Timeout:

@Siumauricio this is a test environment, so if you want to debug this in depth, reach out to me and I can give you access to the instances.

> Is your infrastructure running on Oracle OCI? I encountered the same problem, but it runs normally if executed on the same node where Traefik is located.

I found it on Oracle OCI. It works well if I point all the instances to the manager with this, as you say:

But I feel this is not OCI related, because I created a couple of instances on AWS to try and the result was the same https://github.com/Dokploy/dokploy/issues/592#issuecomment-2447020784

binaryYuki commented 2 weeks ago

> But I feel this is not OCI related, because I created a couple of instances on AWS to try and the result was the same

I just tried it on my Azure server and the same issue occurred.

@Siumauricio Can we try Traefik's load balancing, like this:

[tcp.services]
  # Weighted round robin across the two per-node services below
  [tcp.services.app]
    [[tcp.services.app.weighted.services]]
      name = "appv1"
      weight = 3
    [[tcp.services.app.weighted.services]]
      name = "appv2"
      weight = 1

  # Each service targets one node directly by its private address (host:port)
  [tcp.services.appv1]
    [tcp.services.appv1.loadBalancer]
      [[tcp.services.appv1.loadBalancer.servers]]
        address = "private-ip-server-1:8080"

  [tcp.services.appv2]
    [tcp.services.appv2.loadBalancer]
      [[tcp.services.appv2.loadBalancer.servers]]
        address = "private-ip-server-2:8080"

instead of pointing them directly to the service itself, like this:

  services:
    animeapi-core-409c00-service-11:
      loadBalancer:
        servers:
          - url: http://animeapi-core-409c00:8000
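Since the generated config above is an HTTP service in YAML, the same weighted idea could be sketched in that form rather than as TCP services in TOML. This is only a sketch under assumptions: the service names `app`, `appv1`, and `appv2`, the `private-ip-server-*` hosts, and port 8000 are placeholders, and it assumes the container port is published on each node so that it is reachable at the node's private IP.

```yaml
# Sketch only: names, hosts and port are placeholders, not values from this issue.
http:
  services:
    # Weighted round robin over the two per-node services defined below
    app:
      weighted:
        services:
          - name: appv1
            weight: 3
          - name: appv2
            weight: 1
    # One service per node, addressed by the node's private IP
    appv1:
      loadBalancer:
        servers:
          - url: "http://private-ip-server-1:8000"
    appv2:
      loadBalancer:
        servers:
          - url: "http://private-ip-server-2:8000"
```

A router would then reference `app` instead of the single-URL service. The trade-off is that node addresses are hard-coded in the Traefik config, so Swarm rescheduling a replica onto another node would not be picked up automatically.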
