dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

docker-compose multirole sometimes stuck on boot #195

Open dhiaayachi opened 2 weeks ago

dhiaayachi commented 2 weeks ago

This issue is about the docker-compose-multirole.yaml example in the temporalio/docker-compose repo. I am posting it here because temporalio/docker-compose does not have an issues page.

Expected Behavior

When I run docker compose -f docker-compose-multirole.yaml up, the whole multirole cluster comes up and runs normally.

Actual Behavior

When I run docker compose -f docker-compose-multirole.yaml up, temporal-history sometimes gets stuck at "Waiting for Temporal server to start...", unable to reach the frontend service via nginx. The whole cluster does not seem to come up properly either, as I also cannot connect to the UI service.

This does not always happen, so you may need to try several times. It feels like it occurs roughly 20% of the time, especially when docker-compose is stopped after the services have been running for a while.

Restarting temporal-frontend or temporal-frontend2 sometimes brings the cluster back to a normal state, but not always.

Steps to Reproduce the Problem

  1. git clone https://github.com/temporalio/docker-compose
  2. cd docker-compose
  3. docker plugin install grafana/loki-docker-driver:latest --alias loki --grant-all-permissions
  4. docker compose -f docker-compose-multirole.yaml up
  5. If it comes up cleanly, press Ctrl+C to stop the docker-compose project, wait a moment, and run the previous command again. Repeat until you hit the problem (a rough automation sketch follows this list).
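
A rough way to automate the retry loop (just a sketch, not part of the repo; the 120-second wait is arbitrary, adjust it to your machine):

  # Repeatedly boot the stack, peek at temporal-history's log, then tear down and retry
  for i in $(seq 1 10); do
    docker compose -f docker-compose-multirole.yaml up -d
    sleep 120   # arbitrary settle time
    docker compose -f docker-compose-multirole.yaml logs temporal-history | tail -n 5
    docker compose -f docker-compose-multirole.yaml down
  done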

Specifications

Here is the relevant log output:

temporal-nginx          | 192.168.16.6 - - [27/Feb/2024:13:03:16 +0000] "POST /grpc.health.v1.Health/Check HTTP/2.0" 204 0 "-" "grpc-go/1.59.0"
temporal-nginx          | 2024/02/27 13:03:16 [error] 29#29: *73 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.8:7237", host: "temporal-nginx:7233"
temporal-nginx          | 2024/02/27 13:03:16 [error] 29#29: *73 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.9:7236", host: "temporal-nginx:7233"
temporal-history        | Error: unable to health check "temporal.api.workflowservice.v1.WorkflowService" service: unexpected HTTP status code received from server: 204 (No Content); malformed header: missing HTTP content-type
temporal-history        | ('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
temporal-history        | Waiting for Temporal server to start...
temporal-nginx          | 2024/02/27 13:03:18 [error] 29#29: *76 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.9:7236", host: "temporal-nginx:7233"
temporal-nginx          | 2024/02/27 13:03:18 [error] 29#29: *76 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.8:7237", host: "temporal-nginx:7233"
temporal-nginx          | 192.168.16.6 - - [27/Feb/2024:13:03:18 +0000] "POST /grpc.health.v1.Health/Check HTTP/2.0" 204 0 "-" "grpc-go/1.59.0"
temporal-history        | Error: unable to health check "temporal.api.workflowservice.v1.WorkflowService" service: unexpected HTTP status code received from server: 204 (No Content); malformed header: missing HTTP content-type
temporal-history        | ('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
temporal-history        | Waiting for Temporal server to start...
temporal-history        | Error: unable to health check "temporal.api.workflowservice.v1.WorkflowService" service: unexpected HTTP status code received from server: 204 (No Content); malformed header: missing HTTP content-type
temporal-history        | ('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
temporal-nginx          | 2024/02/27 13:03:19 [error] 29#29: *79 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.8:7237", host: "temporal-nginx:7233"
temporal-nginx          | 2024/02/27 13:03:19 [error] 29#29: *79 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.9:7236", host: "temporal-nginx:7233"
temporal-nginx          | 192.168.16.6 - - [27/Feb/2024:13:03:19 +0000] "POST /grpc.health.v1.Health/Check HTTP/2.0" 204 0 "-" "grpc-go/1.59.0"
temporal-history        | Waiting for Temporal server to start...
temporal-history        | Error: unable to health check "temporal.api.workflowservice.v1.WorkflowService" service: unexpected HTTP status code received from server: 204 (No Content); malformed header: missing HTTP content-type
temporal-history        | ('export TEMPORAL_CLI_SHOW_STACKS=1' to see stack traces)
temporal-nginx          | 2024/02/27 13:03:20 [error] 29#29: *82 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.9:7236", host: "temporal-nginx:7233"
temporal-nginx          | 2024/02/27 13:03:20 [error] 29#29: *82 connect() failed (111: Connection refused) while connecting to upstream, client: 192.168.16.6, server: , request: "POST /grpc.health.v1.Health/Check HTTP/2.0", upstream: "grpc://192.168.16.8:7237", host: "temporal-nginx:7233"
temporal-nginx          | 192.168.16.6 - - [27/Feb/2024:13:03:20 +0000] "POST /grpc.health.v1.Health/Check HTTP/2.0" 204 0 "-" "grpc-go/1.59.0"
temporal-history        | Waiting for Temporal server to start...

Setting TEMPORAL_CLI_SHOW_STACKS=true on temporal-history does not help much:

temporal-history        | Error: unable to health check "temporal.api.workflowservice.v1.WorkflowService" service: unexpected HTTP status code received from server: 204 (No Content); malformed header: missing HTTP content-type
temporal-history        | Stack trace:
temporal-history        | goroutine 1 [running]:
temporal-history        | runtime/debug.Stack()
temporal-history        |       /opt/hostedtoolcache/go/1.20.11/x64/src/runtime/debug/stack.go:24 +0x64
temporal-history        | runtime/debug.PrintStack()
temporal-history        |       /opt/hostedtoolcache/go/1.20.11/x64/src/runtime/debug/stack.go:16 +0x1c
temporal-history        | github.com/temporalio/cli/app.HandleError(0x40003df438?, {0x2b492a0, 0x40007481e0})
temporal-history        |       /home/runner/work/cli/cli/app/app.go:73 +0x134
temporal-history        | github.com/urfave/cli/v2.(*App).handleExitCoder(0x40009d91e0?, 0x400014de00?, {0x2b492a0?, 0x40007481e0?})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/app.go:452 +0x3c
temporal-history        | github.com/urfave/cli/v2.(*Command).Run(0x40009d91e0, 0x40003eb480, {0x4000b10200, 0x1, 0x1})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/command.go:276 +0x768
temporal-history        | github.com/urfave/cli/v2.(*Command).Run(0x406ef60, 0x40003eb340, {0x4000b01be0, 0x2, 0x2})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/command.go:267 +0x948
temporal-history        | github.com/urfave/cli/v2.(*Command).Run(0x406f900, 0x40003eb280, {0x40009c92f0, 0x3, 0x3})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/command.go:267 +0x948
temporal-history        | github.com/urfave/cli/v2.(*Command).Run(0x40009dab00, 0x40003eb140, {0x400004c0c0, 0x4, 0x4})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/command.go:267 +0x948
temporal-history        | github.com/urfave/cli/v2.(*App).RunContext(0x40008da780, {0x2b6a678?, 0x4000058068}, {0x400004c0c0, 0x4, 0x4})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/app.go:332 +0x568
temporal-history        | github.com/urfave/cli/v2.(*App).Run(0x60?, {0x400004c0c0?, 0x400007c768?, 0x49b54?})
temporal-history        |       /home/runner/go/pkg/mod/github.com/urfave/cli/v2@v2.25.7/app.go:309 +0x40
temporal-history        | main.main()
temporal-history        |       /home/runner/work/cli/cli/cmd/temporal/main.go:14 +0x38
temporal-history        | Waiting for Temporal server to start...
dhiaayachi commented 1 day ago

Temporal docker-compose-multirole.yaml Issue

This issue is about the docker-compose-multirole.yaml example in the temporalio/docker-compose repository. Here's a breakdown of the issue and potential solutions:

Problem:

The temporal-history service intermittently gets stuck at "Waiting for Temporal server to start...": its gRPC health check of the frontend service, routed through temporal-nginx, fails because nginx gets "connection refused" from both frontend upstreams, and the UI is also unreachable.

Potential Causes:

  1. Network Issues: The temporal-history service might be unable to connect to the temporal-frontend service due to network connectivity issues or DNS resolution problems.
  2. Race Condition: The temporal-history service might be starting before the temporal-frontend service is fully initialized and listening on the specified port, leading to a connection refusal.
  3. Service Startup Order: Docker Compose's service startup order might not guarantee that the temporal-frontend service is ready before the temporal-history service attempts to connect.
  4. Nginx Configuration: The Nginx configuration might have issues proxying gRPC traffic, leading to connectivity problems for the temporal-history service.
  5. Temporal Server Startup: The Temporal Server might be starting up slowly or encountering issues during startup, causing delays in the history service connecting to the frontend.

Troubleshooting Steps:

  1. Verify Network Connectivity:
    • Check network connectivity between the temporal-history container and the temporal-frontend container. Use docker exec to run ping or nslookup inside the containers and validate DNS resolution (a sketch follows this list).
  2. Adjust Service Startup Order:
    • In docker-compose-multirole.yaml, use the depends_on property so that the temporal-frontend services (and the nginx proxy in front of them) start before the temporal-history service. Note that depends_on alone only orders container startup; to wait for actual readiness, define a healthcheck and use the long-form depends_on with condition: service_healthy.
    • Example:
      services:
        temporal-frontend:
          # ...
        temporal-frontend2:
          # ...
        temporal-nginx:
          # ...
          depends_on:   # nginx proxies to the frontends
            - temporal-frontend
            - temporal-frontend2
        temporal-history:
          # ...
          depends_on:   # history health-checks the frontend via nginx
            - temporal-nginx
  3. Increase docker compose up Timeouts:
    • If containers are slow to stop and restart between runs, you can pass a larger timeout to docker compose up. Note that -t/--timeout controls the container shutdown timeout; it does not make Compose wait longer for services to become ready, so prefer the healthcheck-based depends_on from the previous step for startup ordering.
    • Example:
      docker compose -f docker-compose-multirole.yaml up -t 300  # 5-minute shutdown timeout
  4. Check Nginx Configuration:
    • Review the Nginx configuration used by the temporal-nginx service (referenced from docker-compose-multirole.yaml). Check the upstream block to ensure it points at the temporal-frontend services and their gRPC ports.
    • Make sure the configuration proxies gRPC traffic over HTTP/2 (for example via grpc_pass), since the health checks in the logs are gRPC calls.
  5. Check Temporal Server Startup:
    • Review the Temporal Server configuration (the development.yaml file) to ensure there are no delays in the server's startup process. If necessary, adjust values such as maxJoinDuration or the services' rpcAddress settings to improve connectivity and reduce startup delays.
  6. Use grpc-health-probe:
    • Use the grpc-health-probe tool from the grpc-ecosystem to check whether the frontend service is healthy and reachable by the history service, both directly and through nginx (see the sketch after this list).
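
For steps 1 and 6, a minimal debugging sketch. Assumptions: the container/service names from docker-compose-multirole.yaml, that nslookup is available inside the history image, that nginx's gRPC port 7233 is published to the host, and that grpc_health_probe is installed locally; adjust names and ports to your setup.

  # DNS resolution from inside the history container (step 1)
  docker compose -f docker-compose-multirole.yaml exec temporal-history nslookup temporal-nginx
  docker compose -f docker-compose-multirole.yaml exec temporal-history nslookup temporal-frontend

  # Probe the health endpoint through nginx, the same path the history container uses (step 6)
  grpc_health_probe -addr=localhost:7233 -service=temporal.api.workflowservice.v1.WorkflowService

  # Probe a frontend directly, bypassing nginx, to see which hop is failing
  # (7236 here is a hypothetical host mapping; check the ports published in the compose file)
  grpc_health_probe -addr=localhost:7236 -service=temporal.api.workflowservice.v1.WorkflowService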

Conclusion:

This issue appears to be related to a combination of network connectivity, race conditions, and potentially service startup order issues. By carefully reviewing the configuration files and adjusting the service startup order, you should be able to mitigate this intermittent problem. If the issue persists, reach out to the Temporal community or support team for assistance.

dhiaayachi commented 1 day ago

Thanks for reporting this issue! The logs show that temporal-history is unable to health check the temporal.api.workflowservice.v1.WorkflowService service. This is most likely due to the temporal-frontend service not being available at the expected address, causing the connection to be refused.

The nginx logs show it proxying the health checks to frontend upstreams at 192.168.16.8:7237 and 192.168.16.9:7236, and those connections are being refused, which suggests the frontend services are not yet listening on those ports when the History service's health check runs.

You may need to review your docker-compose-multirole.yaml to ensure these addresses are properly configured.

To help debug further, could you please tell me:

  1. How did you configure your docker-compose-multirole.yaml file?
  2. What are the IP addresses of your docker containers and their respective ports for temporal-frontend, temporal-frontend2, and temporal-history? (One way to collect this is sketched after this list.)
  3. Are you using any custom network settings?
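
One way to collect that information (a sketch; the container names are taken from the log prefixes above and may differ in your setup):

  # Service status and published ports
  docker compose -f docker-compose-multirole.yaml ps

  # Container IP addresses on the compose network
  docker inspect -f '{{.Name}}: {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' \
    temporal-frontend temporal-frontend2 temporal-history temporal-nginx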

Once I have this information I can provide a more tailored solution for your issue.

dhiaayachi commented 1 day ago

Thank you for reporting this issue. We are aware that the docker-compose-multirole.yaml example sometimes experiences a race condition.

The issue occurs when the Temporal History service is unable to reach the Temporal Frontend service via Nginx. This can happen due to a race condition in the startup process.

Here are some troubleshooting steps you can try:

  1. Increase Docker memory: Ensure that Docker has enough memory allocated to avoid resource constraints.
  2. Restart containers: Restart the Temporal History and Temporal Frontend containers to ensure that they are properly synchronized (see the sketch after this list).
  3. Increase Nginx worker connections: Adjust the worker_connections setting in Nginx's configuration to increase the number of connections allowed.
  4. Check network connectivity: Verify that the containers can communicate with each other on the appropriate ports.
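
For step 2, a minimal sketch of restarting only the affected services (service names assumed from docker-compose-multirole.yaml):

  # Restart the frontend and history services while the rest of the stack keeps running
  docker compose -f docker-compose-multirole.yaml restart temporal-frontend temporal-frontend2 temporal-history

  # Follow the history service's log to confirm it gets past the health check
  docker compose -f docker-compose-multirole.yaml logs -f temporal-history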

If these steps don't resolve the issue, please provide more information about your environment, and we can then provide more specific guidance.

For further reference, please refer to Temporal's docker-compose documentation.

dhiaayachi commented 1 day ago

Thanks for reporting the issue.

It seems like the issue is due to the temporal-history service failing to connect to the frontend service. This could be caused by a network issue or a problem with the docker-compose configuration.

Could you try running the docker-compose command with the --verbose flag, or capturing the full service logs, to get more detailed output?
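
For example, to capture timestamped logs from the services involved in the failing health check (a sketch using standard docker compose options; adjust service names as needed):

  docker compose -f docker-compose-multirole.yaml logs --timestamps temporal-history temporal-frontend temporal-frontend2 temporal-nginx > multirole-boot.log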

In the meantime, if the suggestions above don't resolve the issue, please provide more details about your environment and docker-compose configuration; this information will help us understand the issue and find a solution.

dhiaayachi commented 1 day ago

Thanks for reporting this issue! It appears that you are experiencing a problem where the temporal-history service gets stuck when you start your multirole cluster using the example in the temporalio/docker-compose repo.

Could you tell me what version of Temporal Server you are using? I noticed you are using Temporal CLI version 1.22.4.

Also, have you tried running the docker-compose-multirole.yaml example with a different database backend like MySQL or Cassandra?

If the issue is still not resolved, please provide more details about your setup and configuration to help us better understand the problem.

Let me know if you have any more questions.