Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

An exception occurred: Error polling system task in queue:[START_WORKFLOW|HTTP|JOIN...] #3575

Open zengqinglei opened 1 year ago

zengqinglei commented 1 year ago

Describe the bug
When I start Conductor with `docker-compose -f docker-compose.yaml -f docker-compose-dynomite.yaml up`, frequent errors occur: `Error polling system task in queue:[START_WORKFLOW|HTTP|JOIN...]` (see the attached screenshots).

Details
Conductor version: 3.13.5
Persistence implementation: Dynomite
Queue implementation: Dynomite
Lock: No

To Reproduce Steps to reproduce the behavior:

  1. After cloning the code with `git clone {url}`, build the Conductor images with the following commands:
    cd docker/server
    docker build -f Dockerfile -t conductor-server:v3.13.5 ../../
    cd docker/ui
    docker build -f Dockerfile -t conductor-ui:v3.13.5 ../../
  2. Define docker-compose.yaml
    
```yaml
version: '2.3'

services:
  conductor-server:
    environment:

volumes:
  esdata-conductor:
    driver: local

networks:
  internal:
```

3. Define docker-compose-dynomite.yaml
```yaml
version: '2.3'

services:
  conductor-server:
    environment:
      - TZ=Asia/Shanghai
      - CONFIG_PROP=config.properties
    volumes:
      - ./server/config/config.properties:/app/config/config.properties
      - ./server/config/log4j2.xml:/app/config/log4j2.xml
      - ./server/logs:/app/logs
    links:
      - dynomite:dyno1
    depends_on:
      dynomite:
        condition: service_healthy

  dynomite:
    image: v1r3n/dynomite
    networks:
      - internal
    ports:
      - 8102:8102
    healthcheck:
      test: timeout 5 bash -c 'cat < /dev/null > /dev/tcp/localhost/8102'
      interval: 5s
      timeout: 5s
      retries: 12
    logging:
      driver: "json-file"
      options:
        max-size: "1024m"
        max-file: "3"

networks:
  internal:
```

4. Define config.properties

```properties
# Servers.
conductor.grpc-server.enabled=false

# Database persistence type.
conductor.db.type=dynomite

# Dynomite Cluster details.
# format is host:port:rack separated by semicolon
conductor.redis.hosts=dyno1:8102:us-east-1c

# Dynomite cluster name
conductor.redis.clusterName=dyno1

# Namespace for the keys stored in Dynomite/Redis
conductor.redis.workflowNamespacePrefix=conductor

# Namespace prefix for the dyno queues
conductor.redis.queueNamespacePrefix=conductor_queues

# No. of threads allocated to dyno-queues (optional)
queues.dynomite.threads=10

# By default with dynomite, we want the repairservice enabled
conductor.app.workflowRepairServiceEnabled=true

# Non-quorum port used to connect to local redis. Used by dyno-queues.
# When using redis directly, set this to the same port as the redis server.
# For Dynomite, this is 22122 by default, or the local redis-server port used by Dynomite.
conductor.redis.queuesNonQuorumPort=22122

# Elastic search instance indexing is enabled.
conductor.indexing.enabled=true

# Transport address to elasticsearch
conductor.elasticsearch.url=http://es:9200

# Name of the elasticsearch cluster
conductor.elasticsearch.indexName=conductor

conductor.event-queues.amqp.queueType=classic
conductor.event-queues.amqp.sequentialMsgProcessing=true

# Additional modules for metrics collection exposed via logger (optional)
conductor.metrics-logger.enabled=true
conductor.metrics-logger.reportPeriodSeconds=15

# Additional modules for metrics collection exposed to Prometheus (optional)
conductor.metrics-prometheus.enabled=true
management.endpoints.web.exposure.include=prometheus

# To enable Workflow/Task Summary Input/Output JSON Serialization, use the following:
conductor.app.summary-input-output-json-serialization.enabled=true

# Load sample kitchen sink workflow
loadSample=false

conductor.elasticsearch.clusterHealthColor=yellow

logging.config=/app/config/log4j2.xml
```

5. Define log4j2.xml
```xml
<Configuration status="WARN">
    <Appenders>
        <Console name="CONSOLE">
            <PatternLayout pattern="%d{ISO8601} %highlight{%-5level }[%style{%t}{bright,blue}] %style{%C{1.}}{bright,yellow}: %msg%n%throwable"/>
        </Console>
    </Appenders>

    <Loggers>
        <Root level="INFO">
            <AppenderRef ref="CONSOLE" />
        </Root>
    </Loggers>
</Configuration>
```

6. Start up: `docker-compose -f docker-compose.yaml -f docker-compose-dynomite.yaml up`
7. See the error shown in the attached screenshot.

Through the UI I can still create workflow definitions, but I cannot run workflows (see the attached screenshots).

When I start Conductor via `docker-compose up` alone, everything works fine.

Looking forward to your reply as soon as possible, thank you!
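
For anyone hitting the same thing, a quick sanity check is to confirm that the conductor-server container can reach Dynomite on both the client port and the non-quorum Redis port that dyno-queues uses, reusing the same `/dev/tcp` trick as the compose healthcheck above. This is only a sketch: it assumes `bash` and `timeout` are available inside the conductor-server image and that the alias `dyno1` from the `links:` entry resolves.

```bash
# Resolve the conductor-server container started by docker-compose
# (adjust the name filter if your compose project uses a different prefix).
CONDUCTOR=$(docker ps -qf name=conductor-server)

# Dynomite client port (the alias "dyno1" comes from the links: entry above).
docker exec "$CONDUCTOR" timeout 5 bash -c 'cat < /dev/null > /dev/tcp/dyno1/8102' \
  && echo "dyno1:8102 reachable" || echo "dyno1:8102 NOT reachable"

# Non-quorum Redis port used by dyno-queues (conductor.redis.queuesNonQuorumPort).
docker exec "$CONDUCTOR" timeout 5 bash -c 'cat < /dev/null > /dev/tcp/dyno1/22122' \
  && echo "dyno1:22122 reachable" || echo "dyno1:22122 NOT reachable"
```

If the second check fails, the polling errors could simply be dyno-queues failing to reach the local Redis behind Dynomite.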

BradenEads commented 1 year ago

It seems like there might be an issue with the Dynomite configuration in your setup. Here are a few suggestions to troubleshoot and potentially fix the problem:

Check Dynomite logs: Inspect the logs of the Dynomite container to see if there are any errors or issues with the configuration. You can view the logs using the following command:

```bash
docker logs <dynomite_container_id>
```

Replace `<dynomite_container_id>` with the actual container ID of your Dynomite instance.

Verify Dynomite configuration: Make sure that the Dynomite configuration is correct and all required properties are set. You can refer to the official Dynomite documentation for more information on how to configure Dynomite.

Test Dynomite connectivity: Check if Conductor can connect to the Dynomite instance by running a simple command from within the Conductor container:

```bash
docker exec -it <conductor_container_id> redis-cli -h dyno1 -p 8102 ping
```

Replace `<conductor_container_id>` with the actual container ID of your Conductor instance. This command should return PONG if the connection is successful.

Test Dynomite cluster: Make sure that the Dynomite cluster is properly set up and running. If you are using a single-node setup for testing purposes, ensure that the dynomite service in the docker-compose-dynomite.yaml file has the correct configuration.

Inspect Conductor logs: Review the logs of the Conductor container for any error messages or issues related to Dynomite connectivity or configuration. You can view the logs using the following command:

```bash
docker logs <conductor_container_id>
```

Replace `<conductor_container_id>` with the actual container ID of your Conductor instance.

By following these steps, you should be able to identify the root cause of the issue and resolve it. If the issue persists, please provide more information about your setup, any error messages, and logs to help diagnose the problem.
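
To zero in on the failure reported here, it can also help to filter the Conductor logs for the polling error itself. A minimal sketch (the container ID is a placeholder, and the search string is the error text from this thread):

```bash
# Show only the "Error polling system task" lines, with a few lines of context,
# from the conductor-server container; replace the placeholder with your container.
docker logs <conductor_container_id> 2>&1 | grep -i -A 3 "Error polling system task"
```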

zengqinglei commented 1 year ago

@BradenEads I ran `docker logs -f dynomite` and stopped the conductor-server; the Dynomite log output is shown in the attached screenshot.

I did not find any error in the log. My docker-compose.yaml for starting the container is as follows; can you help analyze it?

```yaml
version: '2.3'

services:
  dynomite:
    environment:
      - TZ=Asia/Shanghai
    container_name: dynomite
    image: v1r3n/dynomite
    networks:
      - internal
    ports:
      - 8102:8102
    healthcheck:
      test: timeout 5 bash -c 'cat < /dev/null > /dev/tcp/localhost/8102'
      interval: 5s
      timeout: 5s
      retries: 12
    logging:
      driver: "json-file"
      options:
        max-size: "1024m"
        max-file: "3"
```
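
One thing worth checking with this setup is whether the Redis instance behind Dynomite is actually listening on the non-quorum port (22122) that `conductor.redis.queuesNonQuorumPort` points at, since dyno-queues talks to that port directly. A rough sketch, assuming `redis-cli` is bundled in the `v1r3n/dynomite` image (if it is not, the `/dev/tcp` check from the healthcheck above works just as well):

```bash
# Dynomite client port (proxied Redis protocol).
docker exec dynomite redis-cli -p 8102 ping

# Local backing Redis on the non-quorum port used by dyno-queues.
docker exec dynomite redis-cli -p 22122 ping
```

Both commands should print PONG; if the second one fails, the queue polling errors could point to a port/configuration mismatch rather than a Conductor bug.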
lijia-rengage commented 1 year ago

I deployed Dynomite and Conductor on Kubernetes, and I have the same issue.

My Conductor config:

```properties
# Servers.
conductor.grpc-server.enabled=false

# Database persistence type.
conductor.db.type=dynomite

# Dynomite Cluster details.
# format is host:port:rack separated by semicolon
conductor.redis.hosts=dynomite.conductor.svc.cluster.local:8102:us-east-1b

# Dynomite cluster name
conductor.redis.clusterName=dynomite

# Namespace for the keys stored in Dynomite/Redis
conductor.redis.workflowNamespacePrefix=conductor

# Namespace prefix for the dyno queues
conductor.redis.queueNamespacePrefix=conductor_queues

# No. of threads allocated to dyno-queues (optional)
queues.dynomite.threads=10

# By default with dynomite, we want the repairservice enabled
conductor.app.workflowRepairServiceEnabled=true

# Non-quorum port used to connect to local redis.  Used by dyno-queues.
# When using redis directly, set this to the same port as redis server
# For Dynomite, this is 22122 by default or the local redis-server port used by Dynomite.
conductor.redis.queuesNonQuorumPort=22122

# Elastic search instance indexing is enabled.
conductor.indexing.enabled=true

# Transport address to elasticsearch
conductor.elasticsearch.url=http://elasticsearch.conductor.svc.cluster.local:9200

# Name of the elasticsearch cluster
conductor.elasticsearch.indexName=conductor
#conductor.event-queues.amqp.queueType=classic
#conductor.event-queues.amqp.sequentialMsgProcessing=true

# Additional modules for metrics collection exposed via logger (optional)
# conductor.metrics-logger.enabled=true
# conductor.metrics-logger.reportPeriodSeconds=15

# Additional modules for metrics collection exposed to Prometheus (optional)
# conductor.metrics-prometheus.enabled=true
# management.endpoints.web.exposure.include=prometheus

# To enable Workflow/Task Summary Input/Output JSON Serialization, use the following:
# conductor.app.summary-input-output-json-serialization.enabled=true

# Load sample kitchen sink workflow
loadSample=true
```

And I use the default Dynomite configuration.

This is the error information (see the attached screenshot).

I checked the source code, and the build dependencies look a bit odd: `@ComponentScan(basePackages = {"com.netflix.conductor", "io.orkes.conductor"})`. The Orkes packages seem to play a part in this.

Dynomite can't load the Lua script successfully.

Does anyone have any ideas about this?
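
If the failure really is about Lua scripts, one cheap way to confirm it is to send a trivial EVAL straight at the Dynomite endpoint from the config above. Dynomite only proxies a subset of Redis commands, so a rejected EVAL would explain why a script-based queue implementation cannot work through it. A sketch only, assuming `redis-cli` is available from wherever you run it:

```bash
# Try a trivial Lua script against the Dynomite endpoint Conductor is configured with.
# An "unsupported command" style error here means scripts cannot run through Dynomite.
redis-cli -h dynomite.conductor.svc.cluster.local -p 8102 EVAL "return 1" 0
```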

manan164 commented 1 year ago

Hi @lijia-rengage, Dynomite is no longer supported. Can you please try with the latest build, which has orkes-queue?
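
For anyone moving off Dynomite while staying on Redis-backed persistence, the switch in config.properties mostly comes down to the persistence type and the host line. A minimal sketch that reuses only property names already shown in this thread; `rs:6379` is a placeholder for your Redis host/port, and the queue implementation in newer builds (orkes-queue, as mentioned above) may need its own settings, so verify against the config templates for your Conductor version:

```bash
# Sketch: point config.properties at a standalone Redis instead of Dynomite.
# "rs:6379" and the rack name are placeholders; adjust to your environment.
cat > server/config/config.properties <<'EOF'
conductor.db.type=redis_standalone
# format is host:port:rack separated by semicolon
conductor.redis.hosts=rs:6379:us-east-1c
conductor.indexing.enabled=true
conductor.elasticsearch.url=http://es:9200
conductor.elasticsearch.indexName=conductor
loadSample=false
EOF
```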

lijia-rengage commented 1 year ago

> Hi @lijia-rengage, Dynomite is no longer supported. Can you please try with the latest build, which has orkes-queue?

Could you please tell me why Dynomite was abandoned, and what the advantages of using orkes-queue are instead? Thanks a lot! I will probably use it in a production environment, so its performance and stability matter to me.

Ash-win commented 1 year ago

It is the same behavior with the latest release.