Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Workflow occasionally terminates unexpectedly while running, and the task is left in the canceled state #3576

Open zengqinglei opened 1 year ago

zengqinglei commented 1 year ago

Describe the bug

When I run a workflow that contains several tasks sending HTTP requests asynchronously, the workflow occasionally and randomly terminates unexpectedly, and the task ends up in the canceled state.

Details

Conductor version: 3.13.0~3.13.5

Persistence implementation: postgres

Queue implementation: redis

Lock: No

Workflow definition:

{
  "createTime": 1672215603359,
  "updateTime": 1672217295385,
  "accessPolicy": {},
  "name": "multi_times_request_test",
  "description": "多次请求测试",
  "version": 7,
  "tasks": [
    {
      "name": "等待15秒",
      "taskReferenceName": "wait_data_write",
      "inputParameters": {
        "duration": "15seconds"
      },
      "type": "WAIT",
      "startDelay": 0,
      "optional": false,
      "asyncComplete": false
    },
    {
      "name": "基础能力开通",
      "taskReferenceName": "baseOpen",
      "inputParameters": {},
      "type": "FORK_JOIN",
      "forkTasks": [
        [
          {
            "name": "开通yunlianTask",
            "taskReferenceName": "openyunlianTask",
            "inputParameters": {
              "inputValue": "${workflow.input.APP_AUTHS_MAP}"
            },
            "type": "SWITCH",
            "decisionCases": {
              "open": [
                {
                  "name": "开通yunlian",
                  "taskReferenceName": "openyunlian",
                  "inputParameters": {
                    "http_request": {
                      "uri": "https://mock.uutool.cn/test_request",
                      "method": "POST",
                      "connectionTimeOut": 30000,
                      "readTimeOut": 30000,
                      "headers": {
                        "Authorization": "Bearer token"
                      },
                      "body": {
                        "title": "open_yl",
                        "platform_code": "yunfuwu",
                        "cst_buguid": "5b2c083c-49e4-45f7-8bff-8425db433c40",
                        "rds_provider": "huaweiyun",
                        "envDeployMode": 1,
                        "enterpriseId": 1,
                        "enterpriseName": "yunlian",
                        "enterpriseCode": "yl"
                      }
                    }
                  },
                  "type": "HTTP",
                  "startDelay": 0,
                  "optional": false,
                  "asyncComplete": true
                }
              ]
            },
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "evaluatorType": "javascript",
            "expression": "$.inputValue!=null && $.inputValue.cyyl != null ? 'open' : 'defalut' "
          }
        ],
        [
          {
            "name": "开通yunkongjianTask",
            "taskReferenceName": "openyunkongjianTask",
            "inputParameters": {
              "inputValue": "${workflow.input.APP_AUTHS_MAP}"
            },
            "type": "SWITCH",
            "decisionCases": {
              "open": [
                {
                  "name": "开通yunkongjian",
                  "taskReferenceName": "openyunkongjian",
                  "inputParameters": {
                    "http_request": {
                      "uri": "https://mock.uutool.cn/test_request",
                      "method": "POST",
                      "connectionTimeOut": 30000,
                      "readTimeOut": 30000,
                      "headers": {
                        "Authorization": "Bearer token"
                      },
                      "body": {
                        "title": "open_ykj",
                        "taskId": "${workflow.workflowId}_openyunkongjian"
                      }
                    }
                  },
                  "type": "HTTP",
                  "startDelay": 0,
                  "optional": false,
                  "asyncComplete": true
                }
              ]
            },
            "startDelay": 0,
            "optional": false,
            "asyncComplete": false,
            "evaluatorType": "javascript",
            "expression": "$.inputValue!=null && $.inputValue.cyykj != null ? 'open' : 'defalut' "
          }
        ]
      ],
      "startDelay": 0,
      "optional": false,
      "asyncComplete": true
    },
    {
      "name": "基础能力开通结束",
      "taskReferenceName": "baseOpenAfter",
      "inputParameters": {},
      "type": "JOIN",
      "startDelay": 0,
      "joinOn": [
        "openyunlianTask",
        "openyunkongjian"
      ],
      "optional": false,
      "asyncComplete": false
    },
    {
      "name": "发送email",
      "taskReferenceName": "sendEmail",
      "inputParameters": {
        "http_request": {
          "headers": {
            "Authorization": "Bearer token"
          },
          "connectionTimeOut": 30000,
          "readTimeOut": 30000,
          "uri": "https://mock.uutool.cn/test_request",
          "method": "POST",
          "body": {
            "title": "send_mail",
            "conductorRunId": "${workflow.workflowId}"
          }
        }
      },
      "type": "HTTP",
      "startDelay": 0,
      "optional": true,
      "asyncComplete": false
    }
  ],
  "inputParameters": [],
  "outputParameters": {},
  "schemaVersion": 2,
  "restartable": true,
  "workflowStatusListenerEnabled": false,
  "ownerEmail": "example@email.com",
  "timeoutPolicy": "ALERT_ONLY",
  "timeoutSeconds": 0,
  "variables": {},
  "inputTemplate": {}
}
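
For reference, both SWITCH expressions in the definition above only test for null, so an empty list such as `"cyyl": []` still selects the `open` case. A minimal Python sketch of that routing logic follows; `switch_case` is a hypothetical helper that mirrors the JavaScript expression and is not part of Conductor.

```python
from typing import Any, Mapping, Optional


def switch_case(input_value: Optional[Mapping[str, Any]], key: str) -> str:
    """Mirror of: $.inputValue != null && $.inputValue.<key> != null ? 'open' : 'defalut'."""
    if input_value is not None and input_value.get(key) is not None:
        return "open"
    # Spelled as in the workflow definition, which only declares an 'open' case.
    return "defalut"
```

With the input used in the reproduction (`cyyl` and `cyykj` both set to empty lists), this returns `open` for both branches, so both HTTP tasks would be expected to be scheduled.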

To Reproduce

Steps to reproduce the behavior:

  1. Clone the code with git clone {url}, then add the postgres persistence dependency to server/build.gradle (vim server/build.gradle):

    runtimeOnly "org.glassfish.jaxb:jaxb-runtime:${revJAXB}"
    runtimeOnly "com.netflix.conductor:conductor-postgres-persistence:3.13.5"

    Build the images:

    cd docker/server
    docker build -f Dockerfile -t conductor-server:v3.13.5 ../../
    cd ../ui
    docker build -f Dockerfile -t conductor-ui:v3.13.5 ../../
  2. Define docker-compose.yaml

    version: '2.3'

    services:
      conductor-server:
        environment:

    volumes:
      esdata-conductor:
        driver: local

    networks:
      internal:

3. Define docker-compose-postgres.yaml
``` yaml
version: '2.3'

services:
  conductor-server:
    environment:
      - TZ=Asia/Shanghai
      - CONFIG_PROP=config-postgres.properties
    image: hub.mingyuanyun.com/tools/conductor:server-v3.13.5
    container_name: conductor-server
    networks:
      - internal
    ports:
      - 8080:8080
    volumes:
      - ./server/config/config-postgres.properties:/app/config/config-postgres.properties
      - ./server/config/log4j2.xml:/app/config/log4j2.xml
      - ./server/logs:/app/logs
    healthcheck:
      test: [ "CMD", "curl","-I" ,"-XGET", "http://localhost:8080/health" ]
      interval: 60s
      timeout: 30s
      retries: 12
    links:
      - elasticsearch:es
      - redis:rs
      - postgres:postgresdb
    depends_on:
      elasticsearch:
        condition: service_healthy
      redis:
        condition: service_healthy
      postgres:
        condition: service_healthy
    logging:
      driver: "json-file"
      options:
        max-size: "1024m"
        max-file: "3"

  postgres:
    image: postgres
    environment:
      - TZ=Asia/Shanghai
      - POSTGRES_USER=conductor
      - POSTGRES_PASSWORD=conductor
    volumes:
      - pgdata-conductor:/var/lib/postgresql/data
    networks:
      - internal
    ports:
      - 5432:5432
    healthcheck:
      test: timeout 5 bash -c 'cat < /dev/null > /dev/tcp/localhost/5432'
      interval: 5s
      timeout: 5s
      retries: 12
    logging:
      driver: "json-file"
      options:
        max-size: "500m"
        max-file: "3"

volumes:
  pgdata-conductor:
    driver: local

networks:
  internal:
```

4. Define config-postgres.properties
    
    # Servers.
    conductor.grpc-server.enabled=false

    # Database persistence type.
    conductor.db.type=postgres

    spring.datasource.url=jdbc:postgresql://postgres:5432/conductor
    spring.datasource.username=conductor
    spring.datasource.password=conductor

    # Hikari pool sizes are -1 by default and prevent startup
    spring.datasource.hikari.maximum-pool-size=10
    spring.datasource.hikari.minimum-idle=2

    # Dynomite Cluster details.
    # format is host:port:rack separated by semicolon
    conductor.redis.hosts=rs:6379:us-east-1c

    # Namespace for the keys stored in Dynomite/Redis
    conductor.redis.workflowNamespacePrefix=conductor

    # Namespace prefix for the dyno queues
    conductor.redis.queueNamespacePrefix=conductor_queues

    # No. of threads allocated to dyno-queues (optional)
    queues.dynomite.threads=10

    # By default with dynomite, we want the repairservice enabled
    conductor.app.workflowRepairServiceEnabled=true

    # Non-quorum port used to connect to local redis. Used by dyno-queues.
    # When using redis directly, set this to the same port as redis server
    # For Dynomite, this is 22122 by default or the local redis-server port used by Dynomite.
    conductor.redis.queuesNonQuorumPort=22122

    # Elastic search instance indexing is enabled.
    conductor.indexing.enabled=true

    # Transport address to elasticsearch
    conductor.elasticsearch.url=http://es:9200
    conductor.elasticsearch.indexReplicasCount=0

    # Name of the elasticsearch cluster
    conductor.elasticsearch.indexName=conductor

    # Load sample kitchen sink workflow
    loadSample=false

    conductor.elasticsearch.clusterHealthColor=yellow

    logging.config=/app/config/log4j2.xml
    logging.log4j2.config.override=/app/config/log4j2.xml

5. Define log4j2.xml
``` xml
<Configuration status="WARN">
    <Appenders>
        <Console name="CONSOLE">
            <PatternLayout pattern="%d{ISO8601} %highlight{%-5level }[%style{%t}{bright,blue}] %style{%C{1.}}{bright,yellow}: %msg%n%throwable"/>
        </Console>
    </Appenders>

    <Loggers>
        <Root level="INFO">
            <AppenderRef ref="CONSOLE" />
        </Root>
    </Loggers>
</Configuration>
```

6. Start up: docker-compose -f docker-compose.yaml -f docker-compose-postgres.yaml up
  7. Visit http://localhost:5000 and create a workflow from the workflow definition listed under Details above.
  8. Run this workflow multiple times with the following input:

    {
      "APP_AUTHS_MAP": {
        "cyyl": [],
        "cyykj": []
      }
    }

    image

  9. Occasionally, at random, the workflow is then terminated and the task is left in the canceled state. (image) The server log shows an entry like the following: (image)
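
The "run this workflow multiple times" step can be scripted instead of clicked through in the UI. The following is a minimal sketch, not a verified reproduction script. It assumes the Conductor REST API is reachable at http://localhost:8080/api (the port published in the compose file) and that `POST /workflow/{name}` returns the new workflow id while `GET /workflow/{workflowId}` returns a JSON document with a `status` field. The helper names are hypothetical.

```python
import json
import urllib.request
from collections import Counter
from typing import Iterable, List

BASE_URL = "http://localhost:8080/api"  # assumption: server port 8080 from the compose file
WORKFLOW_NAME = "multi_times_request_test"
WORKFLOW_INPUT = {"APP_AUTHS_MAP": {"cyyl": [], "cyykj": []}}


def build_start_request(base_url: str, name: str, payload: dict) -> urllib.request.Request:
    """Build the POST request that starts one execution of the named workflow."""
    return urllib.request.Request(
        url=f"{base_url}/workflow/{name}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def start_workflow(base_url: str, name: str, payload: dict) -> str:
    """Start the workflow and return the workflow id sent back as the response body."""
    with urllib.request.urlopen(build_start_request(base_url, name, payload)) as resp:
        return resp.read().decode("utf-8").strip()


def fetch_status(base_url: str, workflow_id: str) -> str:
    """Read the current status field of one execution."""
    with urllib.request.urlopen(f"{base_url}/workflow/{workflow_id}") as resp:
        return json.load(resp)["status"]


def summarize(statuses: Iterable[str]) -> Counter:
    """Count executions per status so that unexpected terminations stand out."""
    return Counter(statuses)


def start_batch(count: int) -> List[str]:
    """Start `count` executions and return their workflow ids."""
    return [start_workflow(BASE_URL, WORKFLOW_NAME, WORKFLOW_INPUT) for _ in range(count)]
```

Once the stack is up, `ids = start_batch(20)` followed some minutes later by `summarize(fetch_status(BASE_URL, i) for i in ids)` gives a per-status count. Since the failure is reported to show up anywhere between a few seconds and more than ten minutes after the start, the status check would need to be repeated over that window.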

This problem has appeared frequently and randomly in our production environment, causing serious business flow problems. I hope your team can help us analyze the possible causes and provide some solutions as soon as possible. Thank you very much!

BradenEads commented 1 year ago

If it's of some help, based on the information provided, the issue seems to be related to the asynchronous execution of the HTTP tasks within the Conductor workflow.

You could try increasing timeouts. It's possible that the HTTP requests are taking longer to complete than the specified connection and read timeouts. You can try increasing the connection and read timeouts for the HTTP tasks to give them more time to complete.
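
As an illustration of the timeout suggestion, the values can be raised for every HTTP task in one pass rather than edited by hand. This is only a sketch: `raise_http_timeouts` is a hypothetical helper, and it assumes the workflow definition has been loaded into a Python dict and that all timeouts live under `http_request` as `connectionTimeOut` and `readTimeOut`, as they do in the definition posted in this issue.

```python
import copy
from typing import Any


def raise_http_timeouts(definition: Any, timeout_ms: int) -> Any:
    """Return a copy of a workflow definition with every http_request timeout set to timeout_ms."""
    patched = copy.deepcopy(definition)
    _patch(patched, timeout_ms)
    return patched


def _patch(node: Any, timeout_ms: int) -> None:
    # Walk dicts and lists recursively so nested forkTasks and decisionCases are covered.
    if isinstance(node, dict):
        request = node.get("http_request")
        if isinstance(request, dict):
            request["connectionTimeOut"] = timeout_ms
            request["readTimeOut"] = timeout_ms
        for value in node.values():
            _patch(value, timeout_ms)
    elif isinstance(node, list):
        for item in node:
            _patch(item, timeout_ms)
```

For example, `raise_http_timeouts(definition, 120000)` would replace the 30000 ms values with 120000 ms, an arbitrary figure chosen only for illustration.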

It's also worth noting that in your workflow definition, the joinOn field of the "task4" task refers to "openyunlianTask" and "openyunkongjian", but these taskReferenceNames do not exist in your workflow. Make sure to update the joinOn field to use the correct taskReferenceNames of the tasks that need to be joined.
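
Whether every `joinOn` entry matches a `taskReferenceName` can be checked mechanically. The sketch below assumes the workflow definition is available as a Python dict and that tasks nest only through `forkTasks`, `decisionCases`, and `defaultCase`; the helper names are hypothetical and not part of Conductor.

```python
from typing import Any, Dict, Iterator, List


def iter_tasks(tasks: List[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield every task, descending into fork branches and SWITCH cases."""
    for task in tasks:
        yield task
        for branch in task.get("forkTasks", []):
            yield from iter_tasks(branch)
        for case_tasks in task.get("decisionCases", {}).values():
            yield from iter_tasks(case_tasks)
        yield from iter_tasks(task.get("defaultCase", []))


def unknown_join_refs(workflow: Dict[str, Any]) -> Dict[str, List[str]]:
    """Map each JOIN task's reference name to the joinOn entries that match no task."""
    all_tasks = list(iter_tasks(workflow["tasks"]))
    known = {task["taskReferenceName"] for task in all_tasks}
    result: Dict[str, List[str]] = {}
    for task in all_tasks:
        if task.get("type") == "JOIN":
            missing = [ref for ref in task.get("joinOn", []) if ref not in known]
            if missing:
                result[task["taskReferenceName"]] = missing
    return result
```

Applied to the definition as it currently appears in this issue, both `joinOn` names resolve to a task: `openyunlianTask` is the SWITCH in the first fork branch, while `openyunkongjian` is the HTTP task nested inside the SWITCH of the second branch.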

zengqinglei commented 1 year ago

@BradenEads I have updated my workflow definition. The task names in joinOn were inconsistent because I deleted some data to avoid leaking sensitive information and forgot to update joinOn accordingly.

Increasing the timeout of my tasks does not solve the problem, because it occurs very randomly: sometimes the task is canceled and the workflow terminated after about 5 seconds, and sometimes this does not happen until more than ten minutes have passed.

Given the situation described above, do you have any other ideas about the possible cause?

manan164 commented 1 year ago

Hi @zengqinglei, please check the logs for entries like "Execution terminated of workflow". They will include an exception explaining why the workflow was terminated. Most likely a task failed and exhausted its retry count, which triggered the terminate-workflow exception.
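
One way to run this kind of log check is to scan the server log for termination messages and collect the workflow id and the stated reason. The sketch below is an assumption-laden example: the pattern follows the message text the reporter quotes later in this thread ("Workflow {workflowId} is terminated because of ..."), the rest of the log line layout is not known, and `find_terminations` is a hypothetical helper.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: the message text matches the line quoted in this thread,
# "Workflow {workflowId} is terminated because of {reason}".
TERMINATED = re.compile(r"Workflow (\S+) is terminated because of (.*)$")


def find_terminations(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (workflow_id, reason) pairs for every termination message found."""
    hits = []
    for line in lines:
        match = TERMINATED.search(line.rstrip("\n"))
        if match:
            hits.append((match.group(1), match.group(2).strip()))
    return hits
```

Passing an open log file to `find_terminations` yields one pair per terminated workflow. Entries whose reason is `null` carry no explanation, so the surrounding log lines for the same workflow id are the next place to look.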

zengqinglei commented 1 year ago

@manan164 I have already checked the log. My task has asyncComplete: true, which means it is an asynchronous task: after the request is initiated, the task waits for the service to finish processing on its own. During this wait, the task may be randomly and unexpectedly canceled, and the log contains only one entry: Workflow {workflowId} is terminated because of null. This will probably keep happening frequently.
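
For readers unfamiliar with asyncComplete: the HTTP task stays in progress after the request is sent, and the external service is expected to report the result back to Conductor later. A minimal sketch of that callback follows, assuming the Conductor REST endpoint `POST /api/tasks` accepts a task result with `workflowInstanceId`, `taskId`, `status`, and `outputData`. The helper name is hypothetical, and the ids would have to come from the running execution.

```python
import json
import urllib.request


def build_task_update(base_url: str, workflow_id: str, task_id: str, output: dict) -> urllib.request.Request:
    """Build the POST request that marks an in-progress task as COMPLETED."""
    body = {
        "workflowInstanceId": workflow_id,
        "taskId": task_id,
        "status": "COMPLETED",
        "outputData": output,
    }
    return urllib.request.Request(
        url=f"{base_url}/tasks",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request would be sent with `urllib.request.urlopen(...)` by the service once its own processing is done. The cancellation described above happens before that callback arrives, while the task is still waiting.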