Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0
12.83k stars 2.34k forks source link

Cannot poll some Scheduled tasks #86

Closed niwy closed 7 years ago

niwy commented 7 years ago

Deployment: SampleWorkers and the Server located on the same machine OS: tried on win 7 and linux respectively

Hello,

In my test, I always start 100 instances of an workflow with 2 parallel tasks. The workers could complete some tasks (the number changes every time) before getting stuck. Compared to the SampleWorker example, I only modified pollInterval and threadCount.

I checked the in_progress tasks with swagger, they look fine to me. But a poll request got empty response body with code 204. Here is an example:

{
    "workflowInstanceId": "a273b2fb-3e29-4323-a694-5f4821723228",
    "taskId": "4750c5ea-6db8-45a1-8d79-7dec96f52c0d",
    "callbackAfterSeconds": 0,
    "status": "SCHEDULED",
    "taskType": "task_2",
    "referenceTaskName": "task_2",
    "retryCount": 0,
    "seq": 3,
    "pollCount": 0,
    "taskDefName": "task_2",
    "scheduledTime": 1487298595800,
    "startTime": 0,
    "endTime": 0,
    "updateTime": 1487298595800,
    "startDelayInSeconds": 0,
    "retried": false,
    "callbackFromWorker": true,
    "responseTimeoutSeconds": 3600,
    "queueWaitTime": 0
  }

Any idea for this problem? Thanks.

v1r3n commented 7 years ago

@niwy are you running the server with in memory database, redis or dynomite? If running dynomite, what is the cluster size?

niwy commented 7 years ago

@v1r3n The Server is running with in memory database.

v1r3n commented 7 years ago

@niwy can you share the code if possible?

niwy commented 7 years ago

@v1r3n Sorry, I can't share the source code.

Compared to the main method in SampleWorker example,my "threadCount" = 200, and I change pollInterval before init().

coordinator.withPollInterval(5);

Besides, I added TimeUnit.SECONDS.sleep(5) in the execute() method to mimic an execution duration.

v1r3n commented 7 years ago

@niwy This is probably related to the to the in-memory redis implementation library used for loading up in memory server. For the production usage, you want to use dynomite (for HA configuration) or Redis server rather than in-memory database. I will try and see if I can fix the bug with the in-memory redis library or use an alternate implementation to avoid this issue.

niwy commented 7 years ago

@v1r3n Thanks, I will try with Redis first.

cquon commented 7 years ago

I am running into an issue similar but it seems to be the opposite. With an in memory database, I am able to poll for tasks. However, when I use redis / dynomite, I am receiving Status 204 when polling for task. Showing output from python wrapper I have for API calls.

In memory database: (java -jar build/libs/conductor-server-1.8.0-SNAPSHOT-all.jar)

$ python conductor.py getRunningWorkflows corey_flow_1 Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):

Received response from Conductor Server (localhost): Status: 200 []

$ python conductor.py  createTaskMetadata '[
  {
    "name": "corey_task_1",
    "description": "corey_task_1",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  },
  {
    "name": "corey_task_2",
    "description": "corey_task_2",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  }
]'
Sending [POST] request to Conductor Server (http://localhost:8080/api/metadata/taskdefs):
Body:
[
  {
    "name": "corey_task_1",
    "description": "corey_task_1",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  },
  {
    "name": "corey_task_2",
    "description": "corey_task_2",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  }
]

Received response from Conductor Server (localhost):
Status: 204

$  python conductor.py createWorkflowMetadata '  {
    "name": "corey_flow_1",
    "description": "A Simple Corey Workflow with 2 tasks",
    "version": 1,
    "tasks": [
      {
        "name": "corey_task_1",
        "taskReferenceName": "corey_task_1",
        "type": "SIMPLE",
        "startDelay": 0
      },
      {
        "name": "corey_task_2",
        "taskReferenceName": "corey_task_2",
        "type": "SIMPLE",
        "startDelay": 0
      }
    ],
    "schemaVersion": 2
  }'
Sending [POST] request to Conductor Server (http://localhost:8080/api/metadata/workflow):
Body:
  {
    "name": "corey_flow_1",
    "description": "A Simple Corey Workflow with 2 tasks",
    "version": 1,
    "tasks": [
      {
        "name": "corey_task_1",
        "taskReferenceName": "corey_task_1",
        "type": "SIMPLE",
        "startDelay": 0
      },
      {
        "name": "corey_task_2",
        "taskReferenceName": "corey_task_2",
        "type": "SIMPLE",
        "startDelay": 0
      }
    ],
    "schemaVersion": 2
  }

Received response from Conductor Server (localhost):
Status: 204

$ python conductor.py getRunningWorkflows corey_flow_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):

Received response from Conductor Server (localhost):
Status: 200
[]

$python conductor.py startWorkflow corey_flow_1
Sending [POST] request to Conductor Server (http://localhost:8080/api/workflow/corey_flow_1?):
Body:
{}

Received response from Conductor Server (localhost):
Status: 200
2c7b52c6-1bb8-45a3-a852-7ab87e9d47ac

$ python conductor.py startWorkflow corey_flow_1
Sending [POST] request to Conductor Server (http://localhost:8080/api/workflow/corey_flow_1?):
Body:
{}

Received response from Conductor Server (localhost):
Status: 200
e8803cdc-fc99-474e-a7c3-c21fb47cbe76

$ python conductor.py getRunningWorkflows corey_flow_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):

Received response from Conductor Server (localhost):
Status: 200
["2c7b52c6-1bb8-45a3-a852-7ab87e9d47ac","e8803cdc-fc99-474e-a7c3-c21fb47cbe76"]

Then when I poll for task I properly get output:
$ python conductor.py pollTask corey_task_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/tasks/poll/corey_task_1):

Received response from Conductor Server (localhost):
Status: 200
{"taskType":"corey_task_1","status":"IN_PROGRESS","referenceTaskName":"corey_task_1","retryCount":0,"seq":1,"pollCount":1,"taskDefName":"corey_task_1","scheduledTime":1490042357212,"startTime":1490042384801,"endTime":0,"updateTime":1490042384830,"startDelayInSeconds":0,"retried":false,"callbackFromWorker":true,"responseTimeoutSeconds":3600,"workflowInstanceId":"2c7b52c6-1bb8-45a3-a852-7ab87e9d47ac","taskId":"919f03f2-34e6-4931-92b1-00b3ac59d0c1","callbackAfterSeconds":0,"queueWaitTime":27589,"taskStatus":"IN_PROGRESS"}

When I run the server with dynomite: (java -jar build/libs/conductor-server-1.8.0-SNAPSHOT-all.jar src/main/resources/server.properties)

$ cat src/main/resources/server.properties

#
# Copyright 2017 Netflix, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
db=dynomite

#Dynomite Cluster details.

#Dynomite cluster name
workflow.dynomite.cluster.name=dyn_o_mite

#format is host:port:rack separated by semicolon  
#workflow.dynomite.cluster.hosts=host1:port:rack;host2:port:rack:host3:port:rack
workflow.dynomite.cluster.hosts=127.0.0.1:8102:us-east-1a
#;127.0.0.2:8102:us-east-1b;127.0.0.3:8102:us-east-1c

#namespace for the keys stored in Dynomite/Redis 
workflow.namespace.prefix=conductor

#namespace prefix for the dyno queues
workflow.namespace.queue.prefix=conductor_queues

#no. of threads allocated to dyno-queues
queues.dynomite.threads=10

#non-quorum port used to connect to local redis.  Used by dyno-queues 
queues.dynomite.nonQuorum.port=6379

#Transport address to elasticsearch
#workflow.elasticsearch.url=localhost:7003

#Name of the elasticsearch cluster
#workflow.elasticsearch.index.name=wfe

On startup since I already have workflows created in this database:

$ python conductor.py getRunningWorkflows corey_flow_1 Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):

Received response from Conductor Server (localhost): Status: 200 ["9af452ed-10bd-479d-bdde-09f309428b96","cd313f1b-203f-45cd-b64e-0bf5a7d9ac90","8a730c84-65a1-48cd-b4e2-5e4c5e247a9f","0ebdb647-5066-484c-9fbf-103459344ea9","404a17ea-c7b3-4175-8c36-04ec85826a7e","cfcdd5af-beea-49f1-8468-23af144ee015","a277a3f8-5d54-44e9-8ca1-2fab45329d0b","6b91f1b9-474b-457e-b4c9-f3032779b43b","c1961de0-a36a-4aa2-aea9-74f73578bb08","aa805924-afeb-46c8-8e0a-99529aaba1b1","4eb67155-2d32-4fee-8991-81d2469391bc","52909d2a-7c69-4ca9-8c2c-254f1de6f03a","4b4861d5-45d3-4b13-85bf-f5554a35c3c1","319f9bf7-604d-476b-984c-a0883664c61d","80360587-a768-4c90-86a4-62757bc01e28","f0955cee-e3e8-4357-bd75-f1f42cb7adde","b8ee577d-48d1-4c56-b335-2fecd71a8552","3a590ccb-fbae-4335-8a4e-674909c13b85","1faf9c86-244f-4516-acf7-2f7e90e2f952","8efe7ad0-f702-4522-b3f6-cfcb23f1715c","a58f4067-75e3-4167-8a24-e0d05df5a86a","08845e9e-32ce-42cc-ba4d-d2b9b45b42cd"]

However when I try to poll for a task now I am receiving 204:

$ python conductor.py pollTask corey_task_1 Sending [GET] request to Conductor Server (http://localhost:8080/api/tasks/poll/corey_task_1):

Received response from Conductor Server (localhost): Status: 204

I am not sure if I am doing something wrong, but I expect the poll to work the same.

Thanks.

v1r3n commented 7 years ago

@cquon if you are running using dynomite, can you also add the following line to the property to set the rack/availability zone of the conductor server to be same as dynomite cluster config?

EC2_AVAILABILTY_ZONE=us-east-1a

I have updated the comments in the config file to reflect that. PS: formatted the files in your comment.

cquon commented 7 years ago

Thanks! Fixed the issue

cheveyo20 commented 6 years ago

fixed typo :) EC2_AVAILABILITY_ZONE=us-east-1a

JasperB commented 5 years ago

@v1r3n Hi regarding this issue, I have one question if we're using the aws redis cluster directly, do we need make sure the rack/availability zone of the conductor server to be same as redis cluster zone config? I'm asking this question because we found some workflow stuck in RUNNING status and some of the tasks pending as IN_PROGRESS, not sure it caused by our some of our conductor server zone not same as the redis cluster? In the meanwhile I checked the queue zrange conductor_queues.test.QUEUE._deciderQueue.ap-northeast-2c 0 -1 all workflows in this Q, seems has above issue, please help to give any suggestion, thanks!

srmocher commented 5 years ago

@cquon if you are running using dynomite, can you also add the following line to the property to set the rack/availability zone of the conductor server to be same as dynomite cluster config?

EC2_AVAILABILTY_ZONE=us-east-1a

I have updated the comments in the config file to reflect that. PS: formatted the files in your comment.

This is arguably a big deal and should be mentioned in the docs, especially if one is using a standalone redis deployment.