Closed: @niwy closed this issue 7 years ago.
@niwy are you running the server with an in-memory database, Redis, or Dynomite? If running Dynomite, what is the cluster size?
@v1r3n The server is running with the in-memory database.
@niwy can you share the code if possible?
@v1r3n Sorry, I can't share the source code.
Compared to the main method in the SampleWorker example, my `threadCount` is 200, and I changed the poll interval before `init()`:

```java
coordinator.withPollInterval(5);
```

Besides, I added `TimeUnit.SECONDS.sleep(5)` in the `execute()` method to mimic an execution duration.
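For reference, the polling setup described above can be sketched as a plain loop (a Python sketch of the same idea the Java coordinator implements; `poll_fn` and `execute_fn` are illustrative stand-ins, not part of the Conductor client):

```python
import time

def make_worker(poll_fn, execute_fn, poll_interval=5.0, max_polls=None):
    """Poll for tasks on a fixed interval and run execute_fn on each one.

    poll_fn returns a task dict, or None when the server answers 204
    (no task available). This mirrors the coordinator.withPollInterval(5)
    setup above; execute_fn stands in for the SampleWorker execute().
    """
    def run():
        results = []
        polls = 0
        while max_polls is None or polls < max_polls:
            polls += 1
            task = poll_fn()
            if task is not None:
                results.append(execute_fn(task))
            else:
                time.sleep(poll_interval)  # back off when no work is available
        return results
    return run
```

With 200 such threads each sleeping 5 seconds inside `execute()`, throughput is bounded by `threadCount / execution time`, which is why a burst of workflow instances drains slowly rather than instantly.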
@niwy This is probably related to the in-memory redis implementation library used for loading up the in-memory server. For production usage, you want to use Dynomite (for an HA configuration) or a Redis server rather than the in-memory database. I will try and see if I can fix the bug in the in-memory redis library, or use an alternate implementation to avoid this issue.
@v1r3n Thanks, I will try with Redis first.
I am running into a similar issue, but it seems to be the opposite. With an in-memory database, I am able to poll for tasks. However, when I use Redis/Dynomite, I receive status 204 when polling for a task. The output below is from a Python wrapper I have for the API calls.
In-memory database: (`java -jar build/libs/conductor-server-1.8.0-SNAPSHOT-all.jar`)

```
$ python conductor.py getRunningWorkflows corey_flow_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):
Received response from Conductor Server (localhost):
Status: 200
[]
```
```
$ python conductor.py createTaskMetadata '[
  {
    "name": "corey_task_1",
    "description": "corey_task_1",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  },
  {
    "name": "corey_task_2",
    "description": "corey_task_2",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  }
]'
Sending [POST] request to Conductor Server (http://localhost:8080/api/metadata/taskdefs):
Body:
[
  {
    "name": "corey_task_1",
    "description": "corey_task_1",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  },
  {
    "name": "corey_task_2",
    "description": "corey_task_2",
    "retryCount": 1,
    "timeoutSeconds": 0,
    "timeoutPolicy": "TIME_OUT_WF",
    "retryLogic": "FIXED",
    "retryDelaySeconds": 60,
    "responseTimeoutSeconds": 3600
  }
]
Received response from Conductor Server (localhost):
Status: 204
```
```
$ python conductor.py createWorkflowMetadata '{
  "name": "corey_flow_1",
  "description": "A Simple Corey Workflow with 2 tasks",
  "version": 1,
  "tasks": [
    {
      "name": "corey_task_1",
      "taskReferenceName": "corey_task_1",
      "type": "SIMPLE",
      "startDelay": 0
    },
    {
      "name": "corey_task_2",
      "taskReferenceName": "corey_task_2",
      "type": "SIMPLE",
      "startDelay": 0
    }
  ],
  "schemaVersion": 2
}'
Sending [POST] request to Conductor Server (http://localhost:8080/api/metadata/workflow):
Body:
{
  "name": "corey_flow_1",
  "description": "A Simple Corey Workflow with 2 tasks",
  "version": 1,
  "tasks": [
    {
      "name": "corey_task_1",
      "taskReferenceName": "corey_task_1",
      "type": "SIMPLE",
      "startDelay": 0
    },
    {
      "name": "corey_task_2",
      "taskReferenceName": "corey_task_2",
      "type": "SIMPLE",
      "startDelay": 0
    }
  ],
  "schemaVersion": 2
}
Received response from Conductor Server (localhost):
Status: 204
```
```
$ python conductor.py getRunningWorkflows corey_flow_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):
Received response from Conductor Server (localhost):
Status: 200
[]
$ python conductor.py startWorkflow corey_flow_1
Sending [POST] request to Conductor Server (http://localhost:8080/api/workflow/corey_flow_1?):
Body:
{}
Received response from Conductor Server (localhost):
Status: 200
2c7b52c6-1bb8-45a3-a852-7ab87e9d47ac
$ python conductor.py startWorkflow corey_flow_1
Sending [POST] request to Conductor Server (http://localhost:8080/api/workflow/corey_flow_1?):
Body:
{}
Received response from Conductor Server (localhost):
Status: 200
e8803cdc-fc99-474e-a7c3-c21fb47cbe76
$ python conductor.py getRunningWorkflows corey_flow_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):
Received response from Conductor Server (localhost):
Status: 200
["2c7b52c6-1bb8-45a3-a852-7ab87e9d47ac","e8803cdc-fc99-474e-a7c3-c21fb47cbe76"]
```
Then, when I poll for a task, I properly get output:
```
$ python conductor.py pollTask corey_task_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/tasks/poll/corey_task_1):
Received response from Conductor Server (localhost):
Status: 200
{"taskType":"corey_task_1","status":"IN_PROGRESS","referenceTaskName":"corey_task_1","retryCount":0,"seq":1,"pollCount":1,"taskDefName":"corey_task_1","scheduledTime":1490042357212,"startTime":1490042384801,"endTime":0,"updateTime":1490042384830,"startDelayInSeconds":0,"retried":false,"callbackFromWorker":true,"responseTimeoutSeconds":3600,"workflowInstanceId":"2c7b52c6-1bb8-45a3-a852-7ab87e9d47ac","taskId":"919f03f2-34e6-4931-92b1-00b3ac59d0c1","callbackAfterSeconds":0,"queueWaitTime":27589,"taskStatus":"IN_PROGRESS"}
```
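A 200 poll carries the scheduled task as JSON, while a 204 means the queue had nothing to hand out (it is not an error). A minimal sketch of how a wrapper like the one above might distinguish the two (the helper name is an illustration, not part of the Conductor API):

```python
import json

def parse_poll_response(status, body):
    """Interpret a task-poll result.

    200 with a body carries the task document; 204 (or an empty body)
    means no task was available on the queue. Anything else is
    unexpected and surfaced as an error.
    """
    if status == 204 or not body:
        return None
    if status == 200:
        return json.loads(body)
    raise RuntimeError(f"unexpected poll status {status}")
```

Treating 204 as "try again later" rather than a failure is what the worker loop relies on when the queue is empty.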
When I run the server with Dynomite: (`java -jar build/libs/conductor-server-1.8.0-SNAPSHOT-all.jar src/main/resources/server.properties`)

```
$ cat src/main/resources/server.properties
#
# Copyright 2017 Netflix, Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
db=dynomite

#Dynomite Cluster details.
#Dynomite cluster name
workflow.dynomite.cluster.name=dyn_o_mite

#format is host:port:rack separated by semicolon
#workflow.dynomite.cluster.hosts=host1:port:rack;host2:port:rack;host3:port:rack
workflow.dynomite.cluster.hosts=127.0.0.1:8102:us-east-1a
#;127.0.0.2:8102:us-east-1b;127.0.0.3:8102:us-east-1c

#namespace for the keys stored in Dynomite/Redis
workflow.namespace.prefix=conductor

#namespace prefix for the dyno queues
workflow.namespace.queue.prefix=conductor_queues

#no. of threads allocated to dyno-queues
queues.dynomite.threads=10

#non-quorum port used to connect to local redis. Used by dyno-queues
queues.dynomite.nonQuorum.port=6379

#Transport address to elasticsearch
#workflow.elasticsearch.url=localhost:7003

#Name of the elasticsearch cluster
#workflow.elasticsearch.index.name=wfe
```
On startup, since I already have workflows created in this database:
```
$ python conductor.py getRunningWorkflows corey_flow_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/workflow/running/corey_flow_1?version=1):
Received response from Conductor Server (localhost):
Status: 200
["9af452ed-10bd-479d-bdde-09f309428b96","cd313f1b-203f-45cd-b64e-0bf5a7d9ac90","8a730c84-65a1-48cd-b4e2-5e4c5e247a9f","0ebdb647-5066-484c-9fbf-103459344ea9","404a17ea-c7b3-4175-8c36-04ec85826a7e","cfcdd5af-beea-49f1-8468-23af144ee015","a277a3f8-5d54-44e9-8ca1-2fab45329d0b","6b91f1b9-474b-457e-b4c9-f3032779b43b","c1961de0-a36a-4aa2-aea9-74f73578bb08","aa805924-afeb-46c8-8e0a-99529aaba1b1","4eb67155-2d32-4fee-8991-81d2469391bc","52909d2a-7c69-4ca9-8c2c-254f1de6f03a","4b4861d5-45d3-4b13-85bf-f5554a35c3c1","319f9bf7-604d-476b-984c-a0883664c61d","80360587-a768-4c90-86a4-62757bc01e28","f0955cee-e3e8-4357-bd75-f1f42cb7adde","b8ee577d-48d1-4c56-b335-2fecd71a8552","3a590ccb-fbae-4335-8a4e-674909c13b85","1faf9c86-244f-4516-acf7-2f7e90e2f952","8efe7ad0-f702-4522-b3f6-cfcb23f1715c","a58f4067-75e3-4167-8a24-e0d05df5a86a","08845e9e-32ce-42cc-ba4d-d2b9b45b42cd"]
```
However, when I now try to poll for a task, I receive a 204:
```
$ python conductor.py pollTask corey_task_1
Sending [GET] request to Conductor Server (http://localhost:8080/api/tasks/poll/corey_task_1):
Received response from Conductor Server (localhost):
Status: 204
```
I am not sure if I am doing something wrong, but I expect the poll to work the same.
Thanks.
@cquon if you are running with Dynomite, can you also add the following line to the properties, to set the rack/availability zone of the conductor server to be the same as in the dynomite cluster config?

```
EC2_AVAILABILTY_ZONE=us-east-1a
```

I have updated the comments in the config file to reflect that. PS: formatted the files in your comment.
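Since the server only reads queues for its own rack, a quick sanity check is to compare the configured availability zone against the racks listed in `workflow.dynomite.cluster.hosts`. A sketch (the property names come from the server.properties above; the helper itself is illustrative):

```python
def rack_matches(properties, az):
    """True if the server's availability zone matches the rack of at
    least one dynomite host (host entries are host:port:rack,
    separated by ';', per the comment in server.properties)."""
    hosts = properties.get("workflow.dynomite.cluster.hosts", "")
    racks = {entry.split(":")[2] for entry in hosts.split(";") if entry.count(":") >= 2}
    return az in racks

# Example using the single-host config from the properties file above:
props = {"workflow.dynomite.cluster.hosts": "127.0.0.1:8102:us-east-1a"}
```

With this config, a server configured for `us-east-1a` would poll successfully, while one in any other zone would see only 204s.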
Thanks! That fixed the issue.
fixed typo :)

```
EC2_AVAILABILITY_ZONE=us-east-1a
```
@v1r3n Hi, regarding this issue I have one question: if we're using an AWS Redis cluster directly, do we need to make sure the rack/availability zone of the conductor server is the same as the Redis cluster's zone config? I'm asking because we found some workflows stuck in RUNNING status, with some of their tasks pending as IN_PROGRESS, and I am not sure whether this is caused by some of our conductor servers being in a different zone than the Redis cluster. In the meantime I checked the queue with `zrange conductor_queues.test.QUEUE._deciderQueue.ap-northeast-2c 0 -1`; all workflows in this queue seem to have the above issue. Please give any suggestion you can, thanks!
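Dyno-queues shards its queues by rack, which is why a zone mismatch leaves messages unread: a server in one zone never polls the queue keyed to another zone. Inferring the key layout from the `zrange` key quoted above (this format is an observation from that key, not a documented API):

```python
def decider_queue_key(queue_prefix, stack, az):
    """Build the per-zone decider queue key as it appears in Redis.

    Layout inferred from the key quoted in the comment above:
    <queue prefix>.<stack>.QUEUE._deciderQueue.<availability zone>.
    """
    return f"{queue_prefix}.{stack}.QUEUE._deciderQueue.{az}"
```

If the server's zone differs from the zone baked into this key, the decider queue fills but is never drained, matching the stuck RUNNING/IN_PROGRESS symptom described here.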
> @cquon if you are running using dynomite, can you also add the following line to the property to set the rack/availability zone of the conductor server to be same as dynomite cluster config?
> `EC2_AVAILABILTY_ZONE=us-east-1a`
> I have updated the comments in the config file to reflect that. PS: formatted the files in your comment.
This is arguably a big deal and should be mentioned in the docs, especially if one is using a standalone redis deployment.
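For a standalone deployment, one way to apply the fix from this thread is to export the zone before starting the server (the variable name and jar path are taken from the comments above; the value must match a rack in `workflow.dynomite.cluster.hosts`):

```shell
# Make the conductor server poll queues for the same rack/zone as dynomite.
export EC2_AVAILABILITY_ZONE=us-east-1a
java -jar build/libs/conductor-server-1.8.0-SNAPSHOT-all.jar src/main/resources/server.properties
```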
Deployment: SampleWorkers and the Server located on the same machine
OS: tried on Win 7 and Linux respectively
Hello,
In my test, I always start 100 instances of a workflow with 2 parallel tasks. The workers complete some tasks (the number changes every time) before getting stuck. Compared to the SampleWorker example, I only modified `pollInterval` and `threadCount`.
I checked the IN_PROGRESS tasks with Swagger, and they look fine to me. But a poll request got an empty response body with code 204. Here is an example:
Any idea for this problem? Thanks.