Netflix / conductor

Conductor is a microservices orchestration engine.
Apache License 2.0

Workflows stopping in random places with multi-node Conductor/Dynomite on AWS #432

Closed kpdude7 closed 6 years ago

kpdude7 commented 6 years ago

We currently have 3 Conductor instances backed by an ELB in AWS, one in each of 3 AZs. We also have 3 Dynomite instances per AZ, for a total of 9. When we test assets through our workflows, sometimes an asset will get stuck in a workflow, and when we view the workflow via the API, steps are just missing at the end. We also see periodic NoAvailableHostsExceptions in the Conductor logs:

```
Feb 21 16:43:36 ip-10-240-71-124 sh[17714]: %93005 [qtp1041451158-52] ERROR com.netflix.conductor.server.resources.GenericExceptionMapper - NoAvailableHostsException: [host=Host [hostname=UNKNOWN, ipAddress=UNKNOWN, port=0, rack: UNKNOWN, datacenter: UNKNOW, status: Down], latency=0(0), attempts=0]Token not found for key hash: 151863863
Feb 21 16:43:36 ip-10-240-71-124 sh[17714]: %com.netflix.dyno.connectionpool.exception.NoAvailableHostsException: NoAvailableHostsException: [host=Host [hostname=UNKNOWN, ipAddress=UNKNOWN, port=0, rack: UNKNOWN, datacenter: UNKNOW, status: Down], latency=0(0), attempts=0]Token not found for key hash: 151863863
Feb 21 16:43:36 ip-10-240-71-124 sh[17714]: at com.netflix.dyno.connectionpool.impl.hash.BinarySearchTokenMapper.getToken(BinarySearchTokenMapper.java:68)
```

We are running Conductor 1.8.2, Dynomite dynomite-v0.5.9-5_MuslCompatibility, and Redis 3.2.10. Below are representative conductor.yml and dynomite.yml files:

conductor.yml:

```
db=dynomite
workflow.dynomite.cluster.hosts=10.240.71.13:8102:us-east-1a;10.240.71.32:8102:us-east-1a;10.240.71.21:8102:us-east-1a;10.240.71.91:8102:us-east-1c;10.240.71.99:8102:us-east-1c;10.240.71.109:8102:us-east-1c;10.240.71.138:8102:us-east-1d;10.240.71.179:8102:us-east-1d;10.240.71.159:8102:us-east-1d
workflow.dynomite.cluster.name=dynomite_cluster_sit
workflow.namespace.prefix=conductor
workflow.namespace.queue.prefix=conductor_queues_sit
queues.dynomite.threads=100
queues.dynomite.nonQuorum.port=22122
workflow.elasticsearch.url=10.240.71.30:9300
workflow.elasticsearch.index.name=conductor
server.connection-timeout=60000
workflow.system.task.worker.poll.count=25
workflow.system.task.worker.thread.count=25
logging.level.com.netflix.conductor=INFO
EC2_AVAILABILITY_ZONE=us-east-1d
```

dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

During one of these occurrences, I noticed that of our 9 dynomites, one of them (10.240.71.16) was missing one of the dyn_seeds (10.240.71.70); when I looked at the log for 16, I saw:

```
[2018-02-20 19:53:46.622] dnode_accept:168 Accepting client connection from 10.240.71.70:62472 on sd 22
[2018-02-20 19:53:46.622] event_add_conn:207 adding conn <LOCAL_PEER_CLIENT 0x1527ba0 22 from '10.240.71.70:62472'> to active
[2018-02-20 19:53:46.622] dnode_accept:210 <PEER_PROXY 0x14c74d0 12 listening on '0.0.0.0:8101'> accepted <LOCAL_PEER_CLIENT 0x1527ba0 22 from '10.240.71.70:62472'>
[2018-02-20 19:53:46.622] event_del_out:181 removing conn <LOCAL_PEER_CLIENT 0x1527ba0 22 from '10.240.71.70:62472'> from active
[2018-02-20 20:04:32.876] dnode_accept:168 Accepting client connection from 10.240.71.70:63703 on sd 22
[2018-02-20 20:04:32.876] event_add_conn:207 adding conn <LOCAL_PEER_CLIENT 0x888ef0 22 from '10.240.71.70:63703'> to active
[2018-02-20 20:04:32.876] dnode_accept:210 <PEER_PROXY 0x8874d0 12 listening on '0.0.0.0:8101'> accepted <LOCAL_PEER_CLIENT 0x888ef0 22 from '10.240.71.70:63703'>
[2018-02-20 20:04:32.876] event_del_out:181 removing conn <LOCAL_PEER_CLIENT 0x888ef0 22 from '10.240.71.70:63703'> from active
```

I'm not sure why these connections keep dropping, but this dynomite instance never re-accepted the connection from 70.

We have tried everything under the sun to address this, from increasing mbuf_size to decreasing the payload passed from one workflow task to the next, and we still see the issue.

kpdude7 commented 6 years ago

I am also posting this issue to the Dynomite forum.

v1r3n commented 6 years ago

@ipapapa

kpdude7 commented 6 years ago

More info: All of our Dynomite instances are running on m3.medium instance types in AWS. We are running 30 assets simultaneously through a particular workflow (below) as a load test; 29 of them complete successfully, while 1 gets "stuck" because tasks are missing in the dynoqueues. This behavior is intermittent and doesn't appear to be directly related to load (number of assets being tested) or payload (amount of data passed in between tasks); we have seen it happen with as few as 8 assets being tested.

workflow def: { "name": "c3", "description": "<test>", "version": 1, "tasks": [ { "name": "config_task", "taskReferenceName": "config", "type": "SIMPLE", "inputParameters": { "accountId": "${workflow.input.accountNumber}" } },{ "name": "c3_asset_data_task", "taskReferenceName": "c3_asset_data_1", "type": "SIMPLE", "inputParameters": { "taskMethod": "recording_complete", "assetState": "C3_RECORDING_COMPLETE", "configs": "${config.output.configs}", "accountId": "${workflow.input.accountNumber}", "primaryFileName": "${workflow.input.primaryFileName}", "secondaryFileName": "${workflow.input.secondaryFileName}", "lowResFileName": "${workflow.input.lowResFileName}", "secLowResFileName": "${workflow.input.secLowResFileName}", "primaryAlternateFileName": "${workflow.input.primaryAlternateFileName}", "secondaryAlternateFileName": "${workflow.input.secondaryAlternateFileName}", "primaryFileTimestamp": "${workflow.input.primaryFileTimestamp}", "secondaryFileTimestamp": "${workflow.input.secondaryFileTimestamp}", "primaryAlternateFileTimestamp": "${workflow.input.primaryAlternateFileTimestamp}", "secondaryAlternateFileTimestamp": "${workflow.input.secondaryAlternateFileTimestamp}" } },{ "name": "content_based_decision_task", "taskReferenceName": "content_based_decision_task_1", "inputParameters": { "contentBasedAction": "${c3_asset_data_1.output.contentBasedAction}" }, "type": "DECISION", "caseValueParam": "contentBasedAction", "decisionCases": { "continue": [ { "name": "notification_task", "taskReferenceName": "notification_1", "type": "SIMPLE", "inputParameters": { "taskMethod": "capture_complete", "accountId": "${workflow.input.accountNumber}", "adiAssetId": "${c3_asset_data_1.output.adiAssetId}", "adiProvider": "${c3_asset_data_1.output.adiProvider}", "adiTitle": "${c3_asset_data_1.output.adiTitle}", "adiDescription": "${c3_asset_data_1.output.adiDescription}", "adiHouseId": "${c3_asset_data_1.output.adiHouseId}", "message": "${c3_asset_data_1.output.message}", "eventDate": "${c3_asset_data_1.output.eventDate}", "notification_provider": "C3" } },{ "name": "video_editing_task", "taskReferenceName": "video_edit_1", "type": "SIMPLE", "inputParameters": { "taskMethod": "iframeAndScte", "configs": "${config.output.configs}", "sourceFilePathName": "${c3_asset_data_1.output.sourceFilePath}", "accountId": "${workflow.input.accountNumber}", "assetUID": "${c3_asset_data_1.output.assetUID}", "providerId": "${c3_asset_data_1.output.adiProvider}", "nextWorkflow": "c3ScteComplete" } } ], "fail": [ { "name": "sub_workflow_task", "taskReferenceName": "c3Failed_2", "inputParameters": { "assetUID": "${c3_asset_data_1.output.assetUID}" }, "type": "SUB_WORKFLOW", "subWorkflowParam": { "name": "c3Failed", "version": 1 } } ], "generateLowRes": [ { "name": "notification_task", "taskReferenceName": "notification_2", "type": "SIMPLE", "inputParameters": { "taskMethod": "capture_complete", "accountId": "${workflow.input.accountNumber}", "adiAssetId": "${c3_asset_data_1.output.adiAssetId}", "adiProvider": "${c3_asset_data_1.output.adiProvider}", "adiTitle": "${c3_asset_data_1.output.adiTitle}", "adiDescription": "${c3_asset_data_1.output.adiDescription}", "adiHouseId": "${c3_asset_data_1.output.adiHouseId}", "message": "${c3_asset_data_1.output.message}", "eventDate": "${c3_asset_data_1.output.eventDate}", "notification_provider": "C3" } },{ "name": "video_editing_task", "taskReferenceName": "video_edit_5", "type": "SIMPLE", "inputParameters": { "taskMethod": "iframeAndScte", "configs": "${config.output.configs}", 
"sourceFilePathName": "${c3_asset_data_1.output.sourceFilePath}", "accountId": "${workflow.input.accountNumber}", "assetUID": "${c3_asset_data_1.output.assetUID}", "providerId": "${c3_asset_data_1.output.adiProvider}", "nextWorkflow": "c3ScteCompleteLowRes" } } ], "halt": [] } } ], "schemaVersion": 2 } There are 3 keys that get written to redis for each workflow. Following are the values for a successful workflow completion: conductor.test.WORKFLOW.902589fe-b3a9-4d6c-a147-c864588c174a "{\"createTime\":1519917673928,\"updateTime\":1519917813748,\"status\":\"COMPLETED\",\"endTime\":1519917813748,\"workflowId\":\"902589fe-b3a9-4d6c-a147-c864588c174a\",\"input\":{\"md5HiResSecondary\":\"\",\"md5HiRes\":\"\",\"primaryFileName\":\"s3://atl-ops-sit-cf-c3/primary/SCIH0016358900200000_LT29.mpg\",\"accountNumber\":\"720\",\"md5LowRes\":\"\",\"lowResFileName\":\"s3://atl-ops-sit-cf-c3/primary/SCIH0016358900200000_LT29.mp4\"},\"output\":{\"s3ResultLoc\":\"s3://atl-ops-sit-cf-c3/primary/SCIH0016358900200000_LT29.mpg\"},\"workflowType\":\"c3\",\"version\":1,\"schemaVersion\":2,\"startTime\":1519917673928}"

conductor.test.WORKFLOW_TO_TASKS.902589fe-b3a9-4d6c-a147-c864588c174a 1) "a83f2b54-e87b-4bb8-9f90-0426c92d98f5" 2) "d9fccc23-aadc-4a75-943a-3dd3a5bf6ff7" 3) "bd271d67-0eb0-445a-ad2d-553627d9dd41" 4) "60a37f2c-c374-4235-8b9d-9b8fd5fe55b2" 5) "e819e738-b871-46e5-9bd6-52b5de160a0a"

conductor.test.SCHEDULED_TASKS.902589fe-b3a9-4d6c-a147-c864588c174a 1) "config0" 2) "60a37f2c-c374-4235-8b9d-9b8fd5fe55b2" 3) "c3_asset_data_10" 4) "bd271d67-0eb0-445a-ad2d-553627d9dd41" 5) "content_based_decision_task_10" 6) "d9fccc23-aadc-4a75-943a-3dd3a5bf6ff7" 7) "notification_10" 8) "a83f2b54-e87b-4bb8-9f90-0426c92d98f5" 9) "video_edit_10" 10) "e819e738-b871-46e5-9bd6-52b5de160a0a"

Following are the values for an unsuccessful completion:

conductor.test.WORKFLOW.2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642 {\"createTime\":1519917674114,\"updateTime\":1519917760307,\"status\":\"RUNNING\",\"endTime\":0,\"workflowId\":\"2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642\",

**conductor.test.WORKFLOW_TO_TASKS.2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642** 1) "f7c797a0-27d9-4b0a-98e1-0d9790408b8f" 2) "f9d84115-6da8-4a6a-9d85-c1f15295acc9"

**conductor.test.SCHEDULED_TASKS.2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642** 1) "config0" 2) "f9d84115-6da8-4a6a-9d85-c1f15295acc9" 3) "c3_asset_data_10" 4) "f7c797a0-27d9-4b0a-98e1-0d9790408b8f"

Something happened at the point where the content_based_decision_task queue was being populated for this workflow; we have been unable to determine exactly what. Help is greatly appreciated here, as we are preparing to go to pre-production with this code. TIA.

ipapapa commented 6 years ago

Your Dynomite topology seems to be off, which is causing the Dyno client to have issues sending data. Use the cluster_describe REST call to get a view of what Dynomite sees at runtime versus what you have provided in the YAML.

kpdude7 commented 6 years ago

@ipapapa Below are all 9 of our dynomite.yml files, along with the corresponding output of curl http://localhost:22122/cluster_describe on the same box (we're using 22122 instead of 22222). We have examined this doc and don't see any discrepancy between each yml and its cluster_describe output; can you give us guidance on what exactly may be wrong?

IP 10.240.71.13 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1a", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.32 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1a", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.21 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1a", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 3333333333 }, { "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.91 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1c
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1c", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.109 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1c
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1c", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 3333333333 }, { "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.99 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1c
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1c", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.159 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1d
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1d", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 3333333333 }, { "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.138 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1d
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1d", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }] }] }

IP 10.240.71.179 dynomite.yml:

```
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1d
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```

http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1d", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }] }] }

sungsk commented 6 years ago

Hi @ipapapa, I work with the OP and would also like to append additional stack traces, to hopefully get better guidance on what's happening.

Conductor_NoAvailableHostsException_Stacktrace.txt Dynomite_Stacktrace.txt

saidatta commented 6 years ago

I am encountering a similar error with 2 nodes.

ipapapa commented 6 years ago

@dev-sungk Looking into the Dynomite stack trace, somebody is sending HTTP requests to the Dynomite/Redis port, hence the Redis parser in Dynomite cannot deserialize that request.

```
[2018-03-02 19:52:12.792] redis_parse_req:1836 parsed bad req 1171817 res 1 type 0 state 0
00000000  47 45 54 20 2f 76 65 72  73 69 6f 6e 20 48 54 54   |GET /version HTT|
00000010  50 2f 31 2e 31 0d 0a 55  73 65 72 2d 41 67 65 6e   |P/1.1..User-Agen|
00000020  74 3a 20 63 75 72 6c 2f  37 2e 33 38 2e 30 0d 0a   |t: curl/7.38.0..|
00000030  48 6f 73 74 3a 20 6c 6f  63 61 6c 68 6f 73 74 3a   |Host: localhost:|
00000040  38 31 30 32 0d 0a 41 63  63 65 70 74 3a 20 2a 2f   |8102..Accept: */|
00000050  2a 0d 0a 0d 0a                                     |*....|
```

and


```
00000000  47 45 54 20 2f 52 45 53  54 2f 76 31 2f 61 64 6d   |GET /REST/v1/adm|
00000010  69 6e 2f 73 74 61 74 75  73 20 48 54 54 50 2f 31   |in/status HTTP/1|
00000020  2e 31 0d 0a 55 73 65 72  2d 41 67 65 6e 74 3a 20   |.1..User-Agent: |
00000030  63 75 72 6c 2f 37 2e 33  38 2e 30 0d 0a 48 6f 73   |curl/7.38.0..Hos|
00000040  74 3a 20 6c 6f 63 61 6c  68 6f 73 74 3a 38 31 30   |t: localhost:810|
00000050  32 0d 0a 41 63 63 65 70  74 3a 20 2a 2f 2a 0d 0a   |2..Accept: */*..|
00000060  0d 0a                                              |..|
```

kpdude7 commented 6 years ago

@ipapapa We have a script that runs periodically to check Dynomite health with a curl command (curl -sX GET http://{dynomiteHostIP}:22122/ping). This script lives on all 9 Dynomite instances and pings the other 8. If it detects that a node is down based on the response, it rebuilds the seeds in the .yml file and restarts that particular Dynomite instance (in other words, it does some of the work of Dynomite Manager because we're not using DM). Are you implying that this simple ping is what's causing our issue? Should we be using another port? Or is pinging dynomite periodically in this way not a good idea? @dev-sungk

saidatta commented 6 years ago

Do we still need to provide the token map as described here? Possibly related: https://github.com/Netflix/dyno/issues/47

kpdude7 commented 6 years ago

@v1r3n @ipapapa @dev-sungk Many thanks to @saidatta for his previous comment; the issue was indeed related to the above Dyno issue. In the current release of Conductor (v1.8.1, and I've verified it's the same for all pre-releases since), on line 131 of ConductorServer.java, a call to ConnectionPoolConfigurationImpl.withTokenSupplier is made with a new TokenMapSupplier whose token map contains only one element: a HostToken with value 1L. This completely ignores the Dynomite topology the user has set up, so if a key hash happens to fall in the range covered by token value 1, you're in luck; if not, you get NoAvailableHostsException and your workflows start missing tasks in Redis (which is exactly what we have been experiencing).

In order to fix this, we had to modify 2 classes in the Dyno project, AbstractTokenMapSupplier and ConnectionPoolConfigurationImpl, to ignore the passed-in TokenMapSupplier, explode the conductor-server-all.jar, and replace these 2 classes. I have attached the source files for reference; below are the general changes I had to make:

AbstractTokenMapSupplier:

  1. Change the visibility of List parseTokenListFromJson to public so it can be overridden (@Override) in ConnectionPoolConfigurationImpl
  2. Change the signature of the above method to take a String hostAddress (more on this below)

ConnectionPoolConfigurationImpl:

  1. Modify the withTokenSupplier method to ignore the passed-in TokenMapSupplier:
     a. Assign the tokenSupplier to be a new HttpEndpointBasedTokenMapSupplier, since that class already supports the mechanism to get the Dynomite topology.
     b. Override the parseTokenListFromJson method from the abstract class, since the JSON returned by cluster_describe (at least from Dynomite v0.5.9-5_MuslCompatibility, which is what we're using since it's the latest "release") is totally different from the format implied by the test class.
     c. Replace the host '0.0.0.0' returned from the cluster_describe API with the new hostAddress input parameter (since HttpEndpointBasedTokenMapSupplier picks a random Dynomite host from conductor.yml to make the cluster_describe call, and you have to be able to map the host that you're "on" to a token).
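
To make the above concrete, here is roughly what the problematic wiring looks like. This is a sketch reconstructed from the behaviour described in this thread, not the verbatim ConductorServer.java source; the Dyno types (TokenMapSupplier, HostToken, ConnectionPoolConfigurationImpl, Host) are used as I understand their API, so treat the details as approximate:

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;

import com.netflix.dyno.connectionpool.Host;
import com.netflix.dyno.connectionpool.TokenMapSupplier;
import com.netflix.dyno.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.dyno.connectionpool.impl.lb.HostToken;

public class SingleTokenSupplierSketch {

    // Every key hash is resolved against a token map with a single entry (token 1L),
    // no matter how many Dynomite nodes or racks exist. Hashes that don't land in that
    // token's range surface as NoAvailableHostsException: "Token not found for key hash: ...".
    static ConnectionPoolConfigurationImpl configure(String clusterName, Host anyDynoHost) {
        final HostToken singleToken = new HostToken(1L, anyDynoHost);

        TokenMapSupplier supplier = new TokenMapSupplier() {
            @Override
            public List<HostToken> getTokens(Set<Host> activeHosts) {
                return Collections.singletonList(singleToken);
            }

            @Override
            public HostToken getTokenForHost(Host host, Set<Host> activeHosts) {
                return singleToken;
            }
        };

        // The real topology (9 nodes across 3 racks, with tokens 1111111111 /
        // 2222222222 / 3333333333 per rack) is never consulted.
        return new ConnectionPoolConfigurationImpl(clusterName).withTokenSupplier(supplier);
    }
}
```

And a purely illustrative sketch of the kind of cluster_describe parsing described in 1(b) and 1(c), using Jackson; the method name and the host-to-token map shape are hypothetical, and the actual changes are in the attached Archive.zip:

```java
import java.util.LinkedHashMap;
import java.util.Map;

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class ClusterDescribeParserSketch {

    // Returns host address -> token, e.g. {"10.240.71.13" -> 1111111111, ...}.
    // The "0.0.0.0" entry (the node that answered the cluster_describe call) is
    // replaced with the address we actually queried, so that node maps to a token too.
    static Map<String, Long> parse(String clusterDescribeJson, String queriedHostAddress) throws Exception {
        JsonNode root = new ObjectMapper().readTree(clusterDescribeJson);
        Map<String, Long> tokensByHost = new LinkedHashMap<>();
        for (JsonNode dc : root.path("dcs")) {
            for (JsonNode rack : dc.path("racks")) {
                for (JsonNode server : rack.path("servers")) {
                    String host = server.path("host").asText();
                    if ("0.0.0.0".equals(host)) {
                        host = queriedHostAddress;
                    }
                    tokensByHost.put(host, server.path("token").asLong());
                }
            }
        }
        return tokensByHost;
    }
}
```

In the real override, these host/token pairs are then wrapped into Dyno HostTokens for the hosts the client knows about.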

To be honest, we are quite frustrated that we are unable to simply implement the latest releases of Conductor and Dynomite out of the box and have them work together without modifying your source code; we have wasted a lot of time and effort trying to track this down. Nowhere in the documentation for either product is this requirement to override source code addressed. We are also disappointed that we did not receive any guidance on this issue; if not for saidatta's comment we may never have happened upon the solution.

Archive.zip

These changes need to be incorporated into the base product somehow (my changes as-is break some of your unit tests). I am leaving this issue open in case anyone else has any other comments or recommends a different way of accomplishing what we have done.

ipapapa commented 6 years ago

@kpdude7 I will have to read your latest post, but the error in the previous post seemed to be related to REST calls going to the Redis parser. It may well be that an incorrect port is being used. The error was pretty obvious in the logs.

We were explicit when we published Dynomite-manager that it is what we use to get the cluster description. There have been a few differences between what Dynomite provides in cluster_describe and what Dynomite-manager provides. @shailesh33 has done some of the work to make it fairly compatible, but we never moved it to production, so it is true there might be a few discrepancies (and that is probably the reason there is no clear documentation on it). Your feedback is therefore very valuable, and it would be nice to file a PR to fix the issues or provide recommendations on what the documentation should include. Other users may have a similar issue.

saidatta commented 6 years ago

@kpdude7 After implementing the fix, out of curiosity, have you experienced this error while starting a dynamic fork within a workflow?

214322 [qtp557705922-32] ERROR com.netflix.conductor.server.resources.GenericExceptionMapper  - com.fasterxml.jackson.core.JsonStreamContext.<init>(II)V
java.lang.NoSuchMethodError: com.fasterxml.jackson.core.JsonStreamContext.<init>(II)V
    at com.fasterxml.jackson.databind.util.TokenBufferReadContext.<init>(TokenBufferReadContext.java:59)
    at com.fasterxml.jackson.databind.util.TokenBufferReadContext.createRootContext(TokenBufferReadContext.java:89)
    at com.fasterxml.jackson.databind.util.TokenBuffer$Parser.<init>(TokenBuffer.java:1298)
    at com.fasterxml.jackson.databind.util.TokenBuffer.asParser(TokenBuffer.java:276)
    at com.fasterxml.jackson.databind.util.TokenBuffer.asParser(TokenBuffer.java:242)
    at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:3719)
    at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:3666)
    at com.netflix.conductor.core.execution.DeciderService.getDynamicTasks(DeciderService.java:574)
    at com.netflix.conductor.core.execution.DeciderService.getTasksToBeScheduled(DeciderService.java:491)
    at com.netflix.conductor.core.execution.DeciderService.getTasksToBeScheduled(DeciderService.java:406)
    at com.netflix.conductor.core.execution.DeciderService.getNextTask(DeciderService.java:281)
    at com.netflix.conductor.core.execution.DeciderService.decide(DeciderService.java:156)
    at com.netflix.conductor.core.execution.DeciderService.decide(DeciderService.java:92)
    at com.netflix.conductor.core.execution.WorkflowExecutor.decide(WorkflowExecutor.java:513)
    at com.netflix.conductor.core.execution.WorkflowExecutor.updateTask(WorkflowExecutor.java:472)
    at com.netflix.conductor.service.ExecutionService.updateTask(ExecutionService.java:162)
    at com.netflix.conductor.server.resources.TaskResource.updateTask(TaskResource.java:129)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
    at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
    at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
    at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
    at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
    at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
    at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
    at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
    at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
    at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
    at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286)
    at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276)
    at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181)
    at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
    at com.netflix.conductor.server.JerseyModule$1.doFilter(JerseyModule.java:99)
    at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
    at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
    at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1174)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1106)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
    at org.eclipse.jetty.server.Server.handle(Server.java:524)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
    at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
    at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
    at java.lang.Thread.run(Thread.java:748)

saidatta commented 6 years ago

@kpdude7 Ignore my previous message; it was caused by a dependency conflict on my end. It got resolved after I moved to the latest version.

saidatta commented 6 years ago

@kpdude7 BTW, I believe you didn't have to explode the jar file and add the tokens. Just take the implemented TokenMapSupplier class and put it in place of the default one in ConductorServer.

This gives you more freedom. For example, I hooked the configurable tokens up to system environment variables, so you only have to change those when running the compiled jar, without going through the rebuild process again.
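
For illustration, a minimal sketch of an environment-driven supplier along these lines. The DYNO_HOST_TOKENS variable name and its host:token,host:token format are invented for this example, and it assumes Dyno's Host exposes getHostName(); the actual class I used may differ:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

import com.netflix.dyno.connectionpool.Host;
import com.netflix.dyno.connectionpool.TokenMapSupplier;
import com.netflix.dyno.connectionpool.impl.lb.HostToken;

public class EnvTokenMapSupplier implements TokenMapSupplier {

    private final Map<String, Long> tokensByHost = new HashMap<>();

    public EnvTokenMapSupplier() {
        // e.g. DYNO_HOST_TOKENS=10.240.71.13:1111111111,10.240.71.32:2222222222,...
        String raw = System.getenv("DYNO_HOST_TOKENS");
        if (raw != null) {
            for (String pair : raw.split(",")) {
                String[] parts = pair.split(":");
                tokensByHost.put(parts[0].trim(), Long.parseLong(parts[1].trim()));
            }
        }
    }

    @Override
    public List<HostToken> getTokens(Set<Host> activeHosts) {
        // Map each active host to the token configured for it in the environment.
        List<HostToken> tokens = new ArrayList<>();
        for (Host host : activeHosts) {
            Long token = tokensByHost.get(host.getHostName());
            if (token != null) {
                tokens.add(new HostToken(token, host));
            }
        }
        return tokens;
    }

    @Override
    public HostToken getTokenForHost(Host host, Set<Host> activeHosts) {
        Long token = tokensByHost.get(host.getHostName());
        return token == null ? null : new HostToken(token, host);
    }
}
```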

All that aside, I assume @ipapapa's suggestion was to let "dynomite-manager" handle these changes.