I am also posting this issue to the Dynomite forum.
@ipapapa
More info: All of our Dynomite instances are running on m3.medium instance types in AWS. We are running 30 assets simultaneously through a particular workflow (below) as a load test; 29 of them complete successfully, while 1 gets "stuck" because tasks are missing in the dynoqueues. This behavior is intermittent and doesn't appear to be directly related to load (number of assets being tested) or payload (amount of data passed in between tasks); we have seen it happen with as few as 8 assets being tested.
workflow def:
```json
{ "name": "c3", "description": "<test>", "version": 1, "tasks": [ { "name": "config_task", "taskReferenceName": "config", "type": "SIMPLE", "inputParameters": { "accountId": "${workflow.input.accountNumber}" } },{ "name": "c3_asset_data_task", "taskReferenceName": "c3_asset_data_1", "type": "SIMPLE", "inputParameters": { "taskMethod": "recording_complete", "assetState": "C3_RECORDING_COMPLETE", "configs": "${config.output.configs}", "accountId": "${workflow.input.accountNumber}", "primaryFileName": "${workflow.input.primaryFileName}", "secondaryFileName": "${workflow.input.secondaryFileName}", "lowResFileName": "${workflow.input.lowResFileName}", "secLowResFileName": "${workflow.input.secLowResFileName}", "primaryAlternateFileName": "${workflow.input.primaryAlternateFileName}", "secondaryAlternateFileName": "${workflow.input.secondaryAlternateFileName}", "primaryFileTimestamp": "${workflow.input.primaryFileTimestamp}", "secondaryFileTimestamp": "${workflow.input.secondaryFileTimestamp}", "primaryAlternateFileTimestamp": "${workflow.input.primaryAlternateFileTimestamp}", "secondaryAlternateFileTimestamp": "${workflow.input.secondaryAlternateFileTimestamp}" } },{ "name": "content_based_decision_task", "taskReferenceName": "content_based_decision_task_1", "inputParameters": { "contentBasedAction": "${c3_asset_data_1.output.contentBasedAction}" }, "type": "DECISION", "caseValueParam": "contentBasedAction", "decisionCases": { "continue": [ { "name": "notification_task", "taskReferenceName": "notification_1", "type": "SIMPLE", "inputParameters": { "taskMethod": "capture_complete", "accountId": "${workflow.input.accountNumber}", "adiAssetId": "${c3_asset_data_1.output.adiAssetId}", "adiProvider": "${c3_asset_data_1.output.adiProvider}", "adiTitle": "${c3_asset_data_1.output.adiTitle}", "adiDescription": "${c3_asset_data_1.output.adiDescription}", "adiHouseId": "${c3_asset_data_1.output.adiHouseId}", "message": "${c3_asset_data_1.output.message}", "eventDate": "${c3_asset_data_1.output.eventDate}", "notification_provider": "C3" } },{ "name": "video_editing_task", "taskReferenceName": "video_edit_1", "type": "SIMPLE", "inputParameters": { "taskMethod": "iframeAndScte", "configs": "${config.output.configs}", "sourceFilePathName": "${c3_asset_data_1.output.sourceFilePath}", "accountId": "${workflow.input.accountNumber}", "assetUID": "${c3_asset_data_1.output.assetUID}", "providerId": "${c3_asset_data_1.output.adiProvider}", "nextWorkflow": "c3ScteComplete" } } ], "fail": [ { "name": "sub_workflow_task", "taskReferenceName": "c3Failed_2", "inputParameters": { "assetUID": "${c3_asset_data_1.output.assetUID}" }, "type": "SUB_WORKFLOW", "subWorkflowParam": { "name": "c3Failed", "version": 1 } } ], "generateLowRes": [ { "name": "notification_task", "taskReferenceName": "notification_2", "type": "SIMPLE", "inputParameters": { "taskMethod": "capture_complete", "accountId": "${workflow.input.accountNumber}", "adiAssetId": "${c3_asset_data_1.output.adiAssetId}", "adiProvider": "${c3_asset_data_1.output.adiProvider}", "adiTitle": "${c3_asset_data_1.output.adiTitle}", "adiDescription": "${c3_asset_data_1.output.adiDescription}", "adiHouseId": "${c3_asset_data_1.output.adiHouseId}", "message": "${c3_asset_data_1.output.message}", "eventDate": "${c3_asset_data_1.output.eventDate}", "notification_provider": "C3" } },{ "name": "video_editing_task", "taskReferenceName": "video_edit_5", "type": "SIMPLE", "inputParameters": { "taskMethod": "iframeAndScte", "configs": "${config.output.configs}", "sourceFilePathName": 
"${c3_asset_data_1.output.sourceFilePath}", "accountId": "${workflow.input.accountNumber}", "assetUID": "${c3_asset_data_1.output.assetUID}", "providerId": "${c3_asset_data_1.output.adiProvider}", "nextWorkflow": "c3ScteCompleteLowRes" } } ], "halt": [] } } ], "schemaVersion": 2 }
There are 3 keys that get written to redis for each workflow. Following are the values for a successful workflow completion:
```
conductor.test.WORKFLOW.902589fe-b3a9-4d6c-a147-c864588c174a
"{\"createTime\":1519917673928,\"updateTime\":1519917813748,\"status\":\"COMPLETED\",\"endTime\":1519917813748,\"workflowId\":\"902589fe-b3a9-4d6c-a147-c864588c174a\",\"input\":{\"md5HiResSecondary\":\"\",\"md5HiRes\":\"\",\"primaryFileName\":\"s3://atl-ops-sit-cf-c3/primary/SCIH0016358900200000_LT29.mpg\",\"accountNumber\":\"720\",\"md5LowRes\":\"\",\"lowResFileName\":\"s3://atl-ops-sit-cf-c3/primary/SCIH0016358900200000_LT29.mp4\"},\"output\":{\"s3ResultLoc\":\"s3://atl-ops-sit-cf-c3/primary/SCIH0016358900200000_LT29.mpg\"},\"workflowType\":\"c3\",\"version\":1,\"schemaVersion\":2,\"startTime\":1519917673928}"

conductor.test.WORKFLOW_TO_TASKS.902589fe-b3a9-4d6c-a147-c864588c174a
1) "a83f2b54-e87b-4bb8-9f90-0426c92d98f5"
2) "d9fccc23-aadc-4a75-943a-3dd3a5bf6ff7"
3) "bd271d67-0eb0-445a-ad2d-553627d9dd41"
4) "60a37f2c-c374-4235-8b9d-9b8fd5fe55b2"
5) "e819e738-b871-46e5-9bd6-52b5de160a0a"

conductor.test.SCHEDULED_TASKS.902589fe-b3a9-4d6c-a147-c864588c174a
1) "config0"
2) "60a37f2c-c374-4235-8b9d-9b8fd5fe55b2"
3) "c3_asset_data_10"
4) "bd271d67-0eb0-445a-ad2d-553627d9dd41"
5) "content_based_decision_task_10"
6) "d9fccc23-aadc-4a75-943a-3dd3a5bf6ff7"
7) "notification_10"
8) "a83f2b54-e87b-4bb8-9f90-0426c92d98f5"
9) "video_edit_10"
10) "e819e738-b871-46e5-9bd6-52b5de160a0a"
```
Following are the values for an unsuccessful completion:
```
conductor.test.WORKFLOW.2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642
{\"createTime\":1519917674114,\"updateTime\":1519917760307,\"status\":\"RUNNING\",\"endTime\":0,\"workflowId\":\"2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642\",
```
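For anyone else digging into a stuck workflow, here is a minimal sketch of reading those same three keys programmatically. This is only an illustration: it uses Jedis (since Dynomite speaks the Redis protocol on its client port, 8102 in our setup), and the key types are inferred from the redis-cli output above.

```java
// Sketch only: read the three per-workflow keys for the stuck workflow id shown above.
// Host/port are one of our Dynomite client endpoints; key types inferred from the dumps above.
import java.util.Map;
import java.util.Set;

import redis.clients.jedis.Jedis;

public class InspectWorkflowKeys {
    public static void main(String[] args) {
        String prefix = "conductor.test.";
        String workflowId = "2f7ca9c5-ec40-49f3-90b9-4b3aa72dd642"; // the stuck workflow above

        try (Jedis jedis = new Jedis("10.240.71.13", 8102)) {
            // WORKFLOW.<id> is a plain string holding the workflow JSON
            System.out.println(jedis.get(prefix + "WORKFLOW." + workflowId));

            // WORKFLOW_TO_TASKS.<id> is a set of task ids
            Set<String> taskIds = jedis.smembers(prefix + "WORKFLOW_TO_TASKS." + workflowId);
            System.out.println("task ids: " + taskIds);

            // SCHEDULED_TASKS.<id> is a hash of task reference name -> task id
            Map<String, String> scheduled = jedis.hgetAll(prefix + "SCHEDULED_TASKS." + workflowId);
            System.out.println("scheduled: " + scheduled);
        }
    }
}
```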
Your Dynomite topology seems to be off, which is causing the Dyno client to have an issue with sending the data. Use the cluster_describe REST call to get a view of what Dynomite sees at runtime versus what you have provided in the YAML.
@ipapapa Below is a full list of each of our 9 dynomite.yml files, along with the corresponding output of curl http://localhost:22122/cluster_describe on the same box (we're using 22122 instead of 22222). We have examined this doc and don't see any discrepancy between each yml and each cluster_describe; can you give us guidance on what exactly may be wrong?
IP 10.240.71.13 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1a", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.32 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1a", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.21 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1a", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 3333333333 }, { "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.91 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1c
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1c", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.109 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1c
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1c", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 3333333333 }, { "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.99 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1c
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1c", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1d", "servers": [{ "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.159 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1d
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1d", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 3333333333 }, { "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.138 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1d
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1d", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.179", "host": "10.240.71.179", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }] }] }
IP 10.240.71.179 dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1d
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
http://localhost:22122/cluster_describe: { "dcs": [{ "name": "us-east-1", "racks": [{ "name": "us-east-1d", "servers": [{ "name": "0.0.0.0", "host": "0.0.0.0", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.138", "host": "10.240.71.138", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.159", "host": "10.240.71.159", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1a", "servers": [{ "name": "10.240.71.13", "host": "10.240.71.13", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.32", "host": "10.240.71.32", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.21", "host": "10.240.71.21", "port": 8101, "token": 3333333333 }] }, { "name": "us-east-1c", "servers": [{ "name": "10.240.71.91", "host": "10.240.71.91", "port": 8101, "token": 1111111111 }, { "name": "10.240.71.99", "host": "10.240.71.99", "port": 8101, "token": 2222222222 }, { "name": "10.240.71.109", "host": "10.240.71.109", "port": 8101, "token": 3333333333 }] }] }] }
Hi @ipapapa, I work with the OP and would like to append additional stack traces to hopefully get better guidance on what's happening.
Conductor_NoAvailableHostsException_Stacktrace.txt Dynomite_Stacktrace.txt
I am encountering a similar error with 2 nodes.
@dev-sungk Looking into the Dynomite stack trace, somebody is sending HTTP requests to the Dynomite/Redis port, so the Redis parser in Dynomite cannot deserialize those requests.
```
[2018-03-02 19:52:12.792] redis_parse_req:1836 parsed bad req 1171817 res 1 type 0 state 0
00000000 47 45 54 20 2f 76 65 72 73 69 6f 6e 20 48 54 54 |GET /version HTT|
00000010 50 2f 31 2e 31 0d 0a 55 73 65 72 2d 41 67 65 6e |P/1.1..User-Agen|
00000020 74 3a 20 63 75 72 6c 2f 37 2e 33 38 2e 30 0d 0a |t: curl/7.38.0..|
00000030 48 6f 73 74 3a 20 6c 6f 63 61 6c 68 6f 73 74 3a |Host: localhost:|
00000040 38 31 30 32 0d 0a 41 63 63 65 70 74 3a 20 2a 2f |8102..Accept: */|
00000050 2a 0d 0a 0d 0a |*....|
```
and
```
00000000 47 45 54 20 2f 52 45 53 54 2f 76 31 2f 61 64 6d |GET /REST/v1/adm|
00000010 69 6e 2f 73 74 61 74 75 73 20 48 54 54 50 2f 31 |in/status HTTP/1|
00000020 2e 31 0d 0a 55 73 65 72 2d 41 67 65 6e 74 3a 20 |.1..User-Agent: |
00000030 63 75 72 6c 2f 37 2e 33 38 2e 30 0d 0a 48 6f 73 |curl/7.38.0..Hos|
00000040 74 3a 20 6c 6f 63 61 6c 68 6f 73 74 3a 38 31 30 |t: localhost:810|
00000050 32 0d 0a 41 63 63 65 70 74 3a 20 2a 2f 2a 0d 0a |2..Accept: */*..|
00000060 0d 0a |..|
```
@ipapapa We have a script that runs periodically to check Dynomite health with a curl command (curl -sX GET http://{dynomiteHostIP}:22122/ping). This script lives on all 9 Dynomite instances and pings the other 8. If it detects that a node is down based on the response, it rebuilds the seeds in the .yml file and restarts that particular Dynomite instance (in other words, it does some of the work of Dynomite Manager because we're not using DM). Are you implying that this simple ping is what's causing our issue? Should we be using another port? Or is pinging dynomite periodically in this way not a good idea? @dev-sungk
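For reference, the check itself is trivial; below is a minimal Java sketch of what the script boils down to (the real script is a shell curl loop, the peer list is abbreviated, and the seed rebuild/restart step after a failed ping is not shown).

```java
// Illustration only: the periodic health check is just a GET against each peer's admin port.
import java.net.HttpURLConnection;
import java.net.URL;

public class DynomitePingCheck {

    // True if the node's admin endpoint answers GET /ping with HTTP 200.
    static boolean isUp(String host) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL("http://" + host + ":22122/ping").openConnection();
            conn.setConnectTimeout(2000);
            conn.setReadTimeout(2000);
            return conn.getResponseCode() == 200;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Abbreviated peer list; the real script checks the other 8 nodes.
        String[] peers = {"10.240.71.13", "10.240.71.32", "10.240.71.21"};
        for (String peer : peers) {
            if (!isUp(peer)) {
                System.out.println(peer + " appears down; rebuild dyn_seeds and restart it");
            }
        }
    }
}
```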
Do we still need to provide the token map as described here? Possibly related: https://github.com/Netflix/dyno/issues/47
@v1r3n @ipapapa @dev-sungk Many thanks to @saidatta for his previous comment; the issue was indeed related to the above Dyno issue. In the current release of Conductor (v1.8.1, and I've verified it's the same for all pre-releases since), on line 131 of ConductorServer.java, a call to ConnectionPoolConfigurationImpl.withTokenSupplier is made with a new TokenMapSupplier that has only one element in the activeHosts array; namely, a HostToken with value 1L. This totally ignores the Dynomite topology the user has set up, so if a hash happens to fall in the range covered by token value 1, you're in luck; if not, you get NoAvailableHostsException and your workflows start missing tasks in Redis (which is exactly what we have been experiencing).

In order to fix this, we had to modify 2 classes in the Dyno project, AbstractTokenMapSupplier and ConnectionPoolConfigurationImpl, to ignore the passed-in TokenMapSupplier, explode the conductor-server-all.jar, and replace these 2 classes. I have attached these source files for reference; below are the general changes I had to make:
AbstractTokenMapSupplier:
ConnectionPoolConfigurationImpl:
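The attached files contain the complete changes. Since they are not inlined here, the snippet below is only a rough sketch of the general idea (not the attached code): supply a token map that mirrors the real topology instead of the single hard-coded HostToken(1L). The Host constructor used here may differ slightly between Dyno versions, and only one rack from our cluster_describe output is shown.

```java
// Sketch only: a TokenMapSupplier reflecting the actual Dynomite topology rather than
// the single HostToken(1L) wired in by ConductorServer.
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import com.netflix.dyno.connectionpool.Host;
import com.netflix.dyno.connectionpool.Host.Status;
import com.netflix.dyno.connectionpool.HostToken;
import com.netflix.dyno.connectionpool.TokenMapSupplier;

public class TopologyTokenMapSupplier implements TokenMapSupplier {

    private final List<HostToken> hostTokens = new ArrayList<>();

    public TopologyTokenMapSupplier() {
        // us-east-1a rack from the cluster_describe output above.
        hostTokens.add(new HostToken(1111111111L, new Host("10.240.71.13", 8102, "us-east-1a", Status.Up)));
        hostTokens.add(new HostToken(2222222222L, new Host("10.240.71.32", 8102, "us-east-1a", Status.Up)));
        hostTokens.add(new HostToken(3333333333L, new Host("10.240.71.21", 8102, "us-east-1a", Status.Up)));
        // ... us-east-1c and us-east-1d nodes would be added the same way ...
    }

    @Override
    public List<HostToken> getTokens(Set<Host> activeHosts) {
        return hostTokens;
    }

    @Override
    public HostToken getTokenForHost(Host host, Set<Host> activeHosts) {
        for (HostToken hostToken : hostTokens) {
            if (hostToken.getHost().getHostName().equals(host.getHostName())) {
                return hostToken;
            }
        }
        return null;
    }
}
```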
To be honest, we are quite frustrated that we are unable to simply implement the latest releases of Conductor and Dynomite out of the box and have them work together without modifying your source code; we have wasted a lot of time and effort trying to track this down. Nowhere in the documentation for either product is this requirement to override source code addressed. We are also disappointed that we did not receive any guidance on this issue; if not for saidatta's comment we may never have happened upon the solution.
These changes need to be incorporated into the base product somehow (my changes as-is break some of your unit tests). I am leaving this issue open in case anyone else has any other comments or recommends a different way of accomplishing what we have done.
@kpdude7 I will have to read your latest post, but the error in the previous post seemed to be related to REST calls being sent to the Redis parser. It may be that an incorrect port is being used. The error was pretty obvious in the logs.
We have been explicit when we published Dynomite-manager that this is what we use to get the cluster description. There have been a few differences between what Dynomite provides in cluster_describe vs what Dynomite-manager provides. @shailesh33 has done some of the work to make it fairly compatible, but we never moved it to production, so it is true there might be a few discrepancies (and that is probably the reason for not having clear documentation on it). Your feedback is therefore very valuable, and it would be nice to file a PR to fix the issues or provide recommendations on what the documentation should include. Other users may have a similar issue.
@kpdude7 After implementing the fix, out of curiosity, have you experienced this error while starting a dynamic fork within a workflow?
```
214322 [qtp557705922-32] ERROR com.netflix.conductor.server.resources.GenericExceptionMapper - com.fasterxml.jackson.core.JsonStreamContext.<init>(II)V
java.lang.NoSuchMethodError: com.fasterxml.jackson.core.JsonStreamContext.<init>(II)V
at com.fasterxml.jackson.databind.util.TokenBufferReadContext.<init>(TokenBufferReadContext.java:59)
at com.fasterxml.jackson.databind.util.TokenBufferReadContext.createRootContext(TokenBufferReadContext.java:89)
at com.fasterxml.jackson.databind.util.TokenBuffer$Parser.<init>(TokenBuffer.java:1298)
at com.fasterxml.jackson.databind.util.TokenBuffer.asParser(TokenBuffer.java:276)
at com.fasterxml.jackson.databind.util.TokenBuffer.asParser(TokenBuffer.java:242)
at com.fasterxml.jackson.databind.ObjectMapper._convert(ObjectMapper.java:3719)
at com.fasterxml.jackson.databind.ObjectMapper.convertValue(ObjectMapper.java:3666)
at com.netflix.conductor.core.execution.DeciderService.getDynamicTasks(DeciderService.java:574)
at com.netflix.conductor.core.execution.DeciderService.getTasksToBeScheduled(DeciderService.java:491)
at com.netflix.conductor.core.execution.DeciderService.getTasksToBeScheduled(DeciderService.java:406)
at com.netflix.conductor.core.execution.DeciderService.getNextTask(DeciderService.java:281)
at com.netflix.conductor.core.execution.DeciderService.decide(DeciderService.java:156)
at com.netflix.conductor.core.execution.DeciderService.decide(DeciderService.java:92)
at com.netflix.conductor.core.execution.WorkflowExecutor.decide(WorkflowExecutor.java:513)
at com.netflix.conductor.core.execution.WorkflowExecutor.updateTask(WorkflowExecutor.java:472)
at com.netflix.conductor.service.ExecutionService.updateTask(ExecutionService.java:162)
at com.netflix.conductor.server.resources.TaskResource.updateTask(TaskResource.java:129)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
at com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
at com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
at com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:302)
at com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
at com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
at com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1542)
at com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1473)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1419)
at com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1409)
at com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:409)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:558)
at com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:733)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at com.google.inject.servlet.ServletDefinition.doServiceImpl(ServletDefinition.java:286)
at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:276)
at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:181)
at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:85)
at com.netflix.conductor.server.JerseyModule$1.doFilter(JerseyModule.java:99)
at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:82)
at com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:120)
at com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:135)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1676)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:581)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1174)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:511)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1106)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:134)
at org.eclipse.jetty.server.Server.handle(Server.java:524)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:319)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:253)
at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:273)
at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:95)
at org.eclipse.jetty.io.SelectChannelEndPoint$2.run(SelectChannelEndPoint.java:93)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:303)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
at java.lang.Thread.run(Thread.java:748)
```
@kpdude7 Ignore my previous message; it was due to a dependency conflict on my end and got resolved after I moved to the latest version.
@kpdude7 BTW, I believe you said you had to explode the jar file and add the tokens. Instead, you can just take the implemented TokenMapSupplier class and put it in ConductorServer in place of the default one.
This gives you more freedom. For example, I hooked up the configurable tokens to system environment variables, so you just have to change them when running the compiled jar, without going through the rebuild process again.
All that aside, I assume @ipapapa 's suggestion was to let "dynomite-manager" handle these changes.
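For reference, here is a rough sketch (not my actual code; the variable name and entry format are just examples, and the Host constructor may vary between Dyno versions) of what I mean by wiring the token map through an environment variable so the tokens can change without a rebuild:

```java
// Example only: build the HostToken list from a hypothetical DYN_HOST_TOKENS variable
// with entries of the form host:port:rack:token separated by ';'.
import java.util.ArrayList;
import java.util.List;

import com.netflix.dyno.connectionpool.Host;
import com.netflix.dyno.connectionpool.Host.Status;
import com.netflix.dyno.connectionpool.HostToken;

public class EnvHostTokens {

    public static List<HostToken> fromEnv() {
        List<HostToken> tokens = new ArrayList<>();
        String spec = System.getenv("DYN_HOST_TOKENS");
        if (spec == null || spec.trim().isEmpty()) {
            return tokens;
        }
        for (String entry : spec.split(";")) {
            String[] parts = entry.split(":");
            Host host = new Host(parts[0], Integer.parseInt(parts[1]), parts[2], Status.Up);
            tokens.add(new HostToken(Long.parseLong(parts[3]), host));
        }
        return tokens;
    }
}
```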
We currently have 3 Conductor instances backed by an ELB in AWS, one for each of 3 AZ's. We also have 3 Dynomite instances per AZ, for a total of 9. When we test assets through our workflows, sometimes an asset will get stuck in a workflow, and when viewing the workflow via the API, steps are just missing at the end. We also see periodic NoAvailableHostsExceptions in the Conductor logs:
```
Feb 21 16:43:36 ip-10-240-71-124 sh[17714]: %93005 [qtp1041451158-52] ERROR com.netflix.conductor.server.resources.GenericExceptionMapper - NoAvailableHostsException: [host=Host [hostname=UNKNOWN, ipAddress=UNKNOWN, port=0, rack: UNKNOWN, datacenter: UNKNOW, status: Down], latency=0(0), attempts=0]Token not found for key hash: 151863863
Feb 21 16:43:36 ip-10-240-71-124 sh[17714]: %com.netflix.dyno.connectionpool.exception.NoAvailableHostsException: NoAvailableHostsException: [host=Host [hostname=UNKNOWN, ipAddress=UNKNOWN, port=0, rack: UNKNOWN, datacenter: UNKNOW, status: Down], latency=0(0), attempts=0]Token not found for key hash: 151863863
Feb 21 16:43:36 ip-10-240-71-124 sh[17714]: at com.netflix.dyno.connectionpool.impl.hash.BinarySearchTokenMapper.getToken(BinarySearchTokenMapper.java:68)
```
We are running Conductor 1.8.2, Dynomite dynomite-v0.5.9-5_MuslCompatiblity, and redis 3.2.10. Below are representative conductor.yml and dynomite.yml files:
conductor.yml:
```
db=dynomite
workflow.dynomite.cluster.hosts=10.240.71.13:8102:us-east-1a;10.240.71.32:8102:us-east-1a;10.240.71.21:8102:us-east-1a;10.240.71.91:8102:us-east-1c;10.240.71.99:8102:us-east-1c;10.240.71.109:8102:us-east-1c;10.240.71.138:8102:us-east-1d;10.240.71.179:8102:us-east-1d;10.240.71.159:8102:us-east-1d
workflow.dynomite.cluster.name=dynomite_cluster_sit
workflow.namespace.prefix=conductor
workflow.namespace.queue.prefix=conductor_queues_sit
queues.dynomite.threads=100
queues.dynomite.nonQuorum.port=22122
workflow.elasticsearch.url=10.240.71.30:9300
workflow.elasticsearch.index.name=conductor
server.connection-timeout=60000
workflow.system.task.worker.poll.count=25
workflow.system.task.worker.thread.count=25
logging.level.com.netflix.conductor=INFO
EC2_AVAILABILITY_ZONE=us-east-1d
```
dynomite.yml:
```yaml
dyn_o_mite:
  datacenter: us-east-1
  rack: us-east-1a
  listen: 0.0.0.0:8102
  dyn_listen: 0.0.0.0:8101
  dyn_seed_provider: simple_provider
  dyn_seeds:
```
During one of these occurrences, I noticed that of our 9 dynomites, one of them (10.240.71.16) was missing one of the dyn_seeds (10.240.71.70); when I looked at the log for 16, I saw:
```
[2018-02-20 19:53:46.622] dnode_accept:168 Accepting client connection from 10.240.71.70:62472 on sd 22
[2018-02-20 19:53:46.622] event_add_conn:207 adding conn <LOCAL_PEER_CLIENT 0x1527ba0 22 from '10.240.71.70:62472'> to active
[2018-02-20 19:53:46.622] dnode_accept:210 <PEER_PROXY 0x14c74d0 12 listening on '0.0.0.0:8101'> accepted <LOCAL_PEER_CLIENT 0x1527ba0 22 from '10.240.71.70:62472'>
[2018-02-20 19:53:46.622] event_del_out:181 removing conn <LOCAL_PEER_CLIENT 0x1527ba0 22 from '10.240.71.70:62472'> from active
[2018-02-20 20:04:32.876] dnode_accept:168 Accepting client connection from 10.240.71.70:63703 on sd 22
[2018-02-20 20:04:32.876] event_add_conn:207 adding conn <LOCAL_PEER_CLIENT 0x888ef0 22 from '10.240.71.70:63703'> to active
[2018-02-20 20:04:32.876] dnode_accept:210 <PEER_PROXY 0x8874d0 12 listening on '0.0.0.0:8101'> accepted <LOCAL_PEER_CLIENT 0x888ef0 22 from '10.240.71.70:63703'>
[2018-02-20 20:04:32.876] event_del_out:181 removing conn <LOCAL_PEER_CLIENT 0x888ef0 22 from '10.240.71.70:63703'> from active
```
I'm not sure why these connections keep dropping, but this dynomite instance never re-accepted the connection from 70.
We have tried everything under the sun to address this issue, from increasing mbuf_size to decreasing our payload from one workflow task step to the next, and still see this issue.