Open YufengXin opened 4 years ago
(1) Adding a backup port in the manifest file has proved to work. (2) We now need to automate the computation of the new spanning tree.
I will add some findings here. The following steps were performed on the RENCI testbed setup.
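As a starting point for automating the spanning tree computation, here is a minimal sketch, not the controller's actual code. It assumes the topology is available as a list of undirected links and recomputes a breadth-first spanning tree while skipping one failed link. The function name `spanning_tree`, the switch name `ncsus1`, and the example links are all hypothetical.

```python
from collections import deque

def spanning_tree(links, root, failed=None):
    """Breadth-first spanning tree over undirected links.

    `failed` is an optional (a, b) pair naming one link to skip.
    Returns a dict mapping each reachable node to its parent in the
    tree (the root maps to None). Unreachable nodes are left out.
    """
    failed_link = frozenset(failed) if failed else None
    adjacency = {}
    for a, b in links:
        if failed_link is not None and frozenset((a, b)) == failed_link:
            continue  # pretend the failed link does not exist
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # sorted() keeps the result deterministic between runs
        for neighbour in sorted(adjacency.get(node, [])):
            if neighbour not in parent:
                parent[neighbour] = node
                queue.append(neighbour)
    return parent

# Hypothetical triangle topology; the links are invented for illustration.
links = [("rencis1", "dukes1"), ("rencis1", "ncsus1"), ("dukes1", "ncsus1")]
tree = spanning_tree(links, "rencis1")
backup_tree = spanning_tree(links, "rencis1", failed=("rencis1", "dukes1"))
```

With the `rencis1`-`dukes1` link marked as failed, `dukes1` is reached through `ncsus1` instead. A node missing from the returned dict would mean the single failure partitions the topology, which is worth reporting rather than silently ignoring.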
[root@atlanticwave-sdx-controller script-sdx]# ./curl-0.sh -c cookie-mcevik.txt -o get_policies
==============================================================================
--- Get POLICY - http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies
==============================================================================
{
  "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies",
  "links": {
    "policy2": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/2", "policynumber": 2, "type": "FloodTree", "user": "SDXCTLR" },
    "policy3": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/3", "policynumber": 3, "type": "EdgePort", "user": "SDXCTLR" },
    "policy4": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/4", "policynumber": 4, "type": "EdgePort", "user": "SDXCTLR" },
    "policy5": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/5", "policynumber": 5, "type": "EdgePort", "user": "SDXCTLR" },
    "policy6": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/6", "policynumber": 6, "type": "EdgePort", "user": "SDXCTLR" },
    "policy7": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/7", "policynumber": 7, "type": "EdgePort", "user": "SDXCTLR" }
  }
}
3. Stop Local Controller at DUKE
- Policy #7 is removed and Policy #9 is pushed
[root@atlanticwave-sdx-controller script-sdx]# ./curl-0.sh -c cookie-mcevik.txt -o get_policies
==============================================================================
--- Get POLICY - http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies
==============================================================================
{
  "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies",
  "links": {
    "policy2": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/2", "policynumber": 2, "type": "FloodTree", "user": "SDXCTLR" },
    "policy3": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/3", "policynumber": 3, "type": "EdgePort", "user": "SDXCTLR" },
    "policy4": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/4", "policynumber": 4, "type": "EdgePort", "user": "SDXCTLR" },
    "policy5": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/5", "policynumber": 5, "type": "EdgePort", "user": "SDXCTLR" },
    "policy6": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/6", "policynumber": 6, "type": "EdgePort", "user": "SDXCTLR" },
    "policy9": { "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/9", "policynumber": 9, "type": "ManagementSDXRecover", "user": "SDXCTLR" }
  }
}
[root@atlanticwave-sdx-controller script-sdx]# ./curl-0.sh -c cookie-mcevik.txt -o get_policy -N 9
==============================================================================
--- Get POLICY - http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/9
==============================================================================
{
  "policy9": {
    "href": "http://atlanticwave-sdx-controller.renci.ben:5000/api/v1/policies/number/9",
    "json": { "ManagementSDXRecover": { "switch": "rencis1" } },
    "policynumber": "9",
    "type": "ManagementSDXRecover",
    "user": "SDXCTLR"
  }
}
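The change between the two policy listings (policy 7 removed, policy 9 added) can also be checked mechanically. The sketch below assumes only the `links` / `policynumber` / `type` fields visible in the responses above; the helper names `policy_types` and `diff_policies` are made up for illustration, and the embedded JSON is trimmed to two entries per listing.

```python
import json

def policy_types(listing_json):
    """Map policy number -> policy type from a policies listing response."""
    links = json.loads(listing_json)["links"]
    return {entry["policynumber"]: entry["type"] for entry in links.values()}

def diff_policies(before, after):
    """Return (removed, added) dicts of policy number -> type."""
    removed = {n: t for n, t in before.items() if n not in after}
    added = {n: t for n, t in after.items() if n not in before}
    return removed, added

# Trimmed copies of the two listings, keeping only the fields used here.
before = policy_types(
    '{"links": {"policy6": {"policynumber": 6, "type": "EdgePort"},'
    ' "policy7": {"policynumber": 7, "type": "EdgePort"}}}'
)
after = policy_types(
    '{"links": {"policy6": {"policynumber": 6, "type": "EdgePort"},'
    ' "policy9": {"policynumber": 9, "type": "ManagementSDXRecover"}}}'
)
removed, added = diff_policies(before, after)
print("removed:", removed)
print("added:", added)
```

Note that the single-policy response reports `policynumber` as the string `"9"` while the listing uses the integer `9`, so a comparison across the two endpoints would need to normalize the type first.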
Logs on SDX Controller
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
2020-07-29 16:03:53,924 sdxcontroller.usermanager: 140017185625856 INFO getting user: mcevik
INFO:sdxcontroller.usermanager:getting user: mcevik
INFO:werkzeug:192.168.201.156 - - [29/Jul/2020 16:03:53] "GET /api/v1/policies HTTP/1.1" 200 -
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
2020-07-29 16:03:59,320 sdxcontroller.usermanager: 140017185625856 INFO getting user: mcevik
INFO:sdxcontroller.usermanager:getting user: mcevik
INFO:werkzeug:192.168.201.156 - - [29/Jul/2020 16:03:59] "GET /api/v1/policies HTTP/1.1" 200 -
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
2020-07-29 16:04:01,708 sdxcontroller.usermanager: 140017185625856 INFO getting user: mcevik
INFO:sdxcontroller.usermanager:getting user: mcevik
INFO:werkzeug:192.168.201.156 - - [29/Jul/2020 16:04:01] "GET /api/v1/policies HTTP/1.1" 200 -
SDX Closing: Missing a heartbeat on 0x7f585009f2d0
SDX Heartbeat Closing due to error on 0x7f585009f2d0
ATTRIBUTE ERROR 'NoneType' object has no attribute 'recv' ON CXN Connection: address: 10.14.11.2 port: 34554 recv_cb: None recv_thread: None sock: None
2020-07-29 16:04:05,465 sdxcontroller: 140017371850496 WARNING Removing connection Connection: address: 10.14.11.2 port: 34554 recv_cb: None recv_thread: None sock: None
WARNING:sdxcontroller:Removing connection Connection: address: 10.14.11.2 port: 34554 recv_cb: None recv_thread: None sock: None
2020-07-29 16:04:05,465 sdxcontroller: 140017371850496 DEBUG Local Controller Lost connection: dukectlr
DEBUG:sdxcontroller:Local Controller Lost connection: dukectlr
2020-07-29 16:04:05,468 sdxcontroller: 140017371850496 DEBUG Getting backup LC.
DEBUG:sdxcontroller:Getting backup LC.
2020-07-29 16:04:05,468 sdxcontroller: 140017371850496 DEBUG Got backup LC: rencis1
DEBUG:sdxcontroller:Got backup LC: rencis1
2020-07-29 16:04:05,472 sdxcontroller.rulemanager: 140017371850496 DEBUG Sending remove breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f5850e469d0>
DEBUG:sdxcontroller.rulemanager:Sending remove breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f5850e469d0>
rule_rm_callback - EdgePort(dukes1)
2020-07-29 16:04:05,474 sdxcontroller.rulemanager: 140017371850496 INFO add_rule: Beging with rule: ManagementSDXRecover(rencis1)
INFO:sdxcontroller.rulemanager:add_rule: Beging with rule: ManagementSDXRecover(rencis1)
2020-07-29 16:04:05,475 debug.sdxcontroller.rulemanager: 140017371850496 INFO add_rule: breakdowns [<shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>]
INFO:debug.sdxcontroller.rulemanager:add_rule: breakdowns [<shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>]
2020-07-29 16:04:05,475 debug.sdxcontroller.rulemanager: 140017371850496 INFO add_rule: hash and cookies set to 9
INFO:debug.sdxcontroller.rulemanager:add_rule: hash and cookies set to 9
2020-07-29 16:04:05,475 debug.sdxcontroller.rulemanager: 140017371850496 INFO _add_rule_to_db: ManagementSDXRecover(rencis1):9
INFO:debug.sdxcontroller.rulemanager:_add_rule_to_db: ManagementSDXRecover(rencis1):9
2020-07-29 16:04:05,476 debug.sdxcontroller.rulemanager: 140017371850496 INFO ACTIVE_RULE
INFO:debug.sdxcontroller.rulemanager: ACTIVE_RULE
2020-07-29 16:04:05,476 debug.sdxcontroller.rulemanager: 140017371850496 DEBUG _install_rule: ManagementSDXRecover(rencis1):9
DEBUG:debug.sdxcontroller.rulemanager:_install_rule: ManagementSDXRecover(rencis1):9
2020-07-29 16:04:05,476 sdxcontroller.rulemanager: 140017371850496 DEBUG Sending install breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>
DEBUG:sdxcontroller.rulemanager:Sending install breakdown: <shared.UserPolicy.UserPolicyBreakdown object at 0x7f585007a450>
2020-07-29 16:04:05,476 sdxcontroller.rulemanager: 140017371850496 DEBUG ManagementSDXRecoverRule: switch 201
DEBUG:sdxcontroller.rulemanager: ManagementSDXRecoverRule: switch 201
rule_add_callback - ManagementSDXRecover(rencis1)
2020-07-29 16:04:05,479 debug.sdxcontroller.rulemanager: 140017371850496 INFO add_rule: Rule added to db: ManagementSDXRecover(rencis1)
INFO:debug.sdxcontroller.rulemanager:add_rule: Rule added to db: ManagementSDXRecover(rencis1)
2020-07-29 16:04:05,479 sdxcontroller: 140017371850496 WARNING Removing connection Connection: address: 10.14.11.2 port: 34554 recv_cb: None recv_thread: None sock: None
WARNING:sdxcontroller:Removing connection Connection: address: 10.14.11.2 port: 34554 recv_cb: None recv_thread: None sock: None
140017371850496 hb_response_handler: HBRESP: ['MAIN_PHASE']-None-None
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
4. While rebuilding the VFC on the DUKE switch, I noticed that the SDX controller crashed with the logs below.
For some reason, the NCSU connection (10.14.11.4) is affected.
One data point for debugging: renci_ben.manifest may contain errors (or be missing statements) for the failover/backup ports. Safety checks for configuration errors would help keep the system from crashing.
140017371850496 hb_request_handler: HBREQ: ['MAIN_PHASE']-None-None
SDX Closing: Missing a heartbeat on 0x7f5850942390
SDX Heartbeat Closing due to error on 0x7f5850942390
2020-07-29 16:07:22,894 sdxcontroller: 140017371850496 WARNING Removing connection Connection: address: 10.14.11.4 port: 44498 recv_cb: None recv_thread: None sock: None
WARNING:sdxcontroller:Removing connection Connection: address: 10.14.11.4 port: 44498 recv_cb: None recv_thread: None sock: None
2020-07-29 16:07:22,895 sdxcontroller: 140017371850496 DEBUG Local Controller Lost connection: ncsuctlr
DEBUG:sdxcontroller:Local Controller Lost connection: ncsuctlr
2020-07-29 16:07:22,899 sdxcontroller: 140017371850496 DEBUG Getting backup LC.
DEBUG:sdxcontroller:Getting backup LC.
Traceback (most recent call last):
File "SDXController.py", line 423, in
With the corrected manifest, the exception above no longer occurs.
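The kind of safety check suggested earlier could look roughly like the sketch below. The thread does not say what was wrong in the manifest, and the manifest schema is not shown here, so the `backup_switch` field, the dict layout, and the function name `check_backup_entries` are all assumptions for illustration only. The idea is to collect configuration problems at load time and report them, rather than letting a missing entry surface later as an exception in the "Getting backup LC" path.

```python
def check_backup_entries(local_controllers, switches):
    """Return a list of human-readable problems; an empty list means OK.

    `local_controllers` maps an LC name to its (hypothetical) config dict,
    and `switches` is the set of switch names defined in the manifest.
    """
    problems = []
    for name, config in sorted(local_controllers.items()):
        backup = config.get("backup_switch")  # hypothetical field name
        if backup is None:
            problems.append(f"{name}: no backup switch configured")
        elif backup not in switches:
            problems.append(f"{name}: backup switch {backup!r} is not defined")
    return problems

# Invented example data; only the controller and switch names come from the logs.
local_controllers = {
    "dukectlr": {"backup_switch": "rencis1"},
    "ncsuctlr": {},
}
switches = {"rencis1", "dukes1"}

problems = check_backup_entries(local_controllers, switches)
for problem in problems:
    print("manifest warning:", problem)
```

Returning a list instead of raising lets the caller decide whether to refuse to start or to run without failover for the affected LC.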
(1) LC shutdown and stateful recovery are validated on the RENCI testbed: the switches kept all the flow rules.
(2) Mininet runs into an issue because port occupancy is not cleared.
(3) On-disk database and manifest at the LC.
(4) Working on adding a backup port for management-plane resiliency under link failure, assuming one link failure at a time.