fabric-testbed / fabfed

FABRIC Tool-based Federation Kit for a Testbed of Testbeds
MIT License
3 stars 0 forks source link

Concurrency Issues #82

Closed zlion closed 6 months ago

zlion commented 10 months ago

2024-01-06 16:12:04,525 [controller.py:171] [ERROR] 400 POST https://compute.googleapis.com/compute/v1/projects/fabfed/regions/us-east4/routers: Exceeded maximum allowed routers in same network and region: 5.

Solution: Use extra VPCs to host more than 5 routers.

Creating VLAN attachments failed. Error: Operation type [insert] failed with message "Quota 'INTERCONNECT_ATTACHMENTS_PER_REGION' exceeded. Limit: 16.0 in region us-east4."

2024-01-15 14:42:57,710 [controller.py:171] [ERROR] An error occurred (VpnGatewayLimitExceeded) when calling the CreateVpnGateway operation: The maximum number of virtual private gateways has been reached.

2024-01-06 16:17:34,010 [transport.py:1873] [ERROR] Secsh channel 0 open FAILED: Connection refused: Connect failed
2024-01-06 16:17:54,875 [controller.py:171] [ERROR] Please Apply again. Fabric slice gcp-11 has exception during post networking setup: ChannelException(2, 'Connect failed')

More details:

2024-01-13 10:18:16,031 [transport.py:1873] [ERROR] Secsh channel 0 open FAILED: Connection refused: Connect failed
2024-01-13 10:18:16,033 [node.py:1537] [WARNING] Exception in node.execute() (attempt #0 of 3): ChannelException(2, 'Connect failed')
2024-01-13 10:18:26,037 [node.py:1559] [DEBUG] Exception in client.close(): local variable 'client' referenced before assignment
2024-01-13 10:18:26,038 [node.py:1564] [DEBUG] Exception in bastion_channel.close(): local variable 'bastion_channel' referenced before assignment
2024-01-13 10:18:26,732 [transport.py:1873] [ERROR] Secsh channel 0 open FAILED: Connection refused: Connect failed
2024-01-13 10:18:26,733 [node.py:1537] [WARNING] Exception in node.execute() (attempt #1 of 3): ChannelException(2, 'Connect failed')
2024-01-13 10:18:36,736 [node.py:1559] [DEBUG] Exception in client.close(): local variable 'client' referenced before assignment
2024-01-13 10:18:36,737 [node.py:1564] [DEBUG] Exception in bastion_channel.close(): local variable 'bastion_channel' referenced before assignment
2024-01-13 10:18:37,252 [transport.py:1873] [ERROR] Secsh channel 0 open FAILED: Connection refused: Connect failed
2024-01-13 10:18:37,253 [node.py:1537] [WARNING] Exception in node.execute() (attempt #2 of 3): ChannelException(2, 'Connect failed')
2024-01-13 10:18:37,254 [node.py:1559] [DEBUG] Exception in client.close(): local variable 'client' referenced before assignment
2024-01-13 10:18:37,254 [node.py:1564] [DEBUG] Exception in bastion_channel.close(): local variable 'bastion_channel' referenced before assignment
2024-01-13 10:18:37,255 [node.py:2279] [WARNING] Failed to stop network manager: ChannelException(2, 'Connect failed')
2024-01-13 10:18:37,255 [controller.py:171] [ERROR] Please Apply again. Fabric slice aws-1 has exception during post networking setup: ChannelException(2, 'Connect failed')
Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 311, in _handle_node_networking
    self._do_handle_node_networking()
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 272, in _do_handle_node_networking
    delegate.network_manager_stop()
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py", line 2280, in network_manager_stop
    raise e
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py", line 2271, in network_manager_stop
    stdout, stderr = self.execute(
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py", line 1542, in execute
    raise e
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/node.py", line 1402, in execute
    bastion_channel = bastion_transport.open_channel(
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/paramiko/transport.py", line 1067, in open_channel
    raise e
paramiko.ssh_exception.ChannelException: ChannelException(2, 'Connect failed')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/controller/controller.py", line 168, in create
    provider.create_resource(resource=resource.attributes)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 178, in create_resource
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 175, in create_resource
    self.do_create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_provider.py", line 94, in do_create_resource
    self.slice.create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 348, in create_resource
    self._handle_node_networking()
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 313, in _handle_node_networking
    raise Exception(
Exception: Please Apply again. Fabric slice aws-1 has exception during post networking setup: ChannelException(2, 'Connect failed')
2024-01-13 10:18:37,259 [controller.py:171] [ERROR] fabric slice aws-1 has some failures. Refusing to create fabric_network@network
Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/controller/controller.py", line 168, in create
    provider.create_resource(resource=resource.attributes)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 178, in create_resource
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 175, in create_resource
    self.do_create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_provider.py", line 94, in do_create_resource
    self.slice.create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 320, in create_resource
    raise Exception(f"fabric slice {self.name} has some failures. Refusing to create {label}")
Exception: fabric slice aws-1 has some failures. Refusing to create fabric_network@network

Solution: pending

2024-01-06 16:15:51,056 [controller.py:171] [ERROR] Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Node: fabric_node0, Site: MAX, State: Active, 
Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Node: fabric_node0, Site: MAX, State: Active, 
failed lease update- all units failed priming: Exception during create for unit: 4f68ad6f-5ba7-4a3f-a70f-506dbb84f895 (PlaybookException(Playbook has failed tasks: status_code: 400, status: Bad Request, title: Invalid request, detail: The request was understood but invalid.), [core1.wash:HundredGigE0/0/0/28, agg4.ashb:Bundle-Ether5])#all units failed priming: Exception during create for unit: 4f68ad6f-5ba7-4a3f-a70f-506dbb84f895 (PlaybookException(Playbook has failed tasks: status_code: 400, status: Bad Request, title: Invalid request, detail: The request was understood but invalid.), [core1.wash:HundredGigE0/0/0/28, agg4.ashb:Bundle-Ether5])#
Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Node: fabric_node0, Site: MAX, State: Active, 
Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Slice Exception: Slice Name: gcp-12, Slice ID: 99c3b195-67a0-4bd0-8180-14ee36d8d8de: Node: fabric_node0, Site: MAX, State: Active, 
failed lease update- all units failed priming: Exception during create for unit: 4f68ad6f-5ba7-4a3f-a70f-506dbb84f895 (PlaybookException(Playbook has failed tasks: status_code: 400, status: Bad Request, title: Invalid request, detail: The request was understood but invalid.), [core1.wash:HundredGigE0/0/0/28, agg4.ashb:Bundle-Ether5])#all units failed priming: Exception during create for unit: 4f68ad6f-5ba7-4a3f-a70f-506dbb84f895 (PlaybookException(Playbook has failed tasks: status_code: 400, status: Bad Request, title: Invalid request, detail: The request was understood but invalid.), [core1.wash:HundredGigE0/0/0/28, agg4.ashb:Bundle-Ether5])#

Solution: need to reproduce

2024-01-06 21:42:41,334 [fabric_slice.py:169] [INFO] Waiting for slice gcp-4 to be stable
2024-01-06 21:48:48,513 [controller.py:171] [ERROR]  Timeout exceeded (360 sec). Slice: gcp-4 (Configuring)
Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/controller/controller.py", line 168, in create
    provider.create_resource(resource=resource.attributes)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 178, in create_resource
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 175, in create_resource
    self.do_create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_provider.py", line 77, in do_create_resource
    self.slice.create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 338, in create_resource
    self._submit_and_wait()
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 191, in _submit_and_wait
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 170, in _submit_and_wait
    self.slice_object.wait(progress=True)
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py", line 1518, in wait
    raise Exception(
Exception:  Timeout exceeded (360 sec). Slice: gcp-4 (Configuring)
2024-01-06 21:48:48,521 [fabfed.py:84] [ERROR] Exceptions while creating resources ... Number Of Exceptions=1:[
msg= Timeout exceeded (360 sec). Slice: gcp-4 (Configuring)

Solution: Increase the time out to 24*360? in the fabfed code.

2024-01-07 16:06:47,146 [controller.py:171] [ERROR] Slice Exception: Slice Name: aws-1, Slice ID: b271caa4-dd56-4688-b489-2c6ed4ad7bcc: Slice Exception: Slice Name: aws-1, Slice ID: b271caa4-dd56-4688-b489-2c6ed4ad7bcc: Node: fabric_node0, Site: MAX, State: Active, 

failed lease update- all units failed priming: Exception during create for unit: 415b6af4-6c45-43c8-b89d-6ea7a1679c74 (PlaybookException(Playbook has failed tasks: Status code was -1 and not [200]: Connection failure: The read operation timed out), [agg3.ashb:TenGigE0/0/0/11/3, core1.wash:HundredGigE0/0/0/28])#all units failed priming: Exception during create for unit: 415b6af4-6c45-43c8-b89d-6ea7a1679c74 (PlaybookException(Playbook has failed tasks: Status code was -1 and not [200]: Connection failure: The read operation timed out), [agg3.ashb:TenGigE0/0/0/11/3, core1.wash:HundredGigE0/0/0/28])#

Solution: refer to 7

2024-01-08 00:26:31,396 [controller.py:171] [ERROR] Slice Exception: Slice Name: aws-1, Slice ID: 45aa7a2c-d1af-4b32-b4d5-758af6465346: Slice Exception: Slice Name: aws-1, Slice ID: 45aa7a2c-d1af-4b32-b4d5-758af6465346: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: c2712582-c579-4adb-893b-2a6c80204352 Playbook has failed tasks: Error in creating the server (no further information available)#all units failed priming: Exception during create for unit: c2712582-c579-4adb-893b-2a6c80204352 Playbook has failed tasks: Error in creating the server (no further information available)#
Slice Exception: Slice Name: aws-1, Slice ID: 45aa7a2c-d1af-4b32-b4d5-758af6465346: Slice Exception: Slice Name: aws-1, Slice ID: 45aa7a2c-d1af-4b32-b4d5-758af6465346: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: c2712582-c579-4adb-893b-2a6c80204352 Playbook has failed tasks: Error in creating the server (no further information available)#all units failed priming: Exception during create for unit: c2712582-c579-4adb-893b-2a6c80204352 Playbook has failed tasks: Error in creating the server (no further information available)#

Solution: Komal finds out it is issue with VMs on MAX site

Your VM was created on MAX, libvirt logs suggest that the disk for the VM failed and hence openstack terminated. Unfortunately, openstack doesn't return this information back to the am so the error message is not clear.

You should have access to the Site Head Nodes in this case [max-hn.fabric-testbed.net](http://max-hn.fabric-testbed.net/) and then looking for logs on the specific worker to debug. If you don't have access to head nodes, please check with Mert/Hussam.

2024-01-11 10:19:50,526 [fabric_slice.py:177] [WARNING] Exception occurred while waiting state=StableError:Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#
Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#

Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#
Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#

2024-01-11 10:19:50,526 [controller.py:171] [ERROR] Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#
Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#

Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#
Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#

Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/controller/controller.py", line 168, in create
    provider.create_resource(resource=resource.attributes)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 178, in create_resource
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 175, in create_resource
    self.do_create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_provider.py", line 94, in do_create_resource
    self.slice.create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 346, in create_resource
    self._submit_and_wait()
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 199, in _submit_and_wait
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 178, in _submit_and_wait
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 174, in _submit_and_wait
    self.slice_object.wait(timeout=24 * 60, progress=True)
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py", line 1509, in wait
    raise Exception(str(exception_string))
Exception: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#
Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Slice Exception: Slice Name: gcp-17, Slice ID: 87fb0a08-1753-4bd7-b8be-b26b1952d52d: Node: fabric_node0, Site: MAX, State: Closed, failed lease update- all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#all units failed priming: Exception during create for unit: db859b69-1a99-4db4-a01f-c1ec3fafd79e Playbook has failed tasks: Timeout waiting for the server to come up.#

This is related to 5 or 6 above. The FabFed reported an issue without details. The log on AL2Sam server shows more details as following.

2024-01-09 22:50:33,177 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-89]- INFO - {"changed": false, "elapsed": 30, "msg": "Status code was -1 and not [200]: Connection failure: The read operation timed out", "redirected": false, "status": -1, "url": "https://api.ns.internet2.edu/v1/footprint/cloudconnect"}

More log details are as follows.

2024-01-09 22:49:40,789 - al2s-am-handler - {handler_base.py:53} - [TickEvent-89]- DEBUG - Reading config file: /etc/fabric/actor/config/al2s_handler_config.yml
2024-01-09 22:49:40,796 - al2s-am-handler - {al2s_handler.py:174} - [TickEvent-89]- DEBUG - Executing playbook /etc/fabric/actor/playbooks/al2s.yml to create Network Service
2024-01-09 22:50:33,176 - al2s-am-handler - {ansible_helper.py:225} - [TickEvent-89]- DEBUG - Playbook result: 2
2024-01-09 22:50:33,176 - al2s-am-handler - {ansible_helper.py:231} - [TickEvent-89]- DEBUG - OK:
2024-01-09 22:50:33,176 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-89]- INFO - {"changed": false, "include": "list-cloudconnects.yml", "include_args": {}}
2024-01-09 22:50:33,177 - al2s-am-handler - {ansible_helper.py:233} - [TickEvent-89]- DEBUG - Failed:
2024-01-09 22:50:33,177 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-89]- INFO - {"changed": false, "elapsed": 30, "msg": "Status code was -1 and not [200]: Connection failure: The read operation timed out", "redirected": false, "status": -1, "url": "https://api.ns.internet2.edu/v1/footprint/cloudconnect"}
2024-01-09 22:50:33,177 - al2s-am-handler - {ansible_helper.py:235} - [TickEvent-89]- DEBUG - Unreachable:
2024-01-09 22:50:33,180 - al2s-am-handler - {al2s_handler.py:276} - [TickEvent-89]- DEBUG - Executing playbook /etc/fabric/actor/playbooks/al2s.yml to delete Network Service
2024-01-09 22:54:16,043 - al2s-am-handler - {ansible_helper.py:225} - [TickEvent-89]- DEBUG - Playbook result: 0
2024-01-09 22:54:16,044 - al2s-am-handler - {ansible_helper.py:231} - [TickEvent-89]- DEBUG - OK:
2024-01-09 22:54:16,044 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-89]- INFO - {
    "ansible_env.HOME": "/root",
    "changed": false
}
2024-01-09 22:54:16,044 - al2s-am-handler - {ansible_helper.py:233} - [TickEvent-89]- DEBUG - Failed:
2024-01-09 22:54:16,044 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-89]- INFO - {"changed": false, "connection": "close", "content_length": "952", "content_type": "application/json", "cookies": {}, "cookies_string": "", "date": "Tue, 09 Jan 2024 22:54:01 GMT", "elapsed": 2, "failed_when_result": true, "json": {"actions": ["vspace_add_collaborator", "vspace_create", "vspace_delete", "vspace_remove_collaborator", "vspace_retrieve", "vspace_update"], "cloudConnections": {"aws": false, "azure": false, "gcp": false, "oci": false}, "collaborators": [], "created": {"at": "2024-01-09T22:49:44.768991+00:00", "by": {"email": "(missing)", "name": "(missing)", "personCode": "0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804"}}, "display": "VNSPACE-11101", "gnocdbServiceId": null, "modified": {"at": "2024-01-09T22:49:44.768991+00:00", "by": {"email": "(missing)", "name": "(missing)", "personCode": "0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804"}}, "name": 11101, "notes": "", "ok": true, "owner": {"membershipCategory": "Undefined", "orgName": "FABRIC", "orgPath": "i2/netres/fabric"}, "title": "aws-6-296256999979", "virtualRouterIds": ["74fe0d81-2520-44bf-8dd0-6a8aa8998df1"], "virtualSpaceId": "0a2469b0-a68d-4f97-b6c3-63450654c7df", "virtualSwitchIds": []}, "msg": "OK (952 bytes)", "redirected": false, "status": 200, "strict_transport_security": "max-age=15724800; includeSubDomains", "url": "https://api.ns.internet2.edu/v1/virtualnetworks/spaces/0a2469b0-a68d-4f97-b6c3-63450654c7df?consolidated=false"}
2024-01-09 22:54:16,044 - al2s-am-handler - {ansible_helper.py:235} - [TickEvent-89]- DEBUG - Unreachable:
2024-01-09 22:54:16,045 - al2s-am-handler - {al2s_handler.py:292} - [TickEvent-89]- ERROR - Exception occurred in cleanup b417b886-48f3-404c-bf89-838358239ec5 error: Playbook has failed tasks: {'virtualSpaceId': '0a2469b0-a68d-4f97-b6c3-63450654c7df', 'name': 11101, 'title': 'aws-6-296256999979', 'notes': '', 'created': {'at': '2024-01-09T22:49:44.768991+00:00', 'by': {'personCode': '0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804', 'name': '(missing)', 'email': '(missing)'}}, 'modified': {'at': '2024-01-09T22:49:44.768991+00:00', 'by': {'personCode': '0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804', 'name': '(missing)', 'email': '(missing)'}}, 'owner': {'orgName': 'FABRIC', 'orgPath': 'i2/netres/fabric', 'membershipCategory': 'Undefined'}, 'collaborators': [], 'gnocdbServiceId': None, 'virtualSwitchIds': [], 'virtualRouterIds': ['74fe0d81-2520-44bf-8dd0-6a8aa8998df1'], 'display': 'VNSPACE-11101', 'cloudConnections': {'aws': False, 'azure': False, 'gcp': False, 'oci': False}, 'ok': True, 'actions': ['vspace_add_collaborator', 'vspace_create', 'vspace_delete', 'vspace_remove_collaborator', 'vspace_retrieve', 'vspace_update']}
2024-01-09 22:54:16,045 - al2s-am-handler - {al2s_handler.py:293} - [TickEvent-89]- ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/fabric_am/handlers/al2s_handler.py", line 175, in create
    ansible_helper.run_playbook(playbook_path=playbook_path_full)
  File "/usr/local/lib/python3.11/site-packages/fabric_am/util/ansible_helper.py", line 229, in run_playbook
    raise PlaybookException(f"Playbook has failed tasks: {msg}")
fabric_am.util.ansible_helper.PlaybookException: Playbook has failed tasks: Status code was -1 and not [200]: Connection failure: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/fabric_am/handlers/al2s_handler.py", line 277, in __cleanup
    ansible_helper.run_playbook(playbook_path=playbook_path_full)
  File "/usr/local/lib/python3.11/site-packages/fabric_am/util/ansible_helper.py", line 229, in run_playbook
    raise PlaybookException(f"Playbook has failed tasks: {msg}")
fabric_am.util.ansible_helper.PlaybookException: Playbook has failed tasks: {'virtualSpaceId': '0a2469b0-a68d-4f97-b6c3-63450654c7df', 'name': 11101, 'title': 'aws-6-296256999979', 'notes': '', 'created': {'at': '2024-01-09T22:49:44.768991+00:00', 'by': {'personCode': '0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804', 'name': '(missing)', 'email': '(missing)'}}, 'modified': {'at': '2024-01-09T22:49:44.768991+00:00', 'by': {'personCode': '0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804', 'name': '(missing)', 'email': '(missing)'}}, 'owner': {'orgName': 'FABRIC', 'orgPath': 'i2/netres/fabric', 'membershipCategory': 'Undefined'}, 'collaborators': [], 'gnocdbServiceId': None, 'virtualSwitchIds': [], 'virtualRouterIds': ['74fe0d81-2520-44bf-8dd0-6a8aa8998df1'], 'display': 'VNSPACE-11101', 'cloudConnections': {'aws': False, 'azure': False, 'gcp': False, 'oci': False}, 'ok': True, 'actions': ['vspace_add_collaborator', 'vspace_create', 'vspace_delete', 'vspace_remove_collaborator', 'vspace_retrieve', 'vspace_update']}

2024-01-09 22:54:16,046 - al2s-am-handler - {al2s_handler.py:195} - [TickEvent-89]- ERROR - Playbook has failed tasks: {'virtualSpaceId': '0a2469b0-a68d-4f97-b6c3-63450654c7df', 'name': 11101, 'title': 'aws-6-296256999979', 'notes': '', 'created': {'at': '2024-01-09T22:49:44.768991+00:00', 'by': {'personCode': '0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804', 'name': '(missing)', 'email': '(missing)'}}, 'modified': {'at': '2024-01-09T22:49:44.768991+00:00', 'by': {'personCode': '0726-9d69-61e8-4c17-f9bd-f645-4f53-5cea-e4b1-6e8a-d515-2b73-f27e-c0ff-a311-3804', 'name': '(missing)', 'email': '(missing)'}}, 'owner': {'orgName': 'FABRIC', 'orgPath': 'i2/netres/fabric', 'membershipCategory': 'Undefined'}, 'collaborators': [], 'gnocdbServiceId': None, 'virtualSwitchIds': [], 'virtualRouterIds': ['74fe0d81-2520-44bf-8dd0-6a8aa8998df1'], 'display': 'VNSPACE-11101', 'cloudConnections': {'aws': False, 'azure': False, 'gcp': False, 'oci': False}, 'ok': True, 'actions': ['vspace_add_collaborator', 'vspace_create', 'vspace_delete', 'vspace_remove_collaborator', 'vspace_retrieve', 'vspace_update']}
2024-01-09 22:54:16,046 - al2s-am-handler - {al2s_handler.py:197} - [TickEvent-89]- INFO - Tried to delete VM duo to failure
2024-01-09 22:54:16,046 - al2s-am-handler - {al2s_handler.py:213} - [TickEvent-89]- ERROR - Playbook has failed tasks: Status code was -1 and not [200]: Connection failure: The read operation timed out
2024-01-09 22:54:16,046 - al2s-am-handler - {al2s_handler.py:214} - [TickEvent-89]- ERROR - Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/fabric_am/handlers/al2s_handler.py", line 175, in create
    ansible_helper.run_playbook(playbook_path=playbook_path_full)
  File "/usr/local/lib/python3.11/site-packages/fabric_am/util/ansible_helper.py", line 229, in run_playbook
    raise PlaybookException(f"Playbook has failed tasks: {msg}")
fabric_am.util.ansible_helper.PlaybookException: Playbook has failed tasks: Status code was -1 and not [200]: Connection failure: The read operation timed out

2024-01-09 22:54:16,046 - al2s-am-handler - {al2s_handler.py:216} - [TickEvent-89]- INFO - Create completed

Solution: I2 pushed a fix and it works fine so far

A problem I encountered in creating muiltiple slices simutaneously. The portal shows aws-14 stableOK but aws-5 stableError. But the log on al2sam server shows both slices created ok. That problem appears randomly over multiple tries.

Refer to 6

2024-01-12 07:20:40,770 - al2s-am-handler - {handler_base.py:53} - [TickEvent-85]- DEBUG - Reading config file: /etc/fabric/actor/config/al2s_handler_config.yml
2024-01-12 07:20:40,778 - al2s-am-handler - {al2s_handler.py:174} - [TickEvent-85]- DEBUG - Executing playbook /etc/fabric/actor/playbooks/al2s.yml to create Network Service
2024-01-12 07:21:16,332 - al2s-am-handler - {ansible_helper.py:225} - [TickEvent-85]- DEBUG - Playbook result: 0
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:231} - [TickEvent-85]- DEBUG - OK:
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-85]- INFO - {
    "ansible_env.HOME": "/root",
    "changed": false
}
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:233} - [TickEvent-85]- DEBUG - Failed:
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:235} - [TickEvent-85]- DEBUG - Unreachable:
2024-01-12 07:21:16,333 - al2s-am-handler - {al2s_handler.py:187} - [TickEvent-85]- INFO - network service aws-5-296256999979 was committed ok but without change
2024-01-12 07:21:16,333 - al2s-am-handler - {al2s_handler.py:216} - [TickEvent-85]- INFO - Create completed
2024-01-12 07:21:20,949 - al2s-am-handler - {handler_base.py:53} - [TickEvent-85]- DEBUG - Reading config file: /etc/fabric/actor/config/al2s_handler_config.yml
2024-01-12 07:21:20,954 - al2s-am-handler - {al2s_handler.py:174} - [TickEvent-85]- DEBUG - Executing playbook /etc/fabric/actor/playbooks/al2s.yml to create Network Service
2024-01-12 07:21:50,827 - al2s-am-handler - {ansible_helper.py:225} - [TickEvent-85]- DEBUG - Playbook result: 0
2024-01-12 07:21:50,827 - al2s-am-handler - {ansible_helper.py:231} - [TickEvent-85]- DEBUG - OK:
2024-01-12 07:21:50,827 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-85]- INFO - {
    "ansible_env.HOME": "/root",
    "changed": false
}
2024-01-12 07:21:50,828 - al2s-am-handler - {ansible_helper.py:233} - [TickEvent-85]- DEBUG - Failed:
2024-01-12 07:21:50,828 - al2s-am-handler - {ansible_helper.py:235} - [TickEvent-85]- DEBUG - Unreachable:
2024-01-12 07:21:50,828 - al2s-am-handler - {al2s_handler.py:187} - [TickEvent-85]- INFO - network service aws-14-296256999979 was committed ok but without change
2024-01-12 07:21:50,828 - al2s-am-handler - {al2s_handler.py:216} - [TickEvent-85]- INFO - Create completed
2024-01-12 07:20:40,770 - al2s-am-handler - {handler_base.py:53} - [TickEvent-85]- DEBUG - Reading config file: /etc/fabric/actor/config/al2s_handler_config.yml
2024-01-12 07:20:40,778 - al2s-am-handler - {al2s_handler.py:174} - [TickEvent-85]- DEBUG - Executing playbook /etc/fabric/actor/playbooks/al2s.yml to create Network Service
2024-01-12 07:21:16,332 - al2s-am-handler - {ansible_helper.py:225} - [TickEvent-85]- DEBUG - Playbook result: 0
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:231} - [TickEvent-85]- DEBUG - OK:
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-85]- INFO - {
    "ansible_env.HOME": "/root",
    "changed": false
}
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:233} - [TickEvent-85]- DEBUG - Failed:
2024-01-12 07:21:16,333 - al2s-am-handler - {ansible_helper.py:235} - [TickEvent-85]- DEBUG - Unreachable:
2024-01-12 07:21:16,333 - al2s-am-handler - {al2s_handler.py:187} - [TickEvent-85]- INFO - network service aws-5-296256999979 was committed ok but without change
2024-01-12 07:21:16,333 - al2s-am-handler - {al2s_handler.py:216} - [TickEvent-85]- INFO - Create completed
2024-01-12 07:21:20,949 - al2s-am-handler - {handler_base.py:53} - [TickEvent-85]- DEBUG - Reading config file: /etc/fabric/actor/config/al2s_handler_config.yml
2024-01-12 07:21:20,954 - al2s-am-handler - {al2s_handler.py:174} - [TickEvent-85]- DEBUG - Executing playbook /etc/fabric/actor/playbooks/al2s.yml to create Network Service
2024-01-12 07:21:50,827 - al2s-am-handler - {ansible_helper.py:225} - [TickEvent-85]- DEBUG - Playbook result: 0
2024-01-12 07:21:50,827 - al2s-am-handler - {ansible_helper.py:231} - [TickEvent-85]- DEBUG - OK:
2024-01-12 07:21:50,827 - al2s-am-handler - {ansible_helper.py:145} - [TickEvent-85]- INFO - {
    "ansible_env.HOME": "/root",
    "changed": false
}
2024-01-12 07:21:50,828 - al2s-am-handler - {ansible_helper.py:233} - [TickEvent-85]- DEBUG - Failed:
2024-01-12 07:21:50,828 - al2s-am-handler - {ansible_helper.py:235} - [TickEvent-85]- DEBUG - Unreachable:
2024-01-12 07:21:50,828 - al2s-am-handler - {al2s_handler.py:187} - [TickEvent-85]- INFO - network service aws-14-296256999979 was committed ok but without change
2024-01-12 07:21:50,828 - al2s-am-handler - {al2s_handler.py:216} - [TickEvent-85]- INFO - Create completed
zlion commented 6 months ago

The unresolved issues mentioned earlier do not appear in the latest tests. This could be attributed to changes in the code or the testing environment. Therefore, I propose closing this issue ticket for the time being.