fabric-testbed / fabfed

FABRIC Tool-based Federation Kit for a Testbed of Testbeds
MIT License

SENSE provider GCP case #29

Closed xi-yang closed 1 year ago

xi-yang commented 1 year ago

The SENSE provider in the develop branch fully works for the L2 DTN and AWS cases. With Liang finishing the GCPDriver refresh and support for the Interconnect service, we will add the GCP case to the provider.

ETC: end of April

xi-yang commented 1 year ago

@zlion A feature-sense_gcp_stitching-liang branch has been created for this development. I think the SENSE-O GCPDriver is usable for this already. While you can continue fixing a few minor things there, you can start developing its fabfed integration.

Try using a localhost deployment for the SENSE provider. The first step is to install fabfed with the necessary provider configs and exercise the SENSE AWS stitching workflow.

abessiari commented 1 year ago

@xi-yang cc: @zlion

I understand SENSE (GCP) is the producer and FABRIC is the consumer. What information do we need on the FABRIC side? That is, is everything set up, and do we follow the same convention as we did for AWS?

Recall that for AWS, creating the facility port with cloud=AWS maps to name=Cloud_Facility_AWS, site=AWS, device_name=agg4.ashb.net.internet2.edu, and local_name=HundredGigE0/0/0/7.

We also need the asn and the account_id for the peer labels. The bgp_key can be hardcoded. Thanks

abessiari commented 1 year ago

@xi-yang cc @zlion

Hi Xi, Can you share that sense service profile (GCP) that Liang is using? Thanks. aes

zlion commented 1 year ago

Here is the profile I am using in my local orchestrator.

{
  "data": {
    "parent": "urn:ogf:network:google.com:gcp-cloud",
    "gateways": [
      {
        "name": "Gateway 1",
        "connects": [
          {
            "authkey": "0xzsEwC7xk6c1fK_h.xHyAdx",
            "cloud_ip": "192.168.30.2/24",
            "customer_ip": "192.168.30.1/24",
            "customer_asn": "55038"
          }
        ],
        "type": "GCP Interconnect"
      }
    ],
    "cidr": "10.100.0.0/16",
    "subnets": [
      {
        "vpn_route_propagation": true,
        "name": "Subnet 1",
        "cidr": "10.100.0.0/24",
        "vms": [
          {
            "interfaces": [
              {
                "public": true,
                "type": "Ethernet"
              }
            ],
            "name": "VM-1"
          }
        ],
        "internet_routable": true
      }
    ]
  },
  "service": "vcn",
  "options": [
    "gcp-form"
  ]
}
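
As a rough illustration of what the FABRIC side might read out of a profile like this, here is a minimal sketch that pulls the BGP peering fields from each "GCP Interconnect" gateway. The helper name interconnect_params is hypothetical and is not part of fabfed or SENSE; the embedded JSON is a trimmed copy of the profile above.

```python
import ipaddress
import json

# Trimmed copy of the profile above; only the fields read below are kept.
PROFILE = """
{
  "data": {
    "parent": "urn:ogf:network:google.com:gcp-cloud",
    "gateways": [
      {
        "name": "Gateway 1",
        "connects": [
          {
            "authkey": "0xzsEwC7xk6c1fK_h.xHyAdx",
            "cloud_ip": "192.168.30.2/24",
            "customer_ip": "192.168.30.1/24",
            "customer_asn": "55038"
          }
        ],
        "type": "GCP Interconnect"
      }
    ]
  },
  "service": "vcn"
}
"""


def interconnect_params(profile: dict) -> list:
    """Collect BGP peering parameters from every 'GCP Interconnect' gateway."""
    found = []
    for gw in profile["data"].get("gateways", []):
        if gw.get("type") != "GCP Interconnect":
            continue
        for conn in gw.get("connects", []):
            cloud = ipaddress.ip_interface(conn["cloud_ip"])
            customer = ipaddress.ip_interface(conn["customer_ip"])
            found.append({
                "gateway": gw["name"],
                "customer_asn": int(conn["customer_asn"]),
                "cloud_ip": str(cloud),
                "customer_ip": str(customer),
                # Both ends of the peering are expected to share one subnet.
                "same_subnet": cloud.network == customer.network,
            })
    return found


params = interconnect_params(json.loads(PROFILE))
print(params)
```

The same_subnet flag is only a sanity check on the two peering addresses; it says nothing about whether the Interconnect itself is provisioned.
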
zlion commented 1 year ago

Reached out to Paul about a signup issue with "https://beta-4.fabric-testbed.net".

xi-yang commented 1 year ago

Please stay on top of these issues as a developer for both the SENSE/GCP interconnect and the AL2S AM.

zlion commented 1 year ago

@abessiari Created the service profile named "GCP-INTERCONN" in sense-o-dev web portal. GCP driver is added as well.

@xi-yang The sense-o-dev deployment is running an older version, and the service instance creation succeeded without running the GCPinterconnectStitching code.

[Screenshot: 2023-05-25, 4:00:20 PM]
zlion commented 1 year ago

Update on the signup issue with "https://beta-4.fabric-testbed.net" that I raised with Paul:

The process failed because recent CILogon updates broke the existing workflow. RENCI made a patch that let me work around the problem and complete the signup process. The signup is now pending their approval.

xi-yang commented 1 year ago

I will deploy the latest code to sense-o-dev on Friday. For now let's focus on fixing things on FABRIC end such as pairing key.

zlion commented 1 year ago
  • @zlion I think @abessiari is asking for the service profile name on sense-o-dev. You should create one, test it, and assign it to him.
  • Also, SENSE/GCP as the producer will need to pass the pairingKey to FABRIC. @zlion That is why I asked you to create an example config for the workflow and put it into the repository. Also, if the FABRIC AL2S AM does not yet support input of the pairing-key, add that support and work with Komal to update the fablib interface.

Please stay on top of these issues as a developer for both the SENSE/GCP interconnect and the AL2S AM.

Working on the AMHandlers to set up the AL2S connection for GCP.

To summarize, the AMhandler code remains unchanged to support GCP, except for the placement of the pairing key in the "account_id" field in the GCP scenario.

zlion commented 1 year ago

Updates on Fabfed:

xi-yang commented 1 year ago

@xi-yang will generate a gcp-template.json file to get a manifest from SENSE for the stitching. The pairing-key is the parameter needed for the stitching.

xi-yang commented 1 year ago

The template file has been merged into dev-gcp-stitch.

Branch feature-sense_gcp_stitching-liang deleted.

xi-yang commented 1 year ago

Next work:

  1. Work with Komal to add pairing-key label into AL2S sliver.
  2. Update and deploy the AL2S AM with support for the pairing-key
  3. Add pairing-key based stitching to the FABRIC provider in fabfed.
  4. Regression tests for both SENSE/AWS and SENSE/GCP workflows.
zlion commented 1 year ago

1&2: The OESS API reuses the "account_id" field for the pairing-key in the GCP case. So the AL2S AM code needs no change; we only pass the pairing-key as the "account_id" value when calling the handler.
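
To make the field reuse concrete, here is a minimal sketch of how peer labels could be assembled for the two clouds. The function build_peer_labels, the plain-dict label shape, and the sample values are all hypothetical; they are not the fabfed or AL2S AM interface, only an illustration of the pairing key travelling in "account_id".

```python
from typing import Optional


def build_peer_labels(cloud: str, asn: str,
                      account_id: Optional[str] = None,
                      pairing_key: Optional[str] = None) -> dict:
    """Return a plain dict of peer labels (hypothetical helper, not fabfed API)."""
    cloud = cloud.upper()
    if cloud == "GCP":
        if not pairing_key:
            raise ValueError("GCP stitching needs a pairing key")
        # Per the thread: OESS reuses "account_id" to carry the pairing key.
        return {"asn": asn, "account_id": pairing_key}
    if cloud == "AWS":
        if not account_id:
            raise ValueError("AWS stitching needs an account id")
        return {"asn": asn, "account_id": account_id}
    raise ValueError(f"unsupported cloud: {cloud}")


# Placeholder values for illustration only.
gcp_labels = build_peer_labels("GCP", asn="64512", pairing_key="EXAMPLE-PAIRING-KEY")
aws_labels = build_peer_labels("AWS", asn="64512", account_id="000000000000")
print(gcp_labels)
print(aws_labels)
```

Keeping one label shape for both clouds mirrors the point above: the handler call does not change, only the value placed in "account_id".
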

zlion commented 1 year ago
  1. There are some issues with "fablib", which have been reported to RENCI. For example,
2023-06-09 23:22:25,283 [fabric_slice.py:183] [INFO] Submitting request for slice test-gcp
2023-06-09 23:22:25,660 [slice.py:1905] [ERROR] Submit request error: return_status Status.FAILURE, slice_reservations: (415)
Reason: UNSUPPORTED MEDIA TYPE
HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.19.8', 'Date': 'Sat, 10 Jun 2023 04:22:25 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '151', 'Connection': 'keep-alive'})
HTTP response body: b'{\n  "detail": "Invalid Content-type (text/plain), expected JSON data",\n  "status": 415,\n  "title": "Unsupported Media Type",\n  "type": "about:blank"\n}\n'

2023-06-09 23:22:25,661 [controller.py:142] [ERROR] Submit request error: return_status Status.FAILURE, slice_reservations: (415)
Reason: UNSUPPORTED MEDIA TYPE
HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.19.8', 'Date': 'Sat, 10 Jun 2023 04:22:25 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '151', 'Connection': 'keep-alive'})
HTTP response body: b'{\n  "detail": "Invalid Content-type (text/plain), expected JSON data",\n  "status": 415,\n  "title": "Unsupported Media Type",\n  "type": "about:blank"\n}\n'
Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/controller/controller.py", line 139, in create
    provider.create_resource(resource=resource.attributes)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 160, in create_resource
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 157, in create_resource
    self.do_create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_provider.py", line 54, in do_create_resource
    self.slice.create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 281, in create_resource
    self._submit_and_wait()
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 198, in _submit_and_wait
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 184, in _submit_and_wait
    slice_id = self.slice_object.submit(wait=False)
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py", line 1908, in submit
    raise Exception(
Exception: Submit request error: return_status Status.FAILURE, slice_reservations: (415)
Reason: UNSUPPORTED MEDIA TYPE
HTTP response headers: HTTPHeaderDict({'Server': 'nginx/1.19.8', 'Date': 'Sat, 10 Jun 2023 04:22:25 GMT', 'Content-Type': 'application/problem+json', 'Content-Length': '151', 'Connection': 'keep-alive'})
HTTP response body: b'{\n  "detail": "Invalid Content-type (text/plain), expected JSON data",\n  "status": 415,\n  "title": "Unsupported Media Type",\n  "type": "about:blank"\n}\n'
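
For reading the failure quickly, the response body in the log above is a problem+json document and can be decoded directly. This is just a small sketch over the bytes copied from the log; it does not call fablib or the FABRIC API.

```python
import json

# Response body copied from the log above.
body = (b'{\n  "detail": "Invalid Content-type (text/plain), expected JSON data",'
        b'\n  "status": 415,\n  "title": "Unsupported Media Type",'
        b'\n  "type": "about:blank"\n}\n')

problem = json.loads(body)
summary = f'{problem["status"]} {problem["title"]}: {problem["detail"]}'
print(summary)
```

The detail field indicates the server received the request with a text/plain content type where it expected JSON.
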
zlion commented 1 year ago

Komal suggested changing the version to fabrictestbed-extensions==1.5.0, and with that the slice submission succeeds:

2023-06-14 09:35:48,188 [slice.py:1896] [INFO] Submit request success: return_status Status.OK, slice_reservations
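
A quick way to confirm which fabrictestbed-extensions version an environment actually has is to query the installed distribution metadata. This is a minimal sketch using the standard library; the helper installed_version is a made-up name, and the 1.5.0 value is simply the version reported to work in this thread.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


EXPECTED = "1.5.0"  # version reported to work in this thread
found = installed_version("fabrictestbed-extensions")
if found is None:
    print("fabrictestbed-extensions is not installed")
elif found != EXPECTED:
    print(f"found {found}; the thread reports success with {EXPECTED}")
else:
    print("version matches")
```
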

zlion commented 1 year ago

Now seeing an exception in the fabfed code.

2023-06-14 09:36:19,867 [controller.py:142] [ERROR] Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: list index out of range#
Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: list index out of range#

Traceback (most recent call last):
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/controller/controller.py", line 139, in create
    provider.create_resource(resource=resource.attributes)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 160, in create_resource
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/api/provider.py", line 157, in create_resource
    self.do_create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_provider.py", line 54, in do_create_resource
    self.slice.create_resource(resource=resource)
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 281, in create_resource
    self._submit_and_wait()
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 198, in _submit_and_wait
    raise e
  File "/Users/lzhang9/Projects/fabric-testbed/fabfed/fabfed/provider/fabric/fabric_slice.py", line 186, in _submit_and_wait
    self.slice_object.wait(progress=True)
  File "/Users/lzhang9/opt/anaconda3/envs/fabfed/lib/python3.9/site-packages/fabrictestbed_extensions/fablib/slice.py", line 1444, in wait
    raise Exception(str(exception_string))
Exception: Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: list index out of range#
Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: Slice Exception: Slice Name: test-gcp, Slice ID: dfa95458-2f85-4d54-9971-39539d83f2e2: list index out of range#
zlion commented 1 year ago

Komal reported as follows,

I attempted to create a slice by passing device_name='agg4.ashb.net.internet2.edu', region='us-east-4' and I see error from the AL2S AM handler.

failed lease update- all units failed priming: Exception during create for unit: 4e105d87-219d-420e-b880-327bb5faff93 (PlaybookException(Playbook has failed tasks: results: None, error_text: Invalid value for field region: . Must be a match of regex [a-z](?:[-a-z0-9]0,61[a-z0-9])? at /usr/share/perl5/vendor_perl/OESS/Cloud/GCP.pm line 391.n, error: 1), [(core1.loui.net.internet2.edu, HundredGigE0/0/0/24), (agg4.ashb.net.internet2.edu, Bundle-Ether110)])#all units failed priming: Exception during create for unit: 4e105d87-219d-420e-b880-327bb5faff93 (PlaybookException(Playbook has failed tasks: results: None, error_text: Invalid value for field region: . Must be a match of regex [a-z](?:[-a-z0-9]0,61[a-z0-9])? at /usr/share/perl5/vendor_perl/OESS/Cloud/GCP.pm line 391.n, error: 1), [(core1.loui.net.internet2.edu, HundredGigE0/0/0/24), (agg4.ashb.net.internet2.edu, Bundle-Ether110)])#
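
The region check in that error can be reproduced locally. The sketch below assumes the intended pattern is [a-z](?:[-a-z0-9]{0,61}[a-z0-9])?, reading the "0,61" in the log as "{0,61}" with the braces lost in transit; that reading is an assumption, not something confirmed by the OESS source. The function name is_valid_region is hypothetical.

```python
import re

# Assumed pattern: the "0,61" in the log is read as "{0,61}" with braces lost.
REGION_RE = re.compile(r"[a-z](?:[-a-z0-9]{0,61}[a-z0-9])?")


def is_valid_region(region: str) -> bool:
    """True if the whole string matches the assumed region name pattern."""
    return REGION_RE.fullmatch(region) is not None


for candidate in ("", "us-east4", "us-east-4", "US-EAST4"):
    print(repr(candidate), is_valid_region(candidate))
```

Note that the error text shows an empty value after "region:", which would fail this check regardless of what was passed to the slice request; that may mean the region never reached OESS. The check is purely syntactic and cannot tell whether a region actually exists (GCP region names are usually written like us-east4).
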

zlion commented 1 year ago

Looking into Komal's notebook "create_al2s.ipynb", I find that the labels must include "vlan" for the request to succeed.

labels=Labels(ipv4_subnet='192.168.30.1/24', device_name='agg4.ashb.net.internet2.edu', local_name='Bundle-Ether5', vlan='3'),

abessiari commented 1 year ago

For aws, we did not need to pass a vlan or a region. They were optional since day 1. And even today I was able to provision successfully ...

zlion commented 1 year ago

According to the OESS documentation (https://globalnoc.github.io/OESS/api/vrf), the VLAN tag is a required parameter. Per Komal, the Fabric-CF automatically picks one if it is not present. However, with the notebook "create_al2s.ipynb", I successfully provisioned the AL2S when passing the VLAN parameter. I need to investigate the code.

xi-yang commented 1 year ago

Yes, in the GCP case FABRIC is the consumer end. It will need a specific VLAN tag from SENSE/GCP. We will need to extract that from the SENSE model using the manifest template.

zlion commented 1 year ago

Komal verified that the VLAN is needed when calling Fabric-CF. She's checking the code for the reason.

zlion commented 1 year ago

Here is the latest update from Komal.

okay, i debugged this and found that it's not a CF bug, but specifically VLAN=2 seems to always fail. Automatic vlan allocation via CF code works for all other values of vlan. Should the vlan range in the AL2S.graphml be updated to not include vlan=2 in the range?

@xi-yang Could you update that AL2S.graphml accordingly?

xi-yang commented 1 year ago

After a second look at how the VLAN attachment works: the VLAN is not provided by GCP but is set automatically by the partner network config. In this case, the request to the AL2S sliver can leave the VLAN unspecified; whatever the FABRIC-CF picks will be accepted by GCP. Only the pairing key matters.

So no change for the manifest template.

xi-yang commented 1 year ago

I manually removed VLAN 2 from all ports. This is a temporary solution. We need to investigate why VLAN 2 did not work as it is in the range I2 gave us.

@zlion When doing the manual editing, I noticed some weird VLAN range strings like ["1-4095", "2-2"] . Since you did the API part of the OESS scanner, can you take a look at that?
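
As one way to look at range strings like ["1-4095", "2-2"], the sketch below normalizes a list of VLAN range strings by merging overlaps, and then carves a single VLAN out of the result. All function names are hypothetical; this is not the OESS scanner code, just an illustration of the cleanup involved.

```python
def parse_ranges(specs):
    """Turn strings like "1-4095" or "7" into sorted, merged (lo, hi) tuples."""
    spans = []
    for spec in specs:
        lo, _, hi = spec.partition("-")
        lo = int(lo)
        hi = int(hi) if hi else lo
        if lo > hi:
            raise ValueError(f"bad VLAN range: {spec!r}")
        spans.append((lo, hi))
    merged = []
    for lo, hi in sorted(spans):
        if merged and lo <= merged[-1][1] + 1:
            # Overlapping or adjacent: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def exclude_vlan(spans, vlan):
    """Remove one VLAN id from merged spans, splitting a span if needed."""
    out = []
    for lo, hi in spans:
        if not lo <= vlan <= hi:
            out.append((lo, hi))
            continue
        if lo < vlan:
            out.append((lo, vlan - 1))
        if vlan < hi:
            out.append((vlan + 1, hi))
    return out


def format_ranges(spans):
    """Render (lo, hi) tuples back into "lo-hi" strings."""
    return [f"{lo}-{hi}" for lo, hi in spans]


merged = parse_ranges(["1-4095", "2-2"])
without_2 = exclude_vlan(merged, 2)
print(format_ranges(merged))
print(format_ranges(without_2))
```

Merging first shows that "2-2" is fully contained in "1-4095", so the odd pair collapses to one range before VLAN 2 is excluded.
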

zlion commented 1 year ago

One caveat in the testing is the configuration of the FABRIC slice, which took me a long time to figure out.

zlion commented 1 year ago

@xi-yang

On the fabric node, we are not able to add a route to the GCP network. See the error message below.

[rocky@ad442f3f-a3c1-4c87-86e7-d5ae862e8f97-fabric-node0 ~]$ sudo ip route add 10.200.0.0/16 via 192.168.10.1
Error: Nexthop has invalid gateway.

We also see a red arrow in the status shown in the AL2S diagram. [Screenshot: 2023-06-20, 2:07:05 PM]

xi-yang commented 1 year ago

The data interface on the VM must be configured with an address before you can add a route.

ip addr add 192.168.10.2/24 dev eth3
ip route add 10.200.0.0/16 via 192.168.10.1

I can ping 192.168.10.1 but cannot ping 10.200.1.2, assuming test-gcp-gcp-net is the SENSE service instance for the stitched GCP resources.
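
The reason the address has to come first can be sketched as a simple on-link check: the next hop must fall inside a subnet configured on some local interface, otherwise the kernel has no way to reach it. The helper gateway_is_on_link is a hypothetical name, and this only approximates the kernel's behavior with a subnet membership test.

```python
import ipaddress


def gateway_is_on_link(gateway, interface_addrs):
    """True if the gateway lies in the subnet of any configured interface address."""
    gw = ipaddress.ip_address(gateway)
    return any(gw in ipaddress.ip_interface(addr).network
               for addr in interface_addrs)


# Before "ip addr add": no address on the data interface.
print(gateway_is_on_link("192.168.10.1", []))
# After "ip addr add 192.168.10.2/24": the gateway is in a connected subnet.
print(gateway_is_on_link("192.168.10.1", ["192.168.10.2/24"]))
```
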

abessiari commented 1 year ago

@xi-yang @zlion FYI: We used dev eth1 instead before attempting to add the route.

zlion commented 1 year ago

@xi-yang @abessiari

It seems to me the route is not set up correctly along the path to GCP. I sent an inquiry to Internet2 engineers to check whether they can provide more insight. Please also advise on ways to debug this case.

[rocky@ad442f3f-a3c1-4c87-86e7-d5ae862e8f97-fabric-node0 ~]$ traceroute 10.200.1.2
traceroute to 10.200.1.2 (10.200.1.2), 30 hops max, 60 byte packets
 1  192.168.10.1 (192.168.10.1)  0.816 ms !N

xi-yang commented 1 year ago

The workflow works. Some small issues will be tracked in #49