esnet / sense-rtmon

Custom Scripts for Dynamic End-To-End Flow-Specific Grafana Dashboards

Cloud stack SENSE-O flow metadata exchange and handling #28

Closed xi-yang closed 1 year ago

xi-yang commented 1 year ago

@PannuMuthu A solution has been applied to the SENSE-O to provide per-flow metadata via the Discovery API and Service API endpoints. Example code is under the cloud/orchestrator folder.

Fetching the metadata takes two steps: I. Fetch the list of instance metadata with a designated SENSE-O account (say sense-rtmon) using a Discovery API endpoint. SENSE-O is responsible for marking the service instances eligible for monitoring as visible to this account. II. Use a Service API endpoint to retrieve each instance manifest, using a custom template that will be filled with all the parameters required by the flow config.

For Task 2, we need a controller daemon that loops through the above steps. We can use either a cached copy of the previously retrieved list or the timestamp within the instance metadata to determine which instances are new and should kick off a workflow.
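One iteration of that controller loop could look like the sketch below. The two fetch callables stand in for the Discovery API and Service API calls (their real endpoints and signatures are not shown in this thread), and the `seen` set plays the role of the cached list of already-processed instances:

```python
def poll_new_instances(fetch_list, fetch_manifest, seen):
    """One controller-daemon iteration (a sketch, not the actual client).

    fetch_list: callable returning the Discovery API instance list,
        each entry a dict with at least a "uuid" key (assumed shape).
    fetch_manifest: callable taking a UUID and returning that instance's
        manifest via the Service API.
    seen: set of UUIDs already dispatched to a workflow (the cache).
    Returns the manifests of instances not seen before.
    """
    new_manifests = {}
    for inst in fetch_list():
        uuid = inst["uuid"]
        if uuid in seen:
            continue  # already kicked off a workflow for this instance
        new_manifests[uuid] = fetch_manifest(uuid)
        seen.add(uuid)
    return new_manifests
```

The daemon would call this on a timer; using the metadata timestamp instead of a cached set would change only the `if uuid in seen` test.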

xi-yang commented 1 year ago

A SENSE-O manifest query retrieves a JSON structure like the one below. Due to limitations of the manifest templating, it is not organized exactly like the SENSE-RTMON config structure, but all the information is in there. So part of Task 2.1 is to translate this into the required format.

Also note that this example instance has only one host at each end. We want to accommodate the cases with 2, 1, and 0 hosts. How to handle the actual flow config with 0 or 1 host (for example, how to present them in the dashboard) can be left to future tasks.

{
  "Ports": [
    {
      "Port": "urn:ogf:network:ultralight.org:2013:dellos9_s0:hundredGigE_1-10",
      "Node": "T2_US_Caltech_Test:dellos9_s0",
      "Peer": "?peer?",
      "Host": [
        {
          "IPv4": "10.251.87.10/24",
          "Interface": "mlx4p2s1",
          "Name": "T2_US_Caltech_Test:sandie-1.ultralight.org"
        }
      ],
      "Vlan": "3873",
      "Name": "hundredGigE 1/10"
    },
    {
      "Port": "urn:ogf:network:ultralight.org:2013:dellos9_s0:Port-channel_103",
      "Node": "T2_US_Caltech_Test:dellos9_s0",
      "Peer": "urn:ogf:network:sc-test.cenic.net:2020:aristaeos_s0:Port-Channel501",
      "Vlan": "3873",
      "Name": "Port-channel 103"
    },
    {
      "Port": "urn:ogf:network:nrp-nautilus.io:2020:sn3700_s0:Ethernet108",
      "Node": "T2_US_SDSC:sn3700_s0",
      "Peer": "?peer?",
      "Host": [
        {
          "IPv4": "10.251.87.11/24",
          "Interface": "enp65s0np0",
          "Name": "T2_US_SDSC:k8s-gen4-02.sdsc.optiputer.net"
        }
      ],
      "Vlan": "3873",
      "Name": "Ethernet108"
    },
    {
      "Port": "urn:ogf:network:nrp-nautilus.io:2020:sn3700_s0:PortChannel501",
      "Node": "T2_US_SDSC:sn3700_s0",
      "Peer": "urn:ogf:network:sc-test.cenic.net:2020:aristaeos_s0:Port-Channel502",
      "Vlan": "3873",
      "Name": "PortChannel501"
    },
    {
      "Port": "urn:ogf:network:sc-test.cenic.net:2020:aristaeos_s0:Port-Channel502",
      "Node": "NRM_CENIC:aristaeos_s0",
      "Peer": "urn:ogf:network:nrp-nautilus.io:2020:sn3700_s0:PortChannel501",
      "Vlan": "3873",
      "Name": "Port-Channel502"
    },
    {
      "Port": "urn:ogf:network:sc-test.cenic.net:2020:aristaeos_s0:Port-Channel501",
      "Node": "NRM_CENIC:aristaeos_s0",
      "Peer": "urn:ogf:network:ultralight.org:2013:dellos9_s0:Port-channel_103",
      "Vlan": "3873",
      "Name": "Port-Channel501"
    }
  ]
}
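Part of the Task 2.1 translation then amounts to regrouping this flat Ports list by switch node and collecting the attached hosts. A minimal sketch of that regrouping (the output layout here is illustrative, not the actual SENSE-RTMON schema):

```python
from collections import defaultdict

def ports_to_rtmon_config(manifest):
    """Regroup the flat SENSE-O Ports list by switch Node.

    Unresolved placeholder fields in the ?xxx? format (e.g. "?peer?")
    are dropped, as the thread says they should be ignored. The
    resulting {node: {interfaces, hosts}} layout is a sketch of the
    SENSE-RTMON config, not its exact structure.
    """
    nodes = defaultdict(lambda: {"interfaces": [], "hosts": []})
    for port in manifest["Ports"]:
        node = nodes[port["Node"]]
        # keep per-interface fields, minus the grouping key, the host
        # list, and any unresolved ?xxx? placeholders
        iface = {k: v for k, v in port.items()
                 if k not in ("Node", "Host")
                 and not (isinstance(v, str) and v.startswith("?"))}
        node["interfaces"].append(iface)
        node["hosts"].extend(port.get("Host", []))
    return dict(nodes)
```

Run against the example above, this yields one entry each for the Caltech, SDSC, and CENIC switches, with the host records attached to the first two.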
xi-yang commented 1 year ago

Justas made the required model change to the SiteRM. The manifest template has been updated to reflect the change, and the JSON output example above has been updated accordingly.

The example above is complete now. Note that any field in the ?xxx? format should be ignored.

xi-yang commented 1 year ago

@PannuMuthu Any update on subtasks 2 and 3? A brief description of the design would be good.

In the meeting today, we also asked to include the service instance UUID in the per-device metadata breakdown dispatched to the SiteRM.

PannuMuthu commented 1 year ago

Subtask 2 (create a cloud stack controller to fetch the SENSE-O metadata) has been implemented and tested on sample SENSE-O manifests.

Subtask 3 (cloud stack workflow handling of per-flow metadata) is in progress but requires certain modifications/additions to the SENSE-O metadata manifest. Namely, the user would specify which flow each port in the manifest belongs to, in order to perform per-flow, system-specific SENSE-RTMON config translation.

Assuming this modification has been implemented in SENSE-O manifest creation, I have implemented per-flow SENSE-O --> SENSE-RTMON config translation in orchestratorConfigConvert.py. I tested per-flow translation of multiple flows in a single manifest on two sample test cases: multiFlowOrdered.json and multiFlowUnOrdered.json.

xi-yang commented 1 year ago
> 2.2. Persistent (file or DB) and in-memory data structure for processed metadata: Modified JSON Exporter created with configExporter.py stores SENSE-RTMON converted configs to hosted Prometheus metrics page and monitors config modifications in order to propagate changes to SiteRM.

How does a hosted Prometheus metrics page become a data store for the converted per-flow, per-site instrumentation configs? If the data has been generated, it is meant to be dispatched to the SiteRM. Why do we need to store it in Prometheus?

ZhenboYan commented 1 year ago

@PannuMuthu, I inspected the orchestrator-to-configuration-file conversion process. In the final configuration files, ping is no longer needed under each host. For all host-type nodes, you need arp: on by default to indicate whether to run an ARP exporter. Lastly, a runtime is needed for all nodes; the default could be runtime: 610.

In the general section, flow should be populated by the orchestrator; grafana_host, pushgateway, and grafana_api_token need to be populated during the conversion process.
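Taken together, the changes described in this comment would make the converted config look roughly like the fragment below. This is a sketch assembled from the thread, not the verified SENSE-RTMON schema; key nesting and placeholder values are assumptions:

```yaml
# Sketch of a converted per-flow config after the requested changes.
general:
  flow: <populated by the orchestrator>
  grafana_host: <filled in during conversion>
  pushgateway: <filled in during conversion>
  grafana_api_token: <filled in during conversion>
hosts:
  - name: T2_US_Caltech_Test:sandie-1.ultralight.org
    interface: mlx4p2s1
    # ping removed: no longer needed under each host
    arp: on        # run the ARP exporter by default on host-type nodes
    runtime: 610   # default runtime, required for all nodes
```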