esnet / sense-rtmon

Custom Scripts for Dynamic End-To-End Flow-Specific Grafana Dashboards

Integration tests for multi-site multi-flow scenarios #51

Closed: xi-yang closed this issue 2 weeks ago

xi-yang commented 8 months ago

To complete the first milestone for production deployment, we want to support the following scenarios with stable workflows.

  1. [x] Single flow with both source and destination hosts and zero to multiple intermediate switches.
    • [x] 1a. Dashboard template queries need to refresh to get per-flow data per node/port correctly.
    • [x] 1b. L2 Debugging Panel Failing to Collect Data
  2. [x] Rapid Storage Consumption
  3. [ ] Flows with non-SENSE-RM middle nodes (no monitoring) @PannuMuthu #69
  4. [x] Multiple active flows with both source and destination hosts and zero to multiple intermediate switches. @sunami09 #65
  5. [x] Multiple flows with changing/dynamic states and termination. @sunami09 #67
  6. [x] Flows with IP on only the source or destination host @sunami09 #90
  7. [ ] Flows with only a source or destination host @sunami09
  8. [ ] Flows with neither host @sunami09

ETC: 1: by first week of November 2023; 2, 3, 4: by mid-December 2023; 5, 6, 7: by end of January 2024.

sunami09 commented 8 months ago
Clean Slate Environment

hostIP: 172.31.72.189

ssl_certificate: '/etc/pki/tls/certs/sense-mon_es_net_fullchain.cer'  # (fullchain)
ssl_certificate_key: '/etc/pki/tls/private/sense-mon.key'  # (privkey)

ssl_certificate_key: '/root/sense-o.es.net-ssl/sense-o.key'
ssl_certificate: '/root/sense-o.es.net-ssl/sense-o.crt'
grafana_host: 'http://dev2.virnao.com:3000'
pushgateway: 'http://dev2.virnao.com:9091'
grafana_username: 'admin'
grafana_password: 'admin'
grafana_api_token: "Bearer eyJrIjoiT05BSkJWakFmUkxCaDVadU0wYVhkdEdZc3ZBWng2bGEiLCJuIjoiZCIsImlkIjoxfQ=="
siterm_url_map:
  "urn:ogf:network:nrp-nautilus.io:2020": https://sense-prpdev-fe.sdn-lb.ultralight.org/T2_US_SDSC/sitefe/json/frontend
  "urn:ogf:network:ultralight.org:2013": https://sense-caltech-fe.sdn-lb.ultralight.org/T2_US_Caltech_Test/sitefe/json/frontend
  "urn:ogf:network:sc-test.cenic.net:2020": https://sense-ladowntown-fe.sdn-lb.ultralight.org/NRM_CENIC/sitefe/json/frontend

sunami09 commented 8 months ago
{
  "flow": "rtmon-4700d4e0-bb7d-4a30-9736-91fa5f2f1852 ",
  "title": "RTMON",
  "grafana_host": "http://dev2.virnao.com:3000",
  "pushgateway": "http://dev2.virnao.com:9091",
  "grafana_api_token": "Bearer eyJrIjoidDVlQ3N4U01BWDc3bE5RVjVlWkxYcnpwQUkyRWFsV1ciLCJuIjoiMTAvMTNfMTA6MDciLCJpZCI6MX0=",
  "node": [
    {
      "name": "T2_US_Caltech_Test:dellos9_s0",
      "type": "switch",
      "runtime": 610,
      "interface": [
        {
          "name": "hundredGigE_1-23",
          "vlan": 3873,
          "peer": [
            {
              "name": null,
              "interface": null,
              "vlan": null
            }
          ]
        },
        {
          "name": "hundredGigE_1-27",
          "vlan": 3873,
          "peer": [
            {
              "name": null,
              "interface": null,
              "vlan": null
            }
          ]
        }
      ]
    }
  ]
}
sunami09 commented 8 months ago

@xi-yang @PannuMuthu @juztas The script is failing to create the manifest. I'm getting this response:

<html><head><title>Error</title></head><body>Internal Server Error</body></html>

When I run these lines:

# Assumes the SENSE-O Python client (sense-o-py-client) is installed.
import json

from sense.client.workflow_combined_api import WorkflowCombinedApi

workflowApi = WorkflowCombinedApi()
workflowApi.si_uuid = instance['referenceUUID']  # service instance UUID of the flow
response = workflowApi.manifest_create(json.dumps(template))
print(response)

Is this because the Caltech SENSE-RM is down? @xi-yang
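If the 500 is transient (e.g. the RM briefly unreachable), a retry wrapper around the same call can help tell a flaky RM apart from a persistent failure. A minimal sketch, assuming the HTML error page above is what manifest_create returns on failure:

import json
import time

def create_manifest_with_retry(workflow_api, template, attempts=3, wait=30):
    """Retry manifest_create; a persistent 500 points at the RM side being down."""
    for i in range(attempts):
        response = workflow_api.manifest_create(json.dumps(template))
        if '<title>Error</title>' not in str(response):
            return response
        print(f'attempt {i + 1} failed, retrying in {wait}s')
        time.sleep(wait)
    raise RuntimeError('manifest_create kept returning Internal Server Error')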

sunami09 commented 8 months ago

@juztas

I'm getting this message when I'm doing a POST request to that URL: [error screenshot]

This is the data:


    {
      "hostname": "aristaeos_s0",
      "hosttype": "switch",
      "type": "prometheus-push",
      "metadata": {
        "instance": "NRM_CENIC:aristaeos_s0",
        "flow": "rtmon-4700d4e0-bb7d-4a30-9736-91fa5f2f1852"
      },
      "gateway": "http://dev2.virnao.com:9091",
      "runtime": "1697821974",
      "resolution": "5"
    }
juztas commented 8 months ago

See the error_description: the requested runtime (in epoch) is not within the allowed range of 600 < x < 3600 seconds. The idea is that we don't want to let this run endlessly, so we need limits. Reasonable increases are possible.
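As a client-side pre-check, the requested runtime could be validated before the POST. A minimal sketch, assuming the window is measured relative to the current time (my reading of the error, not a documented contract):

import time

MIN_RUNTIME = 600    # seconds, lower bound quoted above
MAX_RUNTIME = 3600   # seconds, upper bound quoted above

def validate_runtime(runtime_epoch: int) -> int:
    """Reject runtime end times outside the 600-3600 s window from now."""
    delta = runtime_epoch - int(time.time())
    if not MIN_RUNTIME < delta < MAX_RUNTIME:
        raise ValueError(
            f"runtime must end {MIN_RUNTIME}-{MAX_RUNTIME}s from now, got {delta}s")
    return runtime_epoch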

sunami09 commented 8 months ago

@xi-yang @PannuMuthu These two lines to run the script-exporter are not working:

os.system('yes | cp -rfa se_config/. script_exporter/examples')
os.system('yes | docker rm -f $(docker ps -a --format "{{.Names}}" | grep "script_exporter")')

Here are all the containers:

[root@ip-172-31-72-189 cloud]# docker ps -a
CONTAINER ID   IMAGE                               COMMAND                  CREATED       STATUS                    PORTS     NAMES
d95177c98261   prom/pushgateway:latest             "/bin/pushgateway"       2 weeks ago   Exited (0) 36 hours ago             cloud_pushgateway.1.ycs99cnaxby8qj4yu77if7s3t
fd3f8fd5187f   prom/prometheus:v2.2.1              "/bin/prometheus --c…"   2 weeks ago   Exited (0) 36 hours ago             cloud_prometheus.1.on9ceva0uvgc2xtu6f37iat11
bf4927435a89   grafana/grafana-enterprise:latest   "/run.sh"                3 weeks ago   Exited (0) 36 hours ago             cloud_grafana.1.o6f8ac6bqi1uwhcrwjt2vppe7
7154bfe60402   prom/pushgateway:latest             "/bin/pushgateway"       3 weeks ago   Exited (0) 36 hours ago             cloud_pushgateway.1.d0jch6coxozqvxfx2vanunrcc
26a8e6608059   prom/prometheus:v2.2.1              "/bin/prometheus --c…"   3 weeks ago   Exited (0) 36 hours ago             cloud_prometheus.1.56h5wc5hc0fj0swi5winyhsd0

The output of docker ps -a shows that there are no containers whose names include "script_exporter", so the grep returns nothing and the docker rm -f command fails.
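A defensive variant (my own sketch, not the repo's current code) only attempts removal when a matching container actually exists, so an empty grep result no longer breaks the command:

import subprocess

# List every container name, then filter for script_exporter in Python.
result = subprocess.run(
    ['docker', 'ps', '-a', '--format', '{{.Names}}'],
    capture_output=True, text=True, check=True)
names = [n for n in result.stdout.splitlines() if 'script_exporter' in n]
if names:
    # Force-remove only the matching containers.
    subprocess.run(['docker', 'rm', '-f', *names], check=True)
else:
    print('no script_exporter containers to remove')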

sunami09 commented 7 months ago

@xi-yang @PannuMuthu

Fixes for L2 Debugging Panel and Storage Issues

Issue 1: L2 Debugging Panel Failing to Collect Data

Resolution: We identified an issue where data scraping from the pushgateway was attempted before the data was actually pushed, leading to failures in data collection. This was rectified by reversing the order of operations: data is now pushed to the pushgateway first, followed by a 20-second delay to allow the data to appear. After the delay, the l2debugging.sh script successfully scrapes the required data.

print("Data Dispatched")
time.sleep(20)
os.system("python3 ./se_config/generate_script.py flow.yaml")
print("Scraping Script Exporter")
os.system("./l2debugging.sh")

Issue 2: Rapid Storage Consumption

Resolution: Initially, I observed that running a single flow for 12 hours consumed the entire 40 GB of storage space. To address this:

  1. Prometheus Data Retention Adjustment: I updated the Prometheus configuration to reduce the data retention period to 6 hours using the following settings in docker-stack.yml:

    command:
     - '--config.file=/etc/prometheus/prometheus.yml'
     - '--storage.tsdb.retention=6h'  

    This change effectively reduced the storage footprint of Prometheus metrics.

  2. Log File Rotation: Subsequent checks after 36 hours of running a single flow pinpointed log files as the major space consumers. To solve this, I implemented log rotation with the following settings (see the combined service sketch below):

    logging:
      options:
        max-size: "10m"
        max-file: "3"

    This ensures that each log file is capped at 10 MB and only the three most recent files are retained, preventing excessive space utilization by logs.

Outcome: With these adjustments, running five flows simultaneously for three days resulted in a total space usage of only 2 GB out of the available 40 GB, demonstrating a significant improvement in resource management.
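For context, here is how both settings could sit together in one service definition in docker-stack.yml. This is a sketch: the service name, image tag, and volume path follow the container list earlier in this thread, not necessarily the repo's actual file:

services:
  prometheus:
    image: prom/prometheus:v2.2.1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention=6h'   # drop metrics older than 6 hours
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml   # path is an assumption
    logging:
      options:
        max-size: "10m"   # cap each log file at 10 MB
        max-file: "3"     # keep only the three newest files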

sunami09 commented 7 months ago
  1. We need to run the l2debugging script every cycle (3 minutes).
  2. Set up a meeting with @xi-yang for iperf, preferably tomorrow.
  3. Update generate_script.py for switches.
xi-yang commented 6 months ago

@sunami09 A few items to follow up for this week:

  1. Clean up and remove old code branches.
  2. Test the workflow branch with an all-container setup.
  3. Update the README document and ask others, say @abessiari @xi-yang, to try it.
xi-yang commented 5 months ago

@sunami09 I created two service profiles for Integration Tests 6 and 7. Note that I updated the test scenarios.

xi-yang commented 3 months ago

@xi-yang will provide service profiles for test scenarios 7 and 8.

xi-yang commented 2 months ago

@xi-yang will verify the manifest retrieval for different scenarios to make sure the manifest contains all the required hops.

juztas commented 2 weeks ago

I am closing this one too. Let's open new issues based on what is running right now on AutoGOLE monitoring. Most of these things were tested previously.