ssl_certificate_key: '/root/sense-o.es.net-ssl/sense-o.key'
ssl_certificate: '/root/sense-o.es.net-ssl/sense-o.crt'
grafana_host: 'http://dev2.virnao.com:3000'
pushgateway: 'http://dev2.virnao.com:9091'
grafana_username: 'admin'
grafana_password: 'admin'
grafana_api_token: "Bearer eyJrIjoiT05BSkJWakFmUkxCaDVadU0wYVhkdEdZc3ZBWng2bGEiLCJuIjoiZCIsImlkIjoxfQ=="
siterm_url_map:
  "urn:ogf:network:nrp-nautilus.io:2020": https://sense-prpdev-fe.sdn-lb.ultralight.org/T2_US_SDSC/sitefe/json/frontend
  "urn:ogf:network:ultralight.org:2013": https://sense-caltech-fe.sdn-lb.ultralight.org/T2_US_Caltech_Test/sitefe/json/frontend
  "urn:ogf:network:sc-test.cenic.net:2020": https://sense-ladowntown-fe.sdn-lb.ultralight.org/NRM_CENIC/sitefe/json/frontend
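For reference, a minimal sketch of consuming this config from Python (the filename and the use of PyYAML are my assumptions, not part of the thread):

import yaml  # PyYAML

# Load the monitoring config shown above (filename assumed).
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Resolve a site URN to its SiteRM frontend endpoint.
fe_url = cfg["siterm_url_map"]["urn:ogf:network:ultralight.org:2013"]
print(fe_url)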
{
  "flow": "rtmon-4700d4e0-bb7d-4a30-9736-91fa5f2f1852",
  "title": "RTMON",
  "grafana_host": "http://dev2.virnao.com:3000",
  "pushgateway": "http://dev2.virnao.com:9091",
  "grafana_api_token": "Bearer eyJrIjoidDVlQ3N4U01BWDc3bE5RVjVlWkxYcnpwQUkyRWFsV1ciLCJuIjoiMTAvMTNfMTA6MDciLCJpZCI6MX0=",
  "node": [
    {
      "name": "T2_US_Caltech_Test:dellos9_s0",
      "type": "switch",
      "runtime": 610,
      "interface": [
        {
          "name": "hundredGigE_1-23",
          "vlan": 3873,
          "peer": [
            {
              "name": null,
              "interface": null,
              "vlan": null
            }
          ]
        },
        {
          "name": "hundredGigE_1-27",
          "vlan": 3873,
          "peer": [
            {
              "name": null,
              "interface": null,
              "vlan": null
            }
          ]
        }
      ]
    }
  ]
}
@xi-yang @PannuMuthu @juztas The script is failing to create the manifest. I'm getting this response:
<html><head><title>Error</title></head><body>Internal Server Error</body></html>
when I run these lines:

import json
from sense.client.workflow_combined_api import WorkflowCombinedApi  # sense-o-api client (import path assumed)

# 'instance' and 'template' come from earlier in the script;
# 'template' is the manifest template shown above.
workflowApi = WorkflowCombinedApi()
workflowApi.si_uuid = instance['referenceUUID']
response = workflowApi.manifest_create(json.dumps(template))
print(response)
Is this because the Caltech SENSE-RM is down? @xi-yang
@juztas
I'm getting this message when I send a POST request to that URL. This is the data:
{
  "hostname": "aristaeos_s0",
  "hosttype": "switch",
  "type": "prometheus-push",
  "metadata": {
    "instance": "NRM_CENIC:aristaeos_s0",
    "flow": "rtmon-4700d4e0-bb7d-4a30-9736-91fa5f2f1852"
  },
  "gateway": "http://dev2.virnao.com:9091",
  "runtime": "1697821974",
  "resolution": "5"
}
See the error_description: your requested runtime (an epoch timestamp) is not within the allowed range of 600 < x < 3600 seconds from now. The idea is that we don't want to let this run endlessly, so we need limits. Reasonable increases are possible.
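For what it's worth, a minimal sketch of building a compliant payload (the 30-minute offset is an arbitrary choice inside the window; the field values are copied from the request above):

import time

# Request a runtime 30 minutes from now, inside the allowed 600-3600 s window.
runtime = int(time.time()) + 1800

payload = {
    "hostname": "aristaeos_s0",
    "hosttype": "switch",
    "type": "prometheus-push",
    "metadata": {
        "instance": "NRM_CENIC:aristaeos_s0",
        "flow": "rtmon-4700d4e0-bb7d-4a30-9736-91fa5f2f1852",
    },
    "gateway": "http://dev2.virnao.com:9091",
    "runtime": str(runtime),
    "resolution": "5",
}
print(payload)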
@xi-yang @PannuMuthu These two lines for running the script-exporter are not working:
os.system('yes | cp -rfa se_config/. script_exporter/examples')
os.system('yes | docker rm -f $(docker ps -a --format "{{.Names}}" | grep "script_exporter")')
All the containers:
[root@ip-172-31-72-189 cloud]# docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
d95177c98261 prom/pushgateway:latest "/bin/pushgateway" 2 weeks ago Exited (0) 36 hours ago cloud_pushgateway.1.ycs99cnaxby8qj4yu77if7s3t
fd3f8fd5187f prom/prometheus:v2.2.1 "/bin/prometheus --c…" 2 weeks ago Exited (0) 36 hours ago cloud_prometheus.1.on9ceva0uvgc2xtu6f37iat11
bf4927435a89 grafana/grafana-enterprise:latest "/run.sh" 3 weeks ago Exited (0) 36 hours ago cloud_grafana.1.o6f8ac6bqi1uwhcrwjt2vppe7
7154bfe60402 prom/pushgateway:latest "/bin/pushgateway" 3 weeks ago Exited (0) 36 hours ago cloud_pushgateway.1.d0jch6coxozqvxfx2vanunrcc
26a8e6608059 prom/prometheus:v2.2.1 "/bin/prometheus --c…" 3 weeks ago Exited (0) 36 hours ago cloud_prometheus.1.56h5wc5hc0fj0swi5winyhsd0
The output of docker ps -a shows that there are no containers with names that include "script_exporter". That is why the second command fails: grep matches nothing, so docker rm -f receives no container names.
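A minimal sketch of a guarded cleanup that avoids this failure (same name match as above; using subprocess instead of os.system is my substitution):

import subprocess

# List all container names, then remove only those matching "script_exporter".
out = subprocess.run(
    ["docker", "ps", "-a", "--format", "{{.Names}}"],
    capture_output=True, text=True, check=True,
).stdout
names = [n for n in out.splitlines() if "script_exporter" in n]
if names:
    subprocess.run(["docker", "rm", "-f", *names], check=True)
else:
    print("No script_exporter containers to remove")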
@xi-yang @PannuMuthu
Resolution:
We identified an issue where data scraping from the pushgateway was attempted before the data was actually pushed, leading to failures in data collection. This was rectified by reversing the operation order: data is now pushed to the pushgateway first, followed by a 20-second delay to allow time for the data to appear. After the delay, the l2debugging.sh script successfully scrapes the required data.
print("Data Dispatched")
time.sleep(20)
os.system("python3 ./se_config/generate_script.py flow.yaml")
print("Scraping Script Exporter")
os.system("./l2debugging.sh")
Resolution: Initially, I observed that running a single flow for 12 hours consumed the entire 40 GB of storage space. To address this:
Prometheus Data Retention Adjustment:
I updated the Prometheus configuration to reduce the data retention period to 6 hours using the following settings in docker-stack.yml:
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention=6h'
This change effectively reduced the storage footprint of Prometheus metrics.
Log File Rotation: Subsequent checks, after 36 hours of running a single flow, pinpointed log files as the major space consumers. To solve this, I implemented log rotation with the following settings:
logging:
options:
max-size: "10m"
max-file: "3"
This ensures that each log file is capped at 10 MB, and only the three most recent files are retained, thus preventing excessive space utilization by logs.
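For context, a sketch of how both settings sit together in a docker-stack.yml service definition (the surrounding service layout is assumed; only the command and logging entries, and the image tag from docker ps above, are from this thread):

services:
  prometheus:
    image: prom/prometheus:v2.2.1
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention=6h'
    logging:
      options:
        max-size: "10m"
        max-file: "3"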
Outcome: With these adjustments, running five flows simultaneously for three days resulted in a total space usage of only 2 GB out of the available 40 GB, demonstrating a significant improvement in resource management.
@sunami09 A few items to follow up for this week:
@sunami09 I created two service profiles for the Integration Tests 6 and 7. Note that I updated the test scenarios.
@xi-yang will provide service profiles for test scenarios 7 and 8.
@xi-yang will verify the manifest retrieval for different scenarios to make sure the manifest contains all the required hops.
I am closing this one too. Let's reopen new issues based on what is running right now on AutoGOLE monitoring. Most of these things were tested previously.
To complete the first milestone for production deployment, we want to support the following scenarios with stable workflows.
ETC:
1: by first week of November 2023
2, 3, 4: by mid-December 2023
5, 6, 7: by end of January 2024