F5Networks / f5-telemetry-streaming

F5 BIG-IP Telemetry Streaming
Apache License 2.0

Telemetry Streaming timing out due to probably large number of objects #254

Closed adnanenglish closed 1 year ago

adnanenglish commented 1 year ago

My name is Adnan. Please look me up in Teams if you work at F5. Please ping me if you want access to my lab.

The GTM (version 15.1.8.2) has 4K Wide IPs and 5,362 Pools. I loaded its config into my lab box:

https://10.155.4.0 admin/admin root/root

I am using f5-telemetry-1.33.0-1.noarch.rpm

Using a Pull Consumer, I was able to fetch the status of each Wide IP, and also to filter the output down to only the relevant critical fields. However, no matter what I tried with Pools, the request timed out or failed.

This is the CPU info, in case you want to see it:

config # lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 2
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 2245.780
BogoMIPS: 4491.56
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 262144K
NUMA node0 CPU(s): 0-3

From Postman, I tried the following:

1) I use this POST to generate the token, then copy the value and add it as a new header (X-F5-Auth-Token) in my subsequent requests: https://10.155.4.0/mgmt/shared/authn/login

{ "username":"admin", "password":"admin", "loginProviderName":"tmos" }
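For anyone scripting this step instead of using Postman, the same login request can be sketched in Python with only the standard library. The function names `build_login_request` and `extract_token` are hypothetical, and the `token.token` path in the response is taken from the `jq` filter used later in this report. The sketch only builds the request; actually sending it to a box with a self-signed certificate would also need certificate verification relaxed, as `curl -k` does.

```python
import json
import urllib.request


def build_login_request(host, username, password, provider="tmos"):
    """Build (but do not send) the POST to /mgmt/shared/authn/login."""
    body = json.dumps({
        "username": username,
        "password": password,
        "loginProviderName": provider,
    }).encode("utf-8")
    return urllib.request.Request(
        f"https://{host}/mgmt/shared/authn/login",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_token(response_text):
    """Return the value to send as the X-F5-Auth-Token header."""
    return json.loads(response_text)["token"]["token"]
```

The returned token string is what goes into the X-F5-Auth-Token header of the subsequent requests.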

2) I send this POST request to https://10.155.4.0/mgmt/shared/telemetry/declare

{
    "class": "Telemetry",
    "My_System": {
        "class": "Telemetry_System",
        "enable": "true",
        "systemPoller": [
            "DNS_Poller"
        ]
    },
    "DNS_Poller": {
        "class": "Telemetry_System_Poller",
        "interval": 0,
        "actions": [
            {
                "includeData": {},
                "locations": {
                    "aPools": {
                        "/Common/.": {
                            "enabledState": true,
                            "status.statusReason": true,
                            "availabilityState": true,
                            "enabled": true,
                            "name": true,
                            "members": {
                                ".": {
                                    "serverName": true,
                                    "enabledState": true,
                                    "status.statusReason": true,
                                    "vsName": true,
                                    "name": true,
                                    "enabled": true,
                                    "memberOrder": true
                                }
                            }
                        }
                    }
                }
            }
        ]
    },
    "DNS_Pull_Consumer": {
        "class": "Telemetry_Pull_Consumer",
        "type": "default",
        "systemPoller": [
            "DNS_Poller"
        ]
    }
}
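Since the same declaration gets retried below with different pool patterns and field lists, it may be easier to generate it than to edit it by hand. The following is a minimal sketch, not an official TS helper: `build_pool_declaration` is a hypothetical name, the structure simply mirrors the declaration above, and `enable` is written as a boolean because that is how the success response echoes it back.

```python
def build_pool_declaration(pool_pattern, fields, poller="DNS_Poller",
                           consumer="DNS_Pull_Consumer"):
    """Return a Telemetry declaration (as a dict) that keeps only
    `fields` for the A pools whose names match `pool_pattern`."""
    return {
        "class": "Telemetry",
        "My_System": {
            "class": "Telemetry_System",
            "enable": True,
            "systemPoller": [poller],
        },
        poller: {
            "class": "Telemetry_System_Poller",
            "interval": 0,
            "actions": [{
                "includeData": {},
                "locations": {
                    # One entry per requested field, each set to true.
                    "aPools": {pool_pattern: {name: True for name in fields}},
                },
            }],
        },
        consumer: {
            "class": "Telemetry_Pull_Consumer",
            "type": "default",
            "systemPoller": [poller],
        },
    }
```

For example, `build_pool_declaration("/Common/AIM", ["availabilityState"])` yields a declaration with a single-pool filter; passing the result through `json.dumps` gives the body for the POST to /mgmt/shared/telemetry/declare.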

We get a success message after we declare it.

{
    "message": "success",
    "declaration": {
        "class": "Telemetry",
        "My_System": {
            "class": "Telemetry_System",
            "enable": true,
            "systemPoller": [
                "DNS_Poller"
            ],
            "host": "localhost",
            "port": 8100,
            "protocol": "http",
            "allowSelfSignedCert": false
        },
        "DNS_Poller": {
            "class": "Telemetry_System_Poller",
            "interval": 0,
            "actions": [
                {
                    "includeData": {},
                    "locations": {
                        "aPools": {
                            "/Common/.": {
                                "enabledState": true,
                                "status.statusReason": true,
                                "availabilityState": true,
                                "enabled": true,
                                "name": true,
                                "members": {
                                    ".": {
                                        "serverName": true,
                                        "enabledState": true,
                                        "status.statusReason": true,
                                        "vsName": true,
                                        "name": true,
                                        "enabled": true,
                                        "memberOrder": true
                                    }
                                }
                            }
                        }
                    },
                    "enable": true
                }
            ],
            "host": "localhost",
            "port": 8100,
            "protocol": "http",
            "allowSelfSignedCert": false,
            "enable": true
        },
        "DNS_Pull_Consumer": {
            "class": "Telemetry_Pull_Consumer",
            "type": "default",
            "systemPoller": [
                "DNS_Poller"
            ],
            "enable": true,
            "trace": false
        },
        "schemaVersion": "1.33.0"
    }
}

Next, I try to pull the data so that we can view it, by sending the following GET request:

https://10.155.4.0/mgmt/shared/telemetry/pullconsumer/DNS_Pull_Consumer

The requests timed out after 60 seconds, so I raised the following timeout values, saved the config, and restarted the services:

tmsh modify sys db icrd.timeout value 600

tmsh modify sys db restjavad.timeout value 600

tmsh modify sys db restnoded.timeout value 600

I tested again, and now I get a different error after 5 minutes: "502 Proxy Error".

The 502 Proxy Error is probably returned by the Apache web server that proxies requests to the /mgmt/ endpoints. By default, when the upstream has not answered within 5 minutes, the Apache proxy treats this as an invalid response from the upstream and returns the 502 Proxy Error. I increased the httpd timeouts, which got rid of the 502 error.

Now I get a 503 "Connection closed" error after about eight and a half minutes.

bigip=10.155.4.0
token=$(curl -sk -X POST -H "Content-Type: application/json" -d '{"username":"admin", "password":"admin", "loginProviderName": "tmos"}' "https://${bigip}/mgmt/shared/authn/login" | jq -r '.token.token')
time curl -sk -H "X-F5-Auth-Token: ${token}" "https://${bigip}/mgmt/shared/telemetry/pullconsumer/DNS_Pull_Consumer"

{"code":503,"message":"Connection closed","referer":"10.155.4.0","restOperationId":342012,"kind":":resterrorresponse"}

real 8m25.092s
user 0m0.062s
sys 0m0.009s
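To compare how long different declarations take before they fail, the timing and error check can also be scripted. This is a rough sketch under stated assumptions: `timed_pull` is a hypothetical helper, the caller supplies a `fetch` function that performs the GET and returns the body as text, and a REST error is recognised only by the `kind` value seen in the 503 response above.

```python
import json
import time


def timed_pull(fetch):
    """Call fetch(), which must return the response body as text, and
    report how long it took.

    Returns (elapsed_seconds, parsed_body, error_code). error_code is the
    "code" field when the body looks like the REST error object shown
    above (its "kind" ends with "resterrorresponse"), otherwise None.
    """
    start = time.monotonic()
    text = fetch()
    elapsed = time.monotonic() - start
    body = json.loads(text)
    is_error = (
        isinstance(body, dict)
        and str(body.get("kind", "")).endswith("resterrorresponse")
    )
    return elapsed, body, (body.get("code") if is_error else None)
```

Keeping the HTTP call behind `fetch` means the same helper works whether the request is made with `urllib`, another HTTP client, or a canned response while testing.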

In addition to all of the above, here are some examples of declarations we tried, most of which failed:

1) Failed

{
    "class": "Telemetry",
    "My_System": {
        "class": "Telemetry_System",
        "enable": "true",
        "systemPoller": ["My_VS_Poller"]
    },
    "My_VS_Poller": {
        "class": "Telemetry_System_Poller",
        "interval": 0,
        "actions": [
            {
                "includeData": {},
                "locations": {
                    "aPools": {
                        "/Common/z.*": {
                            "status.availabilityState": true
                        }
                    }
                }
            }
        ]
    },
    "My_Pull_Consumer": {
        "class": "Telemetry_Pull_Consumer",
        "type": "default",
        "systemPoller": ["My_VS_Poller"]
    }
}

2) Failed

"My_VS_Poller": {
    "class": "Telemetry_System_Poller",
    "interval": 0,
    "actions": [
        {
            "excludeData": {},
            "locations": {
                "system": true
            }
        },
        {
            "includeData": {},
            "locations": {
                "aPools": {
                    ".*": {
                        "status.availabilityState": true
                    }
                }
            }
        }
    ]
},

3) Wide IPs: managed to get the status (time: 320 ms, size: 1.33 KB)

"My_VS_Poller": {
    "class": "Telemetry_System_Poller",
    "interval": 0,
    "actions": [
        {
            "excludeData": {},
            "locations": {
                "system": true
            }
        },
        {
            "includeData": {},
            "locations": {
                "aWideIps": {
                    "/Common/z.*": {
                        "status.availabilityState": true
                    }
                }
            }
        }
    ]
},

4) Fetching the status of one specific pool, “/Common/AIM”, failed: Postman reported a connection reset, with no response.

"My_VS_Poller": {
    "class": "Telemetry_System_Poller",
    "interval": 0,
    "actions": [
        {
            "excludeData": {},
            "locations": {
                "system": true
            }
        },
        {
            "includeData": {},
            "locations": {
                "aPools": {
                    "/Common/AIM": {
                        "availabilityState": true
                    }
                }
            }
        }
    ]
},

5) Tried to add the pre-processing filters described at https://clouddocs.f5.com/products/extensions/f5-telemetry-streaming/latest/data-modification.html#pre-optimization-system-poller-only so that only aPools whose names start with “A” (about 20 pools) are matched, but it still failed.

"My_VS_Poller": {
    "class": "Telemetry_System_Poller",
    "interval": 0,
    "actions": [
        {
            "includeData": {},
            "ifAnyMatch": [
                {
                    "aPools": {
                        "/Common/A*": {
                            "enabled": true
                        }
                    }
                }
            ],
            "locations": {
                "aPools": {
                    "/Common/A*": {
                        "availabilityState": true
                    }
                }
            }
        }
    ]
},

Questions:

1- Is there any way to set up the DNS_Pull_Consumer so that it applies to all (.*) pools, and then craft a GET request that returns the status of only 3-4 pools instead of everything?

2- If there is a limit on the number of config objects that TS can handle on a BIG-IP, can you please specify what that limit is?
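Regarding question 1, picking a few pools out of a full response can at least be done on the client side, as sketched below. This is an illustration only and rests on assumptions: `select_pools` is a hypothetical helper, and the parsed pull consumer output is assumed to be either one object or a list of objects containing an `aPools` map keyed by pool name, mirroring the `locations` layout in the declarations above. It filters after polling, so it does not reduce the work the System Poller does and would not by itself avoid the timeouts described here.

```python
def select_pools(pull_output, wanted):
    """Keep only the named A pools from parsed pull consumer output.

    `pull_output` may be a single dict or a list of dicts (assumed shape);
    `wanted` is a collection of full pool names such as "/Common/AIM".
    """
    entries = pull_output if isinstance(pull_output, list) else [pull_output]
    wanted = set(wanted)
    selected = {}
    for entry in entries:
        pools = entry.get("aPools", {}) if isinstance(entry, dict) else {}
        for name, data in pools.items():
            if name in wanted:
                selected[name] = data
    return selected
```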

G-gonzalezjimenez commented 1 year ago

An SR is open; tracking this internally.