F5Networks / f5-telemetry-streaming

F5 BIG-IP Telemetry Streaming
Apache License 2.0
53 stars 24 forks source link

Restnoded high CPU and Memory when handling ~700 http request per second Request logging. #233

Open eos500n2000 opened 1 year ago

eos500n2000 commented 1 year ago

Environment

Summary

Restnoded high CPU and Memory when handling ~700 http request per second Request logging.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Submit the following declaration:
    [root@HQF5LB02:Active:Disconnected (Trust Domain Only)] config # restcurl -u admin:admin /mgmt/shared/telemetry/declare
    {
    "message": "success",
    "declaration": {
    "class": "Telemetry",
    "controls": {
      "class": "Controls",
      "logLevel": "debug",
      "debug": true,
      "memoryThresholdPercent": 90
    },
    "My_System": {
      "class": "Telemetry_System",
      "systemPoller": {
        "interval": 60,
        "enable": true,
        "actions": [
          {
            "setTag": {
              "tenant": "`T`",
              "application": "`A`"
            },
            "enable": true
          }
        ]
      },
      "enable": true,
      "host": "localhost",
      "port": 8100,
      "protocol": "http",
      "allowSelfSignedCert": false
    },
    "My_Listener": {
      "class": "Telemetry_Listener",
      "port": 6514,
      "enable": true,
      "trace": false,
      "match": "",
      "actions": [
        {
          "setTag": {
            "tenant": "`T`",
            "application": "`A`"
          },
          "enable": true
        }
      ]
    },
    "Httptest_Consumer": {
      "class": "Telemetry_Consumer",
      "type": "Generic_HTTP",
      "host": "10.20.4.176",
      "protocol": "http",
      "port": 8001,
      "enable": true,
      "trace": true,
      "path": "/",
      "allowSelfSignedCert": false,
      "method": "POST",
      "outputMode": "processed",
      "compressionType": "none"
    },
    "schemaVersion": "1.32.0"
    }
    }

[VS and Request log Config]

ltm pool /Common/telemetry { members { /Common/255.255.255.254:6514 { address 255.255.255.254 } } } ltm rule /Common/telemetry_local_rule { when CLIENT_ACCEPTED { node 127.0.0.1 6514 } }

ltm virtual /Common/telemetry_local { creation-time 2022-09-28:14:16:12 destination /Common/255.255.255.254:6514 ip-protocol tcp last-modified-time 2022-09-28:14:16:12 mask 255.255.255.255 persist { /Common/source_addr { default yes } } profiles { /Common/f5-tcp-progressive { } } rules { /Common/telemetry_local_rule } serverssl-use-sni disabled source 0.0.0.0/0 source-address-translation { type automap } translate-address enabled translate-port enabled }

ltm profile request-log /Common/telemetry { app-service none defaults-from /Common/request-log request-log-pool /Common/telemetry request-log-protocol mds-tcp request-log-template event_source=\"request_logging\",hostname=\"$BIGIP_HOSTNAME\",client_ip=\"$CLIENT_IP\",server_ip=\"$SERVER_IP\",http_method=\"$HTTP_METHOD\",http_uri=\"$HTTP_URI\",virtual_name=\"$VIRTUAL_NAME\",event_timestamp=\"$DATE_HTTP\" request-logging enabled response-log-pool /Common/telemetry response-log-protocol mds-tcp response-log-template event_source=\"response_logging\",hostname=\"$BIGIP_HOSTNAME\",client_ip=\"$CLIENT_IP\",server_ip=\"$SERVER_IP\",http_method=\"$HTTP_METHOD\",http_uri=\"$HTTP_URI\",virtual_name=\"$VIRTUAL_NAME\",event_timestamp=\"$DATE_HTTP\",http_statcode=\"$HTTP_STATCODE\",http_status=\"$HTTP_STATUS\",response_ms=\"$RESPONSE_MSECS\" response-logging enabled }

[http traffic test VS and pool]

ltm virtual /Common/telemetry_test_vs { creation-time 2022-10-04:19:38:53 destination /Common/10.20.4.16:80 ip-protocol tcp last-modified-time 2022-10-04:19:38:53 mask 255.255.255.255 pool /Common/telemetry_test_pool profiles { /Common/http { } /Common/tcp { } /Common/telemetry { } } serverssl-use-sni disabled source 0.0.0.0/0 source-address-translation { type automap } translate-address enabled translate-port enabled }

ltm pool /Common/telemetry_test_pool { members { /Common/12.0.0.180:80 { address 12.0.0.180 } } monitor /Common/tcp }

Send ~ 700 request/second to telemetry_test_vs.

  1. Observe the following error response: Restnoded high CPU and Memory, restarted in about 7 mins.

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

20972 restnod+ 20 0 2821160 1.6g 16860 R 124.6 0.3 7:07.54 /usr/bin/f5-rest-node /usr/share/rest/node/src/restnode.js -p 8105 --logLevel finest -i /var/log/restnoded/restnod+

top - 15:20:04 up 1 day, 20:28, 1 user, load average: 4.11, 3.23, 2.70

Tasks: 1 total, 1 running, 0 sleeping, 0 stopped, 0 zombie

%Cpu(s): 6.8 us, 1.5 sy, 0.6 ni, 91.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st

KiB Mem : 52832300+total, 35512116 free, 48742419+used, 5386728 buff/cache

KiB Swap: 5242876 total, 5242876 free, 0 used. 35985340 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

20972 restnod+ 20 0 2817576 1.5g 16860 R 126.2 0.3 7:11.34 /usr/bin/f5-rest-node /usr/share/rest/node/src/restnode.js -p 8105 --logLevel finest -i /var/log/restnoded/restnod+

Broadcast message from [systemd-journald@HQF5LB02] (Wed 2022-11-09 15:20:05 CST):

logger[24708]: Re-starting restnoded

2022 Nov 9 15:20:05 [HQF5LB02] logger[24708]: Re-starting restnoded

Expected Behavior

Restnoded does not restart

Actual Behavior

Restnoded restarts

PS: A F5 Support case [00277783] has been filed for the same issue.

eos500n2000 commented 1 year ago

1.) Tested ~ 700 http requests per second in lab for request logging. restnoded CPU was 100%, Memory went to 1.6G very quickly and Restnoded restarted after about 7~8 mins.

2.) I used http consumer in lab ( customer uses Azure log analysis). PE suggested to add two customOpts as below. "Httptest_Consumer": { "class": "Telemetry_Consumer", "type": "Generic_HTTP", "host": "10.20.4.176", "protocol": "http", "port": 8001, "enable": true, "trace": true, "path": "/", "allowSelfSignedCert": false, "method": "POST", "customOpts": [ { "name": "keepAlive","value": true }, ##### { "name": "maxSockets", "value": 30} ##### ]

There is no improvements. Restnoded still restarts after about 7 mins.

And there is no such tuning options for Azure LA consumer anyway.

3.) Also tested disabled/ removed all consumers. ( for debugging)

"Httptest_Consumer": { "class": "Telemetry_Consumer", "type": "Generic_HTTP", "host": "10.20.4.176", "protocol": "http", "port": 8001, "enable": false, ######## disabled "trace": true, "path": "/",

With 700 TPS http traffic, restnodes memory was stable, CPU is between 50%~70% (comparing with ~0% when there is no traffic.) And no restarts.

G-gonzalezjimenez commented 1 year ago

Thanks, created ticket TS718 internally