bunkerity / bunkerweb

🛡️ Open-source and next-generation Web Application Firewall (WAF)
https://www.bunkerweb.io
GNU Affero General Public License v3.0
6.4k stars 361 forks source link

[BUG] High CPU usage #154

Closed Crazy3lf closed 2 years ago

Crazy3lf commented 3 years ago

Description I was trying to upgrade my previous stack to swarm mode with autoconfig. I notice the CPU quickly reach 100% on my droplet after a few minutes. image docker stats:

CONTAINER ID   NAME                         CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O       PIDS
2e2234b10985   bunkerized-nginx-container   97.37%    274.8MiB / 981.3MiB   28.00%    217kB / 25.9MB    349MB / 0B      3

docker top bunkerized-nginx-container:

UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
systemd+            2475890             2475861             71                  10:44               ?                   00:42:47            /bin/bash /opt/entrypoint/entrypoint.sh
systemd+            2476001             2475890             0                   10:44               ?                   00:00:00            crond
systemd+            2476830             2475890             0                   10:46               ?                   00:00:00            nginx: worker process

This is a droplet I use for testing so it is in idle state (not handling a lot of requests), my old stack (with v1.2.3) is under 4% CPU at idle state.

How to reproduce docker-compose.yml:

version: '3.8'
services:
  autoconf:
    image: bunkerity/bunkerized-nginx-autoconf:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - confs:/etc/nginx
      - server_confs:/server-confs
      - certs:/etc/letsencrypt
      - acme-challenge:/acme-challenge
    environment:
      - SWARM_MODE=yes
      - API_URI=/autoconfchallenge
    networks:
      - net_config
    deploy:
      replicas: 1

  nginx:
    image: bunkerity/bunkerized-nginx:1.2.7
    ports:
      - published: 80
        target: 8080
        mode: host
        protocol: tcp
      - published: 443
        target: 8443
        mode: host
        protocol: tcp
    volumes:
      - confs:/etc/nginx
      - server_confs:/server-confs
      - certs:/etc/letsencrypt
      - acme-challenge:/acme-challenge:ro
      - www:/www:ro
    environment:
      - MULTISITE=yes
      - SERVER_NAME=frontend.example.com
      - DISABLE_DEFAULT_SERVER=yes
      - SWARM_MODE=yes
      - USE_API=yes
      - API_URI=/autoconfchallenge
      - AUTO_LETS_ENCRYPT=yes
      - REDIRECT_HTTP_TO_HTTPS=yes
      - PROXY_REAL_IP=yes
      - USE_BAD_BEHAVIOR=yes
      - USE_CLIENT_CACHE=yes
    networks:
      - net_config
      - net_services
    deploy:
      mode: global
      labels:
        - "bunkerized-nginx.AUTOCONF"

  backend:
    image: my-backend-image:latest
    networks:
      - net_services
    deploy:
      replicas: 1
      labels:
        - "bunkerized-nginx.SERVER_NAME=backend.example.com"
        - "bunkerized-nginx.SERVE_FILES=no"
        - "bunkerized-nginx.USE_MODSECURITY_CRS=no"
        - "bunkerized-nginx.USE_MODSECURITY=yes"
        - "bunkerized-nginx.USE_BAD_BEHAVIOR=yes"

networks:
  net_config:
    driver: overlay
  net_services:
    driver: overlay

volumes:
  confs:
  server_confs:
  certs:
  www:
  data:
  acme-challenge:

Logs

autoconfig logs:

...
[2021-07-15 02:44:33] [*] Activated config for backend.example.com
[2021-07-15 02:44:33] [*] Starting jobs (type = post) ...
[2021-07-15 02:44:34] [*] Starting jobs (type = post) ...
[2021-07-15 02:44:34] [*] Jobs stdout :
[2021-07-15 02:44:34] [*] Copying cached referrers.list ...

[2021-07-15 02:44:34] [*] Starting jobs (type = post) ...
[2021-07-15 02:44:34] [*] Jobs stdout :
[2021-07-15 02:44:34] [*] Copying cached user-agents.list ...
[*] Copying cached referrers.list ...

[2021-07-15 02:44:41] [*] Starting jobs (type = post) ...
[2021-07-15 02:44:41] [*] Jobs stdout :
[2021-07-15 02:44:41] [*] Copying cached user-agents.list ...
[*] Copying cached referrers.list ...
[*] Copying cached tor-exit-nodes.list ...

[2021-07-15 02:46:19] [*] Sent API order /reload to instance bunkerized-nginx-container (service.node.task)
[2021-07-15 02:46:19] [*] Starting jobs (type = post) ...
[2021-07-15 02:46:19] [*] Jobs stdout :
[2021-07-15 02:46:19] [*] Copying cached user-agents.list ...
[*] Copying cached referrers.list ...
[*] Copying cached tor-exit-nodes.list ...
[*] Copying cached abusers.list ...

[2021-07-15 03:00:01] [*] Sent API order /reload to instance bunkerized-nginx-container (service.node.task)
[2021-07-15 03:00:01] [*] Starting jobs (type = post) ...
[2021-07-15 03:00:02] [*] Jobs stdout :
[2021-07-15 03:00:02] [*] Copying cached user-agents.list ...
[*] Copying cached referrers.list ...
[*] Copying cached tor-exit-nodes.list ...
[*] Copying cached proxies.list ...
[*] Copying cached abusers.list ...

bunkerized-nginx logs:

2021/07/15 02:45:41 [info] 89#89: *10 SSL_do_handshake() failed (SSL: error:14094412:SSL routines:ssl3_read_bytes:sslv3 alert bad certificate:SSL alert number 42) while SSL handshaking, client: [ip-address], server: 0.0.0.0:8443
2021/07/15 02:46:18 [notice] 107#107: ModSecurity-nginx v1.0.2 (rules loaded inline/local/remote: 0/915/0)
2021/07/15 02:46:18 [notice] 107#107: signal process started
2021/07/15 02:46:19 [notice] 89#89: *11 [API] API call /autoconfchallenge/reload successfull from 10.0.54.3, client: 10.0.54.3, server: _, request: "POST /autoconfchallenge/reload HTTP/1.1", host: "bunkerized-nginx-container:8080"
bunkerized-nginx-container 10.0.54.3 - - [15/Jul/2021:02:46:18 +0000] "POST /autoconfchallenge/reload HTTP/1.1" 200 12 "-" "python-requests/2.25.1"
2021/07/15 02:46:19 [info] 89#89: *11 client 10.0.54.3 closed keepalive connection
2021/07/15 02:46:19 [error] 9#9: [INIT] *NOT AN ERROR* loaded 1782383 IPs from /etc/nginx/proxies.list
2021/07/15 02:46:19 [error] 9#9: [INIT] *NOT AN ERROR* loaded 213387 IPs from /etc/nginx/abusers.list
2021/07/15 02:46:19 [error] 9#9: [INIT] *NOT AN ERROR* loaded 1045 IPs from /etc/nginx/tor-exit-nodes.list
2021/07/15 02:46:19 [error] 9#9: [INIT] *NOT AN ERROR* loaded 1713 entries from /etc/nginx/user-agents.list
2021/07/15 02:46:19 [error] 9#9: [INIT] *NOT AN ERROR* loaded 7068 entries from /etc/nginx/referrers.list
backend.example.com [ip-address] - - [15/Jul/2021:02:46:28 +0000] "GET / HTTP/2.0" 401 27 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0"
2021/07/15 03:00:01 [notice] 311#311: ModSecurity-nginx v1.0.2 (rules loaded inline/local/remote: 0/915/0)
2021/07/15 03:00:01 [notice] 311#311: signal process started
2021/07/15 03:00:01 [notice] 160#160: *15 [API] API call /autoconfchallenge/reload successfull from 10.0.54.3, client: 10.0.54.3, server: _, request: "POST /autoconfchallenge/reload HTTP/1.1", host: "bunkerized-nginx-container:8080"
bunkerized-nginx-container 10.0.54.3 - - [15/Jul/2021:03:00:01 +0000] "POST /autoconfchallenge/reload HTTP/1.1" 200 12 "-" "python-requests/2.25.1"
2021/07/15 03:00:01 [info] 160#160: *15 client 10.0.54.3 closed keepalive connection
/opt/entrypoint/entrypoint.sh: line 105:     9 Killed                  nginx

Additional info The CPU usage shown above is in my timezone (GMT+8). The CPU usage spike at about 11:01 which is 03:01 UTC, 1 minutes after the second reload called. Hope this info helps.

fl0ppy-d1sk commented 3 years ago

Hello @Crazy3lf,

Just in case, can you share your CPU spec ?

I will try to reproduce the issue when I have time and will keep you updated.

Crazy3lf commented 3 years ago

@fl0ppy-d1sk It was shared-cpu droplet on DigitalOcean. lscpu output:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   40 bits physical, 48 bits virtual
CPU(s):                          1
On-line CPU(s) list:             0
Thread(s) per core:              1
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           63
Model name:                      DO-Regular
Stepping:                        2
CPU MHz:                         2494.106
BogoMIPS:                        4988.21
Virtualization:                  VT-x
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       32 KiB
L1i cache:                       32 KiB
L2 cache:                        4 MiB
NUMA node0 CPU(s):               0
fl0ppy-d1sk commented 3 years ago

Hello @Crazy3lf,

Can you try the new v1.3.0 release ? I've made a refactoring of the different jobs, it may have changed something regarding the CPU usage.

Crazy3lf commented 3 years ago

Hi @fl0ppy-d1sk , I have tried it, the CPU still goes up to 100% but it is slower than in v1.2.7. In v1.2.7 it reach 100% in minutes, but in v1.3.0, it reach after around 30 minutes to an hour.

image

Logs logs from 2nd try
autoconfig logs:

[*] Starting autoconf ...
...
[2021-08-25 02:25:44] job - INFO - executing job abusers
[2021-08-25 02:25:45] job - INFO - job abusers successfully executed
[2021-08-25 02:25:45] job - INFO - skipped reload operation because it's not needed (job = abusers)

[2021-08-25 02:25:46] config - INFO - waiting 2 seconds before retrying to contact bunkerized-nginx instances
[2021-08-25 02:25:48] config - INFO - waiting 3 seconds before retrying to contact bunkerized-nginx instances
[2021-08-25 02:25:52] config - INFO - waiting 4 seconds before retrying to contact bunkerized-nginx instances
[2021-08-25 02:25:56] config - INFO - waiting 5 seconds before retrying to contact bunkerized-nginx instances
[2021-08-25 02:26:01] config - INFO - successfully sent API order to http://10.0.65.7:8080/autoconfchallenge/ping
[2021-08-25 02:26:01] config - INFO - bunkerized-nginx instances started
[2021-08-25 02:26:01] autoconf - INFO - bunkerized-nginx instances started
[2021-08-25 02:26:01] autoconf - INFO - waiting for events ...
[2021-08-25 03:00:00] job - INFO - executing job exit-nodes
[2021-08-25 03:00:01] job - INFO - job exit-nodes successfully executed
[2021-08-25 03:00:01] config - INFO - successfully sent API order to http://10.0.65.7:8080/autoconfchallenge/reload
[2021-08-25 03:00:01] job - INFO - reload operation successfully executed (job = exit-nodes)
[2021-08-25 04:00:01] job - INFO - executing job exit-nodes
[2021-08-25 04:00:02] job - INFO - job exit-nodes successfully executed

bunkerized-nginx logs:

[2021-08-25 02:25:26] entrypoint - INFO - starting bunkerized-nginx ...
[2021-08-25 02:25:26] entrypoint - INFO - skipping configuration process
[2021-08-25 02:25:26] entrypoint - INFO - waiting until config has been generated ...
[2021-08-25 02:25:26] entrypoint - INFO - running nginx ...
2021/08/25 02:25:26 [notice] 12#12: ModSecurity-nginx v1.0.2 (rules loaded inline/local/remote: 0/913/0)
[2021-08-25 02:25:19] entrypoint - INFO - starting bunkerized-nginx ...
[2021-08-25 02:25:19] entrypoint - INFO - configuring bunkerized-nginx ...
[*] Successfully started temp nginx
[2021-08-25 02:25:19] entrypoint - INFO - waiting until config has been generated ...
2021/08/25 02:25:19 [notice] 58#58: signal process started
[2021-08-25 02:25:19] entrypoint - INFO - running nginx ...
2021/08/25 02:25:19 [emerg] 60#60: open() "/etc/nginx/backend.example.com/server.conf" failed (2: No such file or directory) in /etc/nginx/nginx.conf:119
nginx: [emerg] open() "/etc/nginx/backend.example.com/server.conf" failed (2: No such file or directory) in /etc/nginx/nginx.conf:119
[2021-08-25 02:25:19] entrypoint - INFO - bunkerized-nginx stopped
2021/08/25 02:25:26 [error] 12#12: [INIT] *NOT AN ERROR* loaded 1847172 IPs from /etc/nginx/proxies.list
nginx: [error] [INIT] *NOT AN ERROR* loaded 1847172 IPs from /etc/nginx/proxies.list
2021/08/25 02:25:26 [error] 12#12: [INIT] *NOT AN ERROR* loaded 229417 IPs from /etc/nginx/abusers.list
nginx: [error] [INIT] *NOT AN ERROR* loaded 229417 IPs from /etc/nginx/abusers.list
2021/08/25 02:25:26 [error] 12#12: [INIT] *NOT AN ERROR* loaded 1286 IPs from /etc/nginx/tor-exit-nodes.list
nginx: [error] [INIT] *NOT AN ERROR* loaded 1286 IPs from /etc/nginx/tor-exit-nodes.list
2021/08/25 02:25:26 [error] 12#12: [INIT] *NOT AN ERROR* loaded 1984 entries from /etc/nginx/user-agents.list
nginx: [error] [INIT] *NOT AN ERROR* loaded 1984 entries from /etc/nginx/user-agents.list
2021/08/25 02:25:26 [error] 12#12: [INIT] *NOT AN ERROR* loaded 7080 entries from /etc/nginx/referrers.list
nginx: [error] [INIT] *NOT AN ERROR* loaded 7080 entries from /etc/nginx/referrers.list
2021/08/25 02:26:01 [notice] 36#36: *1 [API] API call /autoconfchallenge/ping successfull from 10.0.65.3, client: 10.0.65.3, server: _, request: "POST /autoconfchallenge/ping HTTP/1.1", host: "10.0.65.7:8080"
2021/08/25 02:26:01 [info] 36#36: *1 client 10.0.65.3 closed keepalive connection
10.0.65.7 10.0.65.3 - - [25/Aug/2021:02:26:01 +0000] "POST /autoconfchallenge/ping HTTP/1.1" 200 12 "-" "python-requests/2.25.1"
2021/08/25 03:00:01 [notice] 391#391: ModSecurity-nginx v1.0.2 (rules loaded inline/local/remote: 0/913/0)
2021/08/25 03:00:01 [notice] 391#391: signal process started
2021/08/25 03:00:01 [notice] 36#36: *10 [API] API call /autoconfchallenge/reload successfull from 10.0.65.3, client: 10.0.65.3, server: _, request: "POST /autoconfchallenge/reload HTTP/1.1", host: "10.0.65.7:8080"
10.0.65.7 10.0.65.3 - - [25/Aug/2021:03:00:01 +0000] "POST /autoconfchallenge/reload HTTP/1.1" 200 12 "-" "python-requests/2.25.1"
2021/08/25 03:00:01 [info] 36#36: *10 client 10.0.65.3 closed keepalive connection
/opt/bunkerized-nginx/entrypoint/entrypoint.sh: line 110:    12 Killed                  nginx -g 'daemon off;'
2021/08/25 03:14:36 [info] 36#36: *12 SSL_do_handshake() failed (SSL: error:1420918C:SSL routines:tls_early_post_process_client_hello:version too low) while SSL handshaking, client: 64.62.197.2, server: 0.0.0.0:8443
Crazy3lf commented 3 years ago

I found out that the timing of the CPU usage climb happen after minute 0 of the next hour, which maybe correspond to this cron job. I hope this info helps.

fl0ppy-d1sk commented 2 years ago

Tested new BunkerWeb in a DO instance and CPU usage is no more a problem.