elastic / beats

Beats - Lightweight shippers for Elasticsearch & Logstash
https://www.elastic.co/products/beats

Auditbeat 7.7.x Poor Performance: 100%+ CPU Usage with System Module Socket Dataset Enabled #19141

Closed BenB196 closed 4 years ago

BenB196 commented 4 years ago

Auditbeat 7.7.x with the System module's socket dataset enabled will randomly start using 100%+ CPU on some servers. This was not an issue prior to 7.7.x.

Restarting the Auditbeat services causes CPU usage to go back to normal for a bit, but it will eventually start having issues again.

This issue doesn't happen on every server: I run Auditbeat on ~100 servers with the same config (below), and the issue occurs on 10-15% of them. I see it on both openSUSE and CentOS servers, on multiple kernels, and with different applications running.

Screenshot showing the issue (percentages on the graph are of total CPU, not of individual cores; this example server has 4 cores, meaning Auditbeat is using one of them completely for itself):

image
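
The note about the graph is a unit conversion: a share of total CPU multiplied by the core count gives the equivalent share of a single core. A minimal Go sketch of that arithmetic follows; the function name is hypothetical, and the 25% input is simply what "one of four cores fully busy" works out to, not a value read off the graph.

```go
package main

import "fmt"

// singleCoreEquivalent converts a percentage of total CPU capacity into the
// equivalent percentage of one core on a machine with the given core count.
// Hypothetical helper for illustration only.
func singleCoreEquivalent(totalPercent float64, cores int) float64 {
	return totalPercent * float64(cores)
}

func main() {
	// One fully busy core on a 4-core server shows up as 25% of total CPU.
	fmt.Printf("%.0f%% of one core\n", singleCoreEquivalent(25, 4))
}
```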

Version Output:

auditbeat version
auditbeat version 7.7.1 (amd64), libbeat 7.7.1 [932b273e8940575e15f10390882be205bad29e1f built 2020-05-28 15:20:33 +0000 UTC]

System versions:

# uname -a
Linux server 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

# cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

Configuration:

###################### Auditbeat Configuration #########################

#==========================  Modules configuration =============================
auditbeat.modules:

- module: auditd
  resolve_ids: true
  failure_mode: silent
  backlog_limit: 8192
  rate_limit: 0
  include_raw_message: false
  include_warnings: false
  backpressure_strategy: auto
  # Load audit rules from separate files. Same format as audit.rules(7).
  audit_rule_files: [ '${path.config}/audit.rules.d/*.conf' ]
  audit_rules: |
    ## Define audit rules here.
    ## Create file watches (-w) or syscall audits (-a or -A). Uncomment these
    ## examples or add your own rules.

    ## If you are on a 64 bit platform, everything should be running
    ## in 64 bit mode. This rule will detect any use of the 32 bit syscalls
    ## because this might be a sign of someone exploiting a hole in the 32
    ## bit API.
    -a always,exit -F arch=b32 -S all -F key=32bit-abi

    ## Executions.
    -a always,exit -F arch=b64 -S execve,execveat -k exec

    ## External access (warning: these can be expensive to audit).
    -a always,exit -F arch=b64 -S accept,bind,connect -F key=external-access

    ## Identity changes.
    -w /etc/group -p wa -k identity
    -w /etc/passwd -p wa -k identity
    -w /etc/gshadow -p wa -k identity
    -w /etc/shadow -p wa -k identity

    ## Unauthorized access attempts.
    -a always,exit -F arch=b32 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EACCES -k access
    -a always,exit -F arch=b32 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EPERM -k access
    -a always,exit -F arch=b64 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EACCES -k access
    -a always,exit -F arch=b64 -S open,creat,truncate,ftruncate,openat,open_by_handle_at -F exit=-EPERM -k access

- module: file_integrity
  paths:
  - /bin
  - /usr/bin
  - /sbin
  - /usr/sbin
  - /etc
  - /root
  - /usr/local/bin
  - /home
  exclude_files:
  - '(?i)\.sw[nop]$'
  - '~$'
  - '/\.git($|/)'
  - '\.rrd$'
  include_files: []
  scan_at_start: true
  scan_rate_per_sec: 50 MiB
  max_file_size: 100 MiB
  hash_types: [md5,sha256]
  recursive: true

- module: system
  datasets:
    - host    # General host information, e.g. uptime, IPs
    - login   # User logins, logouts, and system boots.
    - package # Installed, updated, and removed packages
    - process # Started and stopped processes
    - socket  # Opened and closed sockets
    - user    # User information

  # How often datasets send state updates with the
  # current state of the system (e.g. all currently
  # running processes, all open sockets).
  state.period: 12h

  # Enabled by default. Auditbeat will read password fields in
  # /etc/passwd and /etc/shadow and store a hash locally to
  # detect any changes.
  user.detect_password_changes: true

  # File patterns of the login record files.
  login.wtmp_file_pattern: /var/log/wtmp*
  login.btmp_file_pattern: /var/log/btmp*

#================================ Outputs =====================================

#----------------------------- Logstash output --------------------------------
output.logstash:
  # The Logstash hosts
  hosts: ["<snipped>"]
  loadbalance: true

#================================ Processors =====================================

processors:
  - add_host_metadata: ~
  - add_tags:
      tags: [auditbeat]
  - dns:
      type: reverse
      fields:
        server.ip: server.hostname
        client.ip: client.hostname
        source.ip: source.hostname
        destination.ip: destination.hostname
      nameservers: ['<snipped>']
      tag_on_failure: [_dns_reverse_lookup_failed]

#================================ Logging =====================================

logging.level: info
logging.to_files: true
logging.files:
  path: /var/log/auditbeat
  name: auditbeat
  keepfiles: 2
  permissions: 0600
  rotateeverybytes: 5242880

#============================== X-Pack Monitoring ===============================
monitoring.enabled: true
monitoring.elasticsearch:
  hosts: ["<snipped>"]
  protocol: "https"
  username: "<snipped>"
  password: "<snipped>"
  ssl.enabled: true
  ssl.verification_mode: full
  ssl.certificate_authorities: ["<snipped>"]
monitoring.cluster_uuid: "<snipped>"


elasticmachine commented 4 years ago

Pinging @elastic/siem (Team:SIEM)

awei82 commented 4 years ago

I'm running into the exact same issue with Auditbeat 7.7.1 on Ubuntu 16.04.

adriansr commented 4 years ago

It looks like you're running into the issue fixed by https://github.com/elastic/beats/pull/19033.

The fix was too late for 7.7.1, but it will make it into 7.8.0.

adriansr commented 4 years ago

Fix available in 7.8.0

tlandschoff-scale commented 4 years ago

I have that version installed and I am still seeing this problem:

$ auditbeat version
auditbeat version 7.8.0 (amd64), libbeat 7.8.0 [f79387d32717d79f689d94fda1ec80b2cf285d30 built 2020-06-14 18:11:10 +0000 UTC]

According to perf top, this is where the CPU time goes:

  42,62%  auditbeat [.] runtime.mapaccess2_fast64
  15,19%  auditbeat [.] github.com/elastic/beats/v7/x-pack/auditbeat/module/system/socket.(*state).ExpireOlder
  10,50%  auditbeat [.] runtime.aeshash64
   7,76%  auditbeat [.] github.com/elastic/beats/v7/x-pack/auditbeat/module/system/socket.(*state).onSockDestroyed
   3,57%  auditbeat [.] time.Time.Before
   2,92%  auditbeat [.] github.com/elastic/beats/v7/x-pack/auditbeat/module/system/socket.(*socket).Timestamp

As this profile suggests, removing the socket dataset from the system module makes this problem go away:

--- auditbeat.yml.cpuhog    2020-06-23 09:22:49.122378568 +0200
+++ auditbeat.yml   2020-06-23 09:22:58.938317272 +0200
@@ -59,7 +59,7 @@
     - host    # General host information, e.g. uptime, IPs
     - login   # User logins, logouts, and system boots.
     - process # Started and stopped processes
-    - socket  # Opened and closed sockets
+    # - socket  # Opened and closed sockets
     - user    # User information

   # How often datasets send state updates with the
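
For readers unfamiliar with the runtime symbols in the perf output: runtime.mapaccess2_fast64 is the routine behind a `v, ok := m[k]` lookup on a map with 64-bit keys, and runtime.aeshash64 hashes such keys. The sketch below shows how an expiry pass over tracked entries can generate exactly that mix of map lookups and time.Time.Before calls. It is purely illustrative: the `entry` and `tracker` types and all method names are hypothetical and are not Auditbeat's actual implementation.

```go
package main

import (
	"fmt"
	"time"
)

// entry is a hypothetical tracked socket (not Auditbeat's real type).
type entry struct {
	lastSeen time.Time
}

// tracker indexes entries by a 64-bit ID and remembers insertion order.
// Callers are assumed to add entries in non-decreasing lastSeen order.
type tracker struct {
	byID  map[uint64]*entry
	order []uint64 // IDs, oldest first
}

func newTracker() *tracker {
	return &tracker{byID: make(map[uint64]*entry)}
}

func (t *tracker) add(id uint64, seen time.Time) {
	t.byID[id] = &entry{lastSeen: seen}
	t.order = append(t.order, id)
}

// expireOlder drops entries last seen before deadline and returns how many
// were removed. It performs one map lookup and one Before call per visited ID.
func (t *tracker) expireOlder(deadline time.Time) int {
	removed := 0
	i := 0
	for ; i < len(t.order); i++ {
		id := t.order[i]
		e, ok := t.byID[id] // comma-ok lookup on a uint64-keyed map
		if !ok {
			continue // entry was already removed elsewhere
		}
		if !e.lastSeen.Before(deadline) {
			break // everything after this point is newer
		}
		delete(t.byID, id)
		removed++
	}
	t.order = t.order[i:] // forget the IDs that were just processed
	return removed
}

// demo tracks three entries and expires the two oldest ones.
func demo() (removed, remaining int) {
	base := time.Unix(0, 0)
	t := newTracker()
	t.add(1, base)
	t.add(2, base.Add(time.Second))
	t.add(3, base.Add(10*time.Second))
	removed = t.expireOlder(base.Add(5 * time.Second))
	return removed, len(t.byID)
}

func main() {
	removed, remaining := demo()
	fmt.Println("removed:", removed, "remaining:", remaining)
}
```

In a loop of this shape the cost per pass grows with the number of IDs visited, so a pass that runs very often, or that keeps revisiting the same IDs, could plausibly pin a core in the way the profile shows. Whether that is what happens inside the socket dataset cannot be determined from the perf output alone.
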

btnrsec commented 4 years ago

adriansr wrote: "Fix available in 7.8.0"

I have upgraded a client to Auditbeat 7.8.0 and am still experiencing the same issue (on Ubuntu 16.04.6 LTS). One client was upgraded from 7.6.1 (which did not have the socket issue) to 7.8.0 and is now showing high CPU usage. The workaround is still to comment out the socket dataset.

BenB196 commented 4 years ago

@adriansr could this issue be reopened as the issue does not appear to be fixed in 7.8.0?

adriansr commented 4 years ago

Reopening.

Can someone please run Auditbeat with -httpprof :8080 and, once it's using 100% CPU, run curl 'http://localhost:8080/debug/pprof/profile?seconds=30' -o profile.prof and share the resulting profile.prof binary file?
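
For context, the /debug/pprof/profile path in that curl command is the CPU profile endpoint that Go's standard net/http/pprof package registers. The sketch below exercises the same endpoint in-process, without opening a network port. It assumes, without confirmation from this thread, that Auditbeat's -httpprof flag exposes the standard handlers; the function name `fetchCPUProfile` is hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	_ "net/http/pprof" // side effect: registers /debug/pprof/* on http.DefaultServeMux
)

// fetchCPUProfile asks the in-process pprof handler for a CPU profile covering
// the given number of seconds and returns the HTTP status and the raw body.
func fetchCPUProfile(seconds int) (int, []byte) {
	target := fmt.Sprintf("/debug/pprof/profile?seconds=%d", seconds)
	req := httptest.NewRequest(http.MethodGet, target, nil)
	rec := httptest.NewRecorder()
	// The handler samples the process for roughly `seconds` before returning.
	http.DefaultServeMux.ServeHTTP(rec, req)
	return rec.Code, rec.Body.Bytes()
}

func main() {
	status, body := fetchCPUProfile(1)
	fmt.Printf("status=%d, profile bytes=%d\n", status, len(body))
}
```

A profile file saved by the curl command can then be inspected with `go tool pprof profile.prof`.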

BenB196 commented 4 years ago

@adriansr Here are 3 servers with the issue. Attached zip file contains the 3 profiles:

Server A:

#uname -a
Linux assetmgmt01 4.12.14-lp150.12.82-default #1 SMP Tue Nov 12 16:32:38 UTC 2019 (c939e24) x86_64 x86_64 x86_64 GNU/Linux

#cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.0"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.0"
PRETTY_NAME="openSUSE Leap 15.0"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.0"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

Server B:

#uname -a
Linux dmiml01-stg 4.12.14-lp150.12.82-default #1 SMP Tue Nov 12 16:32:38 UTC 2019 (c939e24) x86_64 x86_64 x86_64 GNU/Linux

#cat /etc/os-release
NAME="openSUSE Leap"
VERSION="15.0"
ID="opensuse-leap"
ID_LIKE="suse opensuse"
VERSION_ID="15.0"
PRETTY_NAME="openSUSE Leap 15.0"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:opensuse:leap:15.0"
BUG_REPORT_URL="https://bugs.opensuse.org"
HOME_URL="https://www.opensuse.org/"

Server C:

#uname -a
Linux dnsdist 4.18.0-147.5.1.el8_1.x86_64 #1 SMP Wed Feb 5 02:00:39 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

#cat /etc/os-release
NAME="CentOS Linux"
VERSION="8 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="CentOS Linux 8 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:8"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-8"
CENTOS_MANTISBT_PROJECT_VERSION="8"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="8"

auditbeat_profiles.zip

wixaw commented 4 years ago

Hello, 7.8 doesn't fix this issue for me. I'd also like to mention that commenting out "socket" reduced the CPU usage, but after a while it increased again, whereas with 7.5 Auditbeat was completely transparent on my servers. Now it only fails to work properly on servers running Apache. I'm still using the original configuration file.

image

Thanks

vinnytroia commented 4 years ago

same.

andrewstucki commented 4 years ago

@wixaw & @vinnytroia what versions of Auditbeat are you running? The fix for the bug I found shipped in 7.8.1, which was released on July 27th. I'm trying to determine whether this is another issue or whether you just need to upgrade to the latest patch version.

vinnytroia commented 4 years ago

Oh, I don't have 7.8.1. Let me try it and I will get back to you. Thanks.

wixaw commented 4 years ago

Hello, I had not seen the information in the 7.8.1 release. I installed 7.8.1 on my servers and have no more CPU issues. Thank you.

mileskelsey commented 3 years ago

I still see this problem in version 7.9.3

HaZet1968 commented 3 years ago

I still have the problem too (with version 7.9.1), on machines with a lot of network traffic (e.g. Squid, web servers).