StamusNetworks / SELKS

A Suricata based IDS/IPS/NSM distro
https://www.stamus-networks.com/open-source/#selks
GNU General Public License v3.0
1.48k stars 285 forks

🐞🐋Newly updated dockerized suricata segmentation fault on high traffic ... #475

Closed ulysse31 closed 3 months ago

ulysse31 commented 4 months ago

Is there an existing issue for this?

Current Behavior

After updating two SELKS nodes, the latest Suricata Docker image seems to crash-loop with a segmentation fault, depending on the amount of traffic to analyze. I tried wiping the suricata container and its related data; that did not help (the suricata container still crash-loops). I then wiped all SELKS containers and their data; that did not help either (the suricata container still crash-loops).

In dmesg on the host I get this:

[Fri Jul 26 10:44:31 2024] W#06-bond1[78735]: segfault at 0 ip 00000000009349a9 sp 00007f853fffc270 error 4 in suricata[4d4000+637000] likely on CPU 22 (core 14, socket 0)
[Fri Jul 26 10:44:31 2024] Code: 74 24 50 48 85 f6 74 0b ba 01 00 00 00 ff 15 76 8c 44 00 48 89 df e8 06 06 ba ff 0f 0b 0f 1f 40 00 48 83 ec 18 48 85 d2 74 38 <0f> b6 06 89 c1 83 e1 1f 41 b8 01 00 00 00 83 f9 1f 75 5b 48 83 fa

In loop ...

and the last lines of the container log before it restarts are:

Perf: af-packet: bond1: rx ring: block_size=32768 block_nr=2 frame_size=1600 frame_nr=40 [AFPComputeRingParams:source-af-packet.c:1598]
Perf: af-packet: bond1: rx ring: block_size=32768 block_nr=2 frame_size=1600 frame_nr=40 [AFPComputeRingParams:source-af-packet.c:1598]
Perf: af-packet: bond1: rx ring: block_size=32768 block_nr=2 frame_size=1600 frame_nr=40 [AFPComputeRingParams:source-af-packet.c:1598]
Perf: af-packet: bond1: rx ring: block_size=32768 block_nr=2 frame_size=1600 frame_nr=40 [AFPComputeRingParams:source-af-packet.c:1598]
Notice: threads: Threads created -> W: 64 FM: 1 FR: 1 Engine started. [TmThreadWaitOnThreadRunning:tm-threads.c:1905]

UPDATE: after digging, it seems that both of my SELKS instances have the dockerized Suricata version that crashes; one crashes almost every minute because of the amount of traffic to analyze ... Also, it seems that the SELKS project is using the "master-amd64" image, which, according to the suricata docker GitHub, is the "latest code dev version available" ... which does not seem a particularly "wise" choice for stability? If you have any idea of a debug command that would give me more hints, I would really appreciate it ^^' Thanks a lot

Expected Behavior

After upgrade have a suricata container that works ...

Steps To Reproduce

  1. have SELKS installed on a host that will listen to a bond interface
  2. use the latest version of suricata
  3. have a lot of traffic (crashes after about a minute)

Docker version

Docker version 27.0.3, build 7d4bcd8

Docker Compose version

Docker Compose version v2.28.1

OS Version

Debian GNU/Linux 12 (bookworm)

Content of the environment file

COMPOSE_PROJECT_NAME=selks
INTERFACES= -i bond1
ELASTIC_MEMORY=64G
SCIRIUS_SECRET_KEY=

Version of SELKS

commit 4af455cd15f69f2ba471fa6cd0b96d6aae6e93b9 (HEAD -> master, origin/master, origin/HEAD)
Author: Peter Manev pmanev@stamus-networks.com
Date: Thu Jun 13 13:18:18 2024 +0200

docker: Add Logstash/Kibana docker versions

Anything else?

As always, thanks for your help ^^'

ulysse31 commented 4 months ago

UPDATE:

I thought it might be related to bonding ... but it seems that it also segfaults on the other server, which uses a direct (non-bonded) interface:

[Fri Jul 26 06:24:41 2024] W#09-eno2np1[3532764]: segfault at 0 ip 00000000009349a9 sp 00007f1f5fffc270 error 4 in suricata[4d4000+637000] likely on CPU 4 (core 1, socket 0)
[Fri Jul 26 06:24:41 2024] Code: 74 24 50 48 85 f6 74 0b ba 01 00 00 00 ff 15 76 8c 44 00 48 89 df e8 06 06 ba ff 0f 0b 0f 1f 40 00 48 83 ec 18 48 85 d2 74 38 <0f> b6 06 89 c1 83 e1 1f 41 b8 01 00 00 00 83 f9 1f 75 5b 48 83 fa

This one is in the New York time zone (the other one is in the Paris time zone). So it segfaults on both ... but the big difference is potentially the bandwidth: one has a single 10 Gbps interface, the other a bond of two 10 Gbps interfaces because of the traffic volume. So, to reformulate: the latest version of the Suricata Docker image seems to segfault on high traffic (average 20 MBytes/s on bond1). The other one, in New York, is at around 2/3 MBytes/s right now (low activity / early morning).

ulysse31 commented 4 months ago

UPDATE2:

Confirmed after traffic waking up in New York ...

[Fri Jul 26 08:10:34 2024] W#31-eno2np1[3671915]: segfault at 0 ip 00000000009349a9 sp 00007f31ad4f1270 error 4 in suricata[4d4000+637000] likely on CPU 6 (core 6, socket 0)
[Fri Jul 26 08:10:34 2024] Code: 74 24 50 48 85 f6 74 0b ba 01 00 00 00 ff 15 76 8c 44 00 48 89 df e8 06 06 ba ff 0f 0b 0f 1f 40 00 48 83 ec 18 48 85 d2 74 38 <0f> b6 06 89 c1 83 e1 1f 41 b8 01 00 00 00 83 f9 1f 75 5b 48 83 fa

It seems that the dockerized Suricata no longer supports high traffic and crashes under it ...
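The three kernel records in this thread report the same fault address (0) and the same instruction pointer (9349a9), and those fields can be read off mechanically. As a minimal sketch, assuming a POSIX shell with a sed that supports -E and the usual x86 page-fault error-code bits (1 = protection fault rather than missing page, 2 = write, 4 = user mode), the helper below decodes one record. `decode_segfault` is a hypothetical name and is not part of SELKS or Suricata.

```shell
#!/bin/sh
# Hypothetical helper (not part of SELKS or Suricata): decode one kernel
# "segfault at ..." record as printed by dmesg on x86.
decode_segfault() (
    # Pull out: fault address, instruction pointer, error code, object, mapping base.
    fields=$(printf '%s\n' "$1" | sed -n -E \
        's/.*segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp [0-9a-f]+ error ([0-9a-f]+) in ([^[]+)\[([0-9a-f]+)\+[0-9a-f]+\].*/\1 \2 \3 \4 \5/p')
    [ -n "$fields" ] || { echo "no segfault record found" >&2; return 1; }
    set -- $fields
    addr=$((0x$1)); ip=$((0x$2)); err=$((0x$3)); obj=$4; base=$((0x$5))

    # Assumed x86 page-fault error-code bits: 1 = protection, 2 = write, 4 = user mode.
    if [ $((err & 1)) -ne 0 ]; then page=protection; else page=not-present; fi
    if [ $((err & 2)) -ne 0 ]; then access=write; else access=read; fi
    if [ $((err & 4)) -ne 0 ]; then mode=user; else mode=kernel; fi

    printf 'fault_addr=0x%x access=%s mode=%s page=%s\n' "$addr" "$access" "$mode" "$page"
    printf 'object=%s ip=0x%x map_base=0x%x offset=0x%x\n' "$obj" "$ip" "$base" $((ip - base))
)
```

Fed any of the records above, this reports a user-mode read of address 0 on a page that is not present, which looks like a NULL-pointer read. The offset printed is simply ip minus the mapping base (0x9349a9 - 0x4d4000 = 0x4609a9); which of the two values a symbolizer needs depends on how the binary was built, so treat it as a starting point only.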

pevma commented 4 months ago

How often does this happen?
What is the output of :

docker exec suricata suricata --build-info

ulysse31 commented 4 months ago

UPDATE3:

Updated the title, since I can now confirm that the segmentation fault / crash appears above a certain level of traffic on both of my test systems ... I've tried master-amd64, master-profiling, master ... they all crash-loop with the same segmentation fault on high traffic ...

ulysse31 commented 4 months ago

How often does this happen? What is the output of :

docker exec suricata suricata --build-info 

Hello ! Thanks for your reply, here is the output :

This is Suricata version 8.0.0-dev (7f6c963ac 2024-07-20)
Features: NFQ PCAP_SET_BUFF AF_PACKET HAVE_PACKET_FANOUT LIBCAP_NG LIBNET1.1 HAVE_HTP_URI_NORMALIZE_HOOK PCRE_JIT HAVE_NSS HTTP2_DECOMPRESSION HAVE_LUA HAVE_JA3 HAVE_JA4 HAVE_LIBJANSSON TLS TLS_C11 MAGIC RUST POPCNT64
SIMD support: SSE_4_2 SSE_4_1 SSE_3 SSE_2
Atomic intrinsics: 1 2 4 8 16 byte(s)
64-bits, Little-endian architecture
GCC version 11.4.1 20231218 (Red Hat 11.4.1-3), C version 201112
compiled with _FORTIFY_SOURCE=0
L1 cache line size (CLS)=64
thread local storage method: _Thread_local
compiled with LibHTP v0.5.48, linked against LibHTP v0.5.48

Suricata Configuration:
  AF_PACKET support: yes
  AF_XDP support: no
  DPDK support: yes
  eBPF support: yes
  XDP support: yes
  PF_RING support: no
  NFQueue support: yes
  NFLOG support: no
  IPFW support: no
  Netmap support: no
  DAG enabled: no
  Napatech enabled: no
  WinDivert enabled: no

  Unix socket enabled: yes
  Detection enabled: yes

  Libmagic support: yes
  libjansson support: yes
  hiredis support: yes
  hiredis async with libevent: yes
  PCRE jit: yes
  GeoIP2 support: yes
  JA3 support: yes
  JA4 support: yes
  Non-bundled htp: no
  Hyperscan support: yes
  Libnet support: yes
  liblz4 support: yes
  Landlock support: yes
  Systemd support: yes

  Rust support: yes
  Rust strict mode: no
  Rust compiler path: /usr/bin/rustc
  Rust compiler version: rustc 1.75.0 (82e1608df 2023-12-21) (Red Hat 1.75.0-1.el9)
  Cargo path: /usr/bin/cargo
  Cargo version: cargo 1.75.0

  Python support: yes
  Python path: /usr/bin/python3
  Install suricatactl: yes
  Install suricatasc: yes
  Install suricata-update: yes

  Profiling enabled: no
  Profiling locks enabled: no
  Profiling rules enabled: no

  Plugin support (experimental): yes
  DPDK Bond PMD: no

Development settings:
  Coccinelle / spatch: no
  Unit tests enabled: no
  Debug output enabled: no
  Debug validation enabled: no
  Fuzz targets enabled: no

Generic build parameters:
  Installation prefix: /usr
  Configuration directory: /etc/suricata/
  Log directory: /var/log/suricata/

  --prefix /usr
  --sysconfdir /etc
  --localstatedir /var
  --datarootdir /usr/share

  Host: x86_64-pc-linux-gnu
  Compiler: gcc (exec name) / g++ (real)
  GCC Protect enabled: no
  GCC march native enabled: no
  GCC Profile enabled: no
  Position Independent Executable enabled: no
  CFLAGS -g -O2 -fPIC -std=c11 -I/usr/include/dpdk -include rte_config.h -march=corei7 -mrtm -I${srcdir}/../rust/gen -I${srcdir}/../rust/dist -I../rust/gen
  PCAP_CFLAGS
  SECCFLAGS

As I said earlier, according to their GitHub, the "master-amd64" tag used is the latest available code on their latest GitHub code branch ... and I suspect that code is currently broken ...

ulysse31 commented 4 months ago

How often does this happen? What is the output of :

docker exec suricata suricata --build-info 

Oh and for the frequency => it depends on the traffic amount :

Hope this helps.

pevma commented 3 months ago

You can switch to the latest Suricata build like so:

So the only change you need to make is master->latest on the line - https://github.com/StamusNetworks/SELKS/blob/master/docker/compose.yml#L107
Then update the dockers like so: https://github.com/StamusNetworks/SELKS/wiki/Docker#upgrade-all-containers
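The change described above can be sketched as a small shell helper. This is only an illustration: `switch_suricata_tag` is a hypothetical function, and it assumes the Suricata service in compose.yml has an image line that ends in a tag starting with master (for example master-amd64, the tag named earlier in this thread). It prints the modified file instead of editing in place, so the result can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: print a compose file with the Suricata image tag
# switched from master* to latest. Assumes an image line of the form
#   image: <repository>/suricata:master-amd64
# Lines that do not mention suricata are passed through unchanged.
switch_suricata_tag() {
    sed -E '/image:.*suricata/ s/:master(-[a-z0-9]+)?[[:space:]]*$/:latest/' "$1"
}

# Possible usage (review the diff before replacing the file):
#   switch_suricata_tag docker/compose.yml > compose.yml.new
#   diff -u docker/compose.yml compose.yml.new
# Afterwards the containers have to be recreated; the wiki page linked
# above is the authoritative procedure for that step.
```

If the image line in a given checkout is written differently (for instance with a variable instead of a literal tag), the pattern will not match and the file comes out unchanged, so an empty diff is the signal to edit the line by hand.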

ulysse31 commented 3 months ago

You can switch to the latest Suricata build like so:

So the only change you need to make is master->latest on the line - https://github.com/StamusNetworks/SELKS/blob/master/docker/compose.yml#L107 Then update the dockers like so: https://github.com/StamusNetworks/SELKS/wiki/Docker#upgrade-all-containers

Hello,

Yes, it works again now, but please also mention two important details:

  1. You must edit suricata.yaml and replace all occurrences of MiB with mB (or mb), because the size syntax differs between Suricata 8.0 dev (master) and 7.0 (latest); otherwise Suricata exits with an error at startup.
  2. This means the current install documentation and install procedure are broken: we have to edit compose.yml and suricata.yaml ourselves to manually "patch" the current setup.

Other than that ... we are good ^^'
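A minimal sketch of the MiB replacement mentioned above, assuming the unit suffix is the only thing that needs to change: the hypothetical `mib_to_mb` filter rewrites MiB to mb wherever it directly follows a number (with or without a space) and leaves everything else alone. It was not taken from SELKS; compare the output with the original suricata.yaml before using it.

```shell
#!/bin/sh
# Hypothetical filter: rewrite size values such as "64 MiB" or "128MiB"
# to "64 mb" / "128mb". Only MiB that follows a digit is touched.
mib_to_mb() {
    sed -E 's/([0-9][[:space:]]*)MiB/\1mb/g'
}

# Possible usage (writes a new file, leaves the original untouched):
#   mib_to_mb < suricata.yaml > suricata.yaml.new
#   diff -u suricata.yaml suricata.yaml.new
```

Writing to a separate file keeps the original available in case the deployment is later switched back to the master image.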

ulysse31 commented 3 months ago

Side note: if Suricata 8.0 dev is NOT necessary to run SELKS, why does the current install use a "dev" version that is prone to this kind of code error / crash? Why not always use latest? Shouldn't the documentation and the actual setup script and config be corrected to use it? Wouldn't that be a more stable solution? Thanks for your answer.

pevma commented 3 months ago

For reference, so we can chase it down: can you give some examples of where exactly in the config you needed to edit the MiB occurrences?

Also, you should probably have a core file inside the suricata docker - is this the case? You can use find inside the docker to see if one exists so we can try to trace the reason for the segfault.
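As a sketch of the search suggested above, assuming core dumps keep the conventional names core or core.<pid> (the actual name and location depend on the host's kernel.core_pattern setting and on the core size limit, so this may well find nothing): `find_core_files` is a hypothetical helper, and the docker exec line in the comment assumes the container is named suricata, as in the command earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: list regular files named "core" or "core.<digits...>"
# under a directory (default /), skipping the /proc and /sys pseudo-filesystems.
find_core_files() {
    find "${1:-/}" \( -path /proc -o -path /sys \) -prune -o \
        -type f \( -name core -o -name 'core.[0-9]*' \) -print 2>/dev/null
}

# Possible one-off equivalent run from the host against the container:
#   docker exec suricata find / -type f \( -name core -o -name 'core.[0-9]*' \)
```

Mounted volumes such as the log directory are searched too, since the helper does not restrict itself to a single filesystem.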

ulysse31 commented 3 months ago

Hello,

The MiB unit is not understood by Suricata latest (7.0), so every setting in suricata.yaml that uses it is rejected. You can try it yourself by modifying compose.yml to use latest instead of master-amd64. With Suricata latest, the config error does not cause a segmentation fault but a fatal error that makes Suricata quit; since the Docker container is configured to restart when the process quits, it boot-loops.

The segmentation fault happens on high traffic with the Docker image "master-amd64", which, again, is the Suricata 8.0 dev version (current compiled code from GitHub). I also reported this error on the suricata docker GitHub, where an issue is open about it. Again, since master-amd64 is the dev code, it does not surprise me that the compiled code is "sometimes" buggy.

The segmentation fault clearly seems to be linked to the amount of traffic: since only one of the two servers was crashing at first, I thought it was related to that one using a bond interface, but once activity went up on the one that had not crashed, it also crashed under the load ...

I'll look into the folder to see if I still have a core file somewhere (I cleaned up after getting it working again). Anyway, I would really like to know why SELKS uses the dev / unstable version of Suricata instead of the latest / stable version. Thanks

pevma commented 3 months ago

Aha, understood: you mean the original suricata.yaml had MiB in it, and latest Suricata complains about it. OK, but I would expect the original suricata.yaml to be replaced by the one from latest during the docker pull/update - did that not happen?

SELKS has always (for about 10 years now) used the latest Suricata by default, to showcase the newest features and latest additions. Since it is Docker-based, it is very easy to switch to any desired Suricata version, though. This is the first segfault I remember being reported, which is why I asked whether you could find the core file.

Thanks for reporting it !

ulysse31 commented 3 months ago

As you said, "using the latest" ... but in fact it's not using the latest stable ^^' it's using the unstable dev version. And no, it did not replace it: keep in mind that suricata.yaml is NOT inside the container ^^ So following the upgrade instructions (down, pull, then up -d) won't change it ^^'
UPDATE: hmm ... it did replace the suricata.yaml on the second node I updated today ... strange.

ulysse31 commented 3 months ago

So my question still stands ^^' Why is SELKS using the unstable dev branch of Suricata and not the latest stable (7.0)? EDIT: let me try to reformulate => which feature is so crucial, and only available in 8.0 dev, that it is worth the risk of using the unstable dev version instead of the latest stable?

ulysse31 commented 3 months ago

Just FYI: before we discussed this, I contacted Jason Ish from Suricata to let him know that the master-amd64 Docker image was segfaulting on "high" traffic (from 20 MByte/s up). He didn't seem surprised at all, and told me that the master branch is the dev version of Suricata: the image is built from the latest code, which may contain unstable code. He also told me that fixing an issue in it would take some days ... SELKS is really a great project, but I'm just worried that it uses unstable code by default in a project aimed at production use ...

ulysse31 commented 3 months ago

UPDATE: seems that suricata docker master branch image (8.0 dev) was updated 11 hours ago ... maybe the issue is now fixed ? ^^'

pevma commented 3 months ago

Yes, SELKS is running the latest master/dev, as mentioned here: https://github.com/StamusNetworks/SELKS/issues/475#issuecomment-2253757117
However, we will change it as you noted, as it was left over from previous SELKS deployments/versions.

ulysse31 commented 3 months ago

Thanks a lot! I'm trying my best to implement SELKS here, and possibly looking at a pro version later on if people get convinced ^^' I suppose we can assume everything is OK now. Have a great day & week ^^

pevma commented 3 months ago

Thanks! Glad it worked out !