StamusNetworks / SELKS

A Suricata based IDS/IPS/NSM distro
https://www.stamus-networks.com/open-source/#selks
GNU General Public License v3.0
1.49k stars 285 forks

Suricata provides no data after some days #331

Open eglyn opened 3 years ago

eglyn commented 3 years ago

Hi all,

I have a dedicated server running SELKS, and everything works great except that after some days there is no data on any dashboard :/ When I check the health status, 2 services are down:

Here is the complete log:

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● molochviewer-selks.service - Moloch Viewer
   Loaded: loaded (/etc/systemd/system/molochviewer-selks.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2021-08-19 13:42:09 CEST; 30s ago
  Process: 5540 ExecStart=/bin/sh -c /data/moloch/bin/node viewer.js -c /data/moloch/etc/config.ini >> /data/moloch/logs/viewer.log 2>&1 (code=exited, status=1/FAILURE)
 Main PID: 5540 (code=exited, status=1/FAILURE)
● molochpcapread-selks.service - Moloch Pcap Read
   Loaded: loaded (/etc/systemd/system/molochpcapread-selks.service; enabled; vendor preset: enabled)
   Active: activating (auto-restart) (Result: exit-code) since Thu 2021-08-19 13:42:07 CEST; 31s ago
  Process: 5537 ExecStart=/bin/sh -c /data/moloch/bin/moloch-capture -c /data/moloch/etc/config.ini -m --copy --delete -R /data/nsm/  >> /data/moloch/logs/capture.log 2>&1 (code=exited, status=1/FAILURE)
 Main PID: 5537 (code=exited, status=1/FAILURE)

If I reboot the server, everything comes back to normal for a few days.

Any ideas?
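The health-check output above is plain `systemctl status` text, so the failing units can be picked out mechanically. A minimal sketch, assuming the usual `● unit.service - Description` header followed by an `Active:` line; `unhealthy_units` is a hypothetical helper, not part of SELKS.

```python
from __future__ import annotations

import re

# Unit header such as "● molochviewer-selks.service - Moloch Viewer".
UNIT_RE = re.compile(r"^[●*]?\s*(\S+\.service)\s+-\s+")
# "Active:" line, capturing the state and the first parenthesised sub-state.
ACTIVE_RE = re.compile(r"^\s*Active:\s+(\S+)(?:\s+\(([^)]*)\))?")


def unhealthy_units(status_text: str) -> dict[str, str]:
    """Return {unit: state} for every unit whose Active state is not 'active'."""
    result: dict[str, str] = {}
    current = None
    for line in status_text.splitlines():
        unit = UNIT_RE.match(line)
        if unit:
            current = unit.group(1)
            continue
        active = ACTIVE_RE.match(line)
        if active and current:
            state, sub = active.group(1), active.group(2)
            if state != "active":
                result[current] = f"{state} ({sub})" if sub else state
            current = None  # only the first Active: line belongs to the unit
    return result
```

Fed the log above, it should report the two Moloch units, in `activating (auto-restart)` or `failed (Result: exit-code)` depending on when the check ran.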

pevma commented 3 years ago

Do you use Moloch? Can you paste the full output of selks-health-check_stamus?

eglyn commented 3 years ago

Full log:

● suricata.service - LSB: Next Generation IDS/IPS
   Loaded: loaded (/etc/init.d/suricata; generated)
   Active: active (running) since Thu 2021-08-19 13:38:43 CEST; 50min ago
     Docs: man:systemd-sysv-generator(8)
  Process: 5356 ExecStart=/etc/init.d/suricata start (code=exited, status=0/SUCCESS)
    Tasks: 14 (limit: 4915)
   Memory: 2.5G
   CGroup: /system.slice/suricata.service
           └─5363 /usr/bin/suricata -c /etc/suricata/suricata.yaml --pidfile /var/run/suricata.pid --af-packet -D -v --user=logstash

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-08-19 13:34:37 CEST; 55min ago
     Docs: https://www.elastic.co
 Main PID: 4714 (java)
    Tasks: 125 (limit: 4915)
   Memory: 37.1G
   CGroup: /system.slice/elasticsearch.service
           ├─4714 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile…
           └─4915 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● logstash.service - logstash
   Loaded: loaded (/etc/systemd/system/logstash.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-08-16 09:14:22 CEST; 3 days ago
 Main PID: 512 (java)
    Tasks: 56 (limit: 4915)
   Memory: 1.8G
   CGroup: /system.slice/logstash.service
           └─512 /usr/share/logstash/jdk/bin/java -Xms1g -Xmx1g -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.awt.headless=true -Dfile.encoding=…

Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,601][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,601][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,601][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,601][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,602][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,602][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,602][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,602][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,602][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Aug 19 14:29:41 TSFE-SV-SELKS logstash[512]: [2021-08-19T14:29:41,602][WARN ][logstash.outputs.elasticsearch][main][e55f734d663b7fb7ca21a05c69227f334d0c6198948f303fac6e50c03be43b13] Could not index ev…
Hint: Some lines were ellipsized, use -l to show in full.
● kibana.service - Kibana
   Loaded: loaded (/etc/systemd/system/kibana.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2021-08-19 13:34:37 CEST; 55min ago
     Docs: https://www.elastic.co
 Main PID: 5039 (node)
    Tasks: 18 (limit: 4915)
   Memory: 439.3M
   CGroup: /system.slice/kibana.service
           ├─5039 /usr/share/kibana/bin/../node/bin/node /usr/share/kibana/bin/../src/cli/dist --logging.dest=/var/log/kibana/kibana.log --pid.file=/run/kibana/kibana.pid
           └─5075 /usr/share/kibana/node/bin/node --preserve-symlinks-main --preserve-symlinks /usr/share/kibana/src/cli/dist --logging.dest=/var/log/kibana/kibana.log --pid.file=/run/kibana/kibana.pid

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● evebox.service - EveBox Server
   Loaded: loaded (/lib/systemd/system/evebox.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2021-08-16 09:14:22 CEST; 3 days ago
 Main PID: 511 (evebox)
    Tasks: 9 (limit: 4915)
   Memory: 5.8M
   CGroup: /system.slice/evebox.service
           └─511 /usr/bin/evebox server

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● molochviewer-selks.service - Moloch Viewer
   Loaded: loaded (/etc/systemd/system/molochviewer-selks.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2021-08-19 13:43:39 CEST; 46min ago
  Process: 5540 ExecStart=/bin/sh -c /data/moloch/bin/node viewer.js -c /data/moloch/etc/config.ini >> /data/moloch/logs/viewer.log 2>&1 (code=exited, status=1/FAILURE)
 Main PID: 5540 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
● molochpcapread-selks.service - Moloch Pcap Read
   Loaded: loaded (/etc/systemd/system/molochpcapread-selks.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Thu 2021-08-19 13:43:38 CEST; 46min ago
  Process: 5537 ExecStart=/bin/sh -c /data/moloch/bin/moloch-capture -c /data/moloch/etc/config.ini -m --copy --delete -R /data/nsm/  >> /data/moloch/logs/capture.log 2>&1 (code=exited, status=1/FAILURE)
 Main PID: 5537 (code=exited, status=1/FAILURE)

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
scirius                          RUNNING   pid 5082, uptime 0:55:02
ii  elasticsearch                   7.13.4                       amd64        Distributed RESTful search engine built for the cloud
ii  elasticsearch-curator           5.8.4                        amd64        Have indices in Elasticsearch? This is the tool for you!\n\nLike a museum curator manages the exhibits and collections on display, \nElasticsearch Curator helps you curate, or manage your indices.
ii  evebox                          1:0.14.0                     amd64        no description given
ii  kibana                          7.13.4                       amd64        Explore and visualize your Elasticsearch data
ii  kibana-dashboards-stamus        2020122001                   amd64        Kibana 6 dashboard templates.
ii  logstash                        1:7.13.4-1                   amd64        An extensible logging pipeline
ii  moloch                          3.0.0-1                      amd64        Moloch Full Packet System
ii  scirius                         3.5.0-3                      amd64        Django application to manage Suricata ruleset
ii  suricata                        1:2021052601-0stamus0        amd64        Suricata open source multi-thread IDS/IPS/NSM system.
Filesystem       Type     Size    Used  Avail Use% Mounted on
udev             devtmpfs    32G       0   32G   0% /dev
tmpfs            tmpfs      6.3G    591M  5.7G  10% /run
/dev/md1         ext4       1.8T    829G  911G  48% /
tmpfs            tmpfs       32G       0   32G   0% /dev/shm
tmpfs            tmpfs      5.0M       0  5.0M   0% /run/lock
tmpfs            tmpfs       32G       0   32G   0% /sys/fs/cgroup
/dev/md0         ext4       463M     81M  354M  19% /boot
tmpfs            tmpfs      6.3G       0  6.3G   0% /run/user/1000

On SELKS after some days: [screenshot: empty dashboards]

eglyn commented 3 years ago

And on the Moloch URL I get: MaxRetryError at /moloch/ HTTPConnectionPool(host='localhost', port=8005): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3c78854c50>: Failed to establish a new connection: [Errno 111] Connection refused',))
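`[Errno 111] Connection refused` means nothing is listening on that port, which fits the Moloch viewer service being down. A minimal sketch to check the relevant ports directly; the port list (8005 for the viewer, 5601 for Kibana from the errors in this thread, 9200 as the Elasticsearch default) is an assumption about this setup, and `port_is_listening` is a hypothetical helper.

```python
import socket


def port_is_listening(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    A refused connection (errno 111) or a timeout both return False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # ConnectionRefusedError and socket.timeout are OSError subclasses
        return False


if __name__ == "__main__":
    for name, port in [("moloch viewer", 8005), ("kibana", 5601), ("elasticsearch", 9200)]:
        state = "up" if port_is_listening("localhost", port) else "DOWN"
        print(f"{name:15} {port:5} {state}")
```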

pevma commented 3 years ago

Just to double check: did the first time setup finish without a problem? (https://github.com/StamusNetworks/SELKS/wiki/First-time-setup)

I also noticed you could upgrade (post QA test :) ) (https://github.com/StamusNetworks/SELKS/wiki/SELKS-upgrades)

pevma commented 3 years ago

It could also be related to the disk filling up?

eglyn commented 3 years ago

Yes, the first time setup finished fine; SELKS works great for some days before crashing. I did all the updates, and it updated some apps and packages, but the issue is the same.

The disk is not full, but it reached the limit set in the Moloch config (config.ini); I suppose there is a logrotate for that, though ^^

eglyn commented 3 years ago

I see this in the Elasticsearch info: [screenshot]

eglyn commented 3 years ago

Maybe it is an issue with Suricata; it is stuck on "Fetching data": [screenshot]

pevma commented 3 years ago

If it happens once every 2 days or so, it can help to run a health check when it actually happens; that could make it easier to troubleshoot. Did you do an upgrade?

eglyn commented 3 years ago

Yes, I upgraded it, with no change. It is actually happening right now ^^ but the health check just shows the 2 Moloch services down.

pevma commented 3 years ago

From the report it seems you have 3.5.0-3 running; the current stable is 3.7.0-6, hence my note about upgrading.

pevma commented 3 years ago

I just noticed too that you are running the latest Moloch (3.0), so there might be some errors in the logs, possibly related to that upgrade path.

eglyn commented 3 years ago

That's weird, I already ran the update with sudo selks-upgrade_stamus.

And it stays at 3.5.0-3 :/

pevma commented 3 years ago

What is the output of: cat /etc/apt/sources.list.d/selks5.list

eglyn commented 3 years ago

I don't have any selks5.list, but I do have a selks6.list:

deb http://packages.stamus-networks.com/selks6/debian/ buster main
deb http://packages.stamus-networks.com/selks6/debian-kernel/ buster main
deb http://packages.stamus-networks.com/selks6/debian-test/ buster main
eglyn commented 3 years ago

I have these errors in viewer.log:

"rest_total_hits_as_int": true
} err: ResponseError: index_not_found_exception
    at onBody (/data/moloch/node_modules/@elastic/elasticsearch/lib/Transport.js:311:23)
    at IncomingMessage.onEnd (/data/moloch/node_modules/@elastic/elasticsearch/lib/Transport.js:240:11)
    at IncomingMessage.emit (events.js:412:35)
    at endReadableNT (internal/streams/readable.js:1317:12)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  meta: {
    body: { error: [Object], status: 404 },
    statusCode: 404,

And in the capture.log:

Aug 20 09:19:41 http.c:306 moloch_http_send_sync(): 1/1 SYNC 404 http://localhost:9200/_template/arkime_sessions3_template?filter_path=**._meta 0/2 0ms 2ms
Aug 20 09:19:41 db.c:2054 moloch_db_check(): ERROR - Couldn't load version information, database might be down or out of date.  Run "db/db.pl host:port upgrade"
Aug 20 09:21:11 main.c:202 parse_args(): WARNING: gethostname doesn't return a fully qualified name and getdomainname failed, this may cause issues when viewing pcaps, use the --host option - SERVERNAME
eglyn commented 3 years ago

If I run the Stamus upgrade (selks-upgrade_stamus), I get:

NOTE:
Depending on the size and how busy the system is the upgrade may take a while.
Starting the upgrade sequence...

Hit:1 http://security.debian.org/debian-security buster/updates InRelease
Hit:2 https://artifacts.elastic.co/packages/7.x/apt stable InRelease
Hit:3 http://packages.stamus-networks.com/selks6/debian buster InRelease
Hit:5 https://packages.elastic.co/curator/5/debian9 stable InRelease
Hit:6 http://packages.stamus-networks.com/selks6/debian-kernel buster InRelease
Hit:7 http://packages.stamus-networks.com/selks6/debian-test buster InRelease
Hit:4 https://files.evebox.org/evebox/debian stable InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
selks-scripts-stamus is already the newest version (2020121401).
0 upgraded, 0 newly installed, 0 to remove and 1 not upgraded.
NOTE:
Starting second stage upgrade sequence...

outputs.7.pcap-log.enabled = yes
Hit:1 http://security.debian.org/debian-security buster/updates InRelease
Hit:2 https://artifacts.elastic.co/packages/7.x/apt stable InRelease
Hit:3 http://packages.stamus-networks.com/selks6/debian buster InRelease
Hit:5 https://packages.elastic.co/curator/5/debian9 stable InRelease
Hit:6 http://packages.stamus-networks.com/selks6/debian-kernel buster InRelease
Hit:7 http://packages.stamus-networks.com/selks6/debian-test buster InRelease
Hit:4 https://files.evebox.org/evebox/debian stable InRelease
Reading package lists... Done
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
scirius: stopped
scirius: started

And it stays at 3.5.0-3.

If I check with apt list --upgradable I get:

scirius/unknown,unknown 3.7.0-6 amd64 [upgradable from: 3.5.0-3]

But if I try to upgrade with apt (without confirming), I get:

Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
regit commented 3 years ago

Hello, did you do an apt upgrade or an apt dist-upgrade?

eglyn commented 3 years ago

No, I only used selks-upgrade_stamus.

eglyn commented 3 years ago

If I try to go to the /kibana URL I get: HTTPConnectionPool(host='localhost', port=5601): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f3c205fc2b0>: Failed to establish a new connection: [Errno 111] Connection refused'))

pevma commented 3 years ago

Can you try apt-get upgrade only?

eglyn commented 3 years ago

I managed to upgrade scirius to 3.7.0-6; I had to change my sources.list config, and now it works with selks-upgrade_stamus.

But it changes nothing: molochpcapread-selks.service does not start, Kibana still shows the error above, and on the Suricata management web page everything is empty :/

eglyn commented 3 years ago

When I run this command:

/data/moloch/bin/moloch-capture -c /data/moloch/etc/config.ini -m --copy --delete -R /data/nsm/

I get this error:

ERROR - Couldn't load version information, database might be down or out of date.  Run "db/db.pl host:port upgrade"

I tried: db/db.pl host:port upgrade

And it says:

Couldn't PUT http://SERVER:9200/arkime_sequence_v30/_mapping?master_timeout=240s  the http status code is 404 are you sure elasticsearch is running/reachable?

Elasticsearch is running:

systemctl status elasticsearch
● elasticsearch.service - Elasticsearch
   Loaded: loaded (/lib/systemd/system/elasticsearch.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2021-08-20 10:43:21 CEST; 20min ago
     Docs: https://www.elastic.co
 Main PID: 756 (java)
    Tasks: 136 (limit: 4915)
   Memory: 38.4G
   CGroup: /system.slice/elasticsearch.service
           ├─ 756 /usr/share/elasticsearch/jdk/bin/java -Xshare:auto -Des.networkaddress.cache.ttl=60 -Des.networkaddress.cache.negative.ttl=10 -XX:+AlwaysPreTouch -Xss1m -Djava.awt.headless=true -Dfile.
           └─1116 /usr/share/elasticsearch/modules/x-pack-ml/platform/linux-x86_64/bin/controller

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

And the output of db.pl 127.0.0.1:9200 info:

./db.pl 127.0.0.1:9200 info
Cluster Name:            elasticsearch
ES Version:                     7.14.0
DB Version:                         66
ES Data Nodes:                       1/1
Sessions2 Indices:                   0
Sessions:                            0 (0 bytes)
History Indices:                     0
Histories:                           0 (0 bytes)
stats_v4:                            1 (37,157 bytes)
fields_v3:                         327 (71,845 bytes)
files_v6:                          200 (75,228 bytes)
users_v7:                            2 (8,326 bytes)
hunts_v2:                            0 (301 bytes)
dstats_v4:                       4,320 (2,559,621 bytes)
sequence_v3:                         1 (4,304 bytes)
regit commented 3 years ago

It looks like you got HTML back instead of JSON in the last test. Do you need to specify the port?
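One quick way to check whether an endpoint returns JSON or an HTML page is to fetch it and try to parse the body. A minimal sketch using only the standard library; `classify_body` and `probe` are hypothetical helpers, and the HTML test is a rough heuristic rather than a full content-type check.

```python
from __future__ import annotations

import json
import urllib.error
import urllib.request


def classify_body(body: str) -> str:
    """Return 'json' if body parses as JSON, 'html' if it looks like markup, else 'other'."""
    try:
        json.loads(body)
        return "json"
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        pass
    if body.lstrip().lower().startswith(("<!doctype html", "<html")):
        return "html"
    return "other"


def probe(url: str, timeout: float = 5.0) -> tuple[int, str]:
    """GET url and return (HTTP status, body classification).

    Connection errors (e.g. nothing listening) propagate as URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, classify_body(resp.read().decode("utf-8", "replace"))
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a body worth classifying.
        return err.code, classify_body(err.read().decode("utf-8", "replace"))
```

Against a healthy Elasticsearch node, `probe("http://127.0.0.1:9200/")` should come back as JSON; an HTML body on that port would suggest something other than Elasticsearch is answering.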

eglyn commented 3 years ago

Are you talking about db.pl 127.0.0.1:9200 upgrade?

I think I have to specify the port (it is the Elasticsearch port); if I don't, I get an error right away.

eglyn commented 3 years ago

Moloch is looking for http://SERVER:9200/arkime_sequence_v30/. Why is it looking for an index named arkime_sequence_v30?

eglyn commented 3 years ago

OK, Moloch works now; I had to run db.pl 127.0.0.1 init...

And I found another issue with Kibana and Elasticsearch: I was stuck at 1000 shards:

Please check the health of your Elasticsearch cluster and try again. Error: [validation_exception]: Validation Failed: 1: this action would add [2] shards, but this cluster currently has [1000]/[1000] maximum normal shards open

I increased the max shards to 5000 and everything works, but is there a way to keep the issue from coming back? (it would get stuck at 5000...)
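Raising the limit only buys time if new daily indices keep adding shards. A minimal back-of-the-envelope sketch, assuming Elasticsearch 7.x defaults (`cluster.max_shards_per_node` = 1000 per data node, 1 primary and 1 replica per index, which matches "would add [2] shards" in the error); the number of daily indices SELKS creates is a hypothetical input, and as far as I know unassigned replicas on a single node still count toward the limit.

```python
def days_until_shard_limit(indices_per_day: int,
                           primaries_per_index: int = 1,
                           replicas: int = 1,
                           data_nodes: int = 1,
                           max_shards_per_node: int = 1000,
                           open_shards_now: int = 0) -> int:
    """Whole days of new daily indices that still fit under the cluster shard limit."""
    shards_per_day = indices_per_day * primaries_per_index * (1 + replicas)
    if shards_per_day <= 0:
        raise ValueError("need at least one shard per day")
    limit = max_shards_per_node * data_nodes
    remaining = limit - open_shards_now
    if remaining <= 0:
        return 0
    return remaining // shards_per_day
```

For example, with a hypothetical 10 daily indices, the default limit lasts 50 days from an empty cluster and a 5000 limit lasts 250. On a one-node cluster, dropping replicas to 0 or deleting old indices (elasticsearch-curator is in the package list above) presumably addresses the growth itself rather than postponing it.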

pevma commented 3 years ago

What size of data/volume do you have? Is it still a one-node cluster?

eglyn commented 3 years ago

The disk is a 2 TB RAID 1 SSD, 90% full.

I have set the Moloch config.ini to 10% free space, 10 GB max file size and 30 min.

And yes, I have only one node.

pevma commented 3 years ago

In that case I suspect ES hits the disk watermark (full disk?); check /var/log/elasticsearch/elasticsearch.log. See https://stackoverflow.com/questions/50609417/elasticsearch-error-cluster-block-exception-forbidden-12-index-read-only-all

If that is the case, it means you are generating data faster than the disk can hold, and you might need to lower the retention or use a bigger disk.
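To see which watermark a given disk usage corresponds to, here is a minimal sketch assuming the Elasticsearch 7.x default `cluster.routing.allocation.disk.watermark.*` values (low 85%, high 90%, flood_stage 95%); check `elasticsearch.yml` or the cluster settings in case they were overridden. `watermark_reached` and `usage_of` are hypothetical helpers.

```python
import shutil
from typing import Optional

# Elasticsearch 7.x default disk watermarks, highest first.
DEFAULT_WATERMARKS = (("flood_stage", 0.95), ("high", 0.90), ("low", 0.85))


def watermark_reached(used_fraction: float) -> Optional[str]:
    """Return the highest default watermark that used_fraction has crossed, or None."""
    for name, threshold in DEFAULT_WATERMARKS:
        if used_fraction >= threshold:
            return name
    return None


def usage_of(path: str) -> float:
    """Used fraction of the filesystem holding `path`.

    Computed from available space, so blocks reserved for root count as used.
    """
    total, _used, free = shutil.disk_usage(path)
    return 1 - free / total
```

With the 90% figure reported above, `watermark_reached(0.90)` returns `"high"`; `watermark_reached(usage_of("/var/lib/elasticsearch"))` would check the live disk, assuming the Debian package's default data path.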

eglyn commented 3 years ago

There is something I don't understand about disk retention...

I set:
maxFileSizeG = 1
maxFileTimeM = 30
freeSpaceG = 50%

But the disk still fills up (82% now)... Don't the limits work? Is there another parameter to limit disk usage?

The /data folder is about 1.2 TB on a 1.8 TB disk, and there is 172 GB in the /var folder.
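As far as I know from the Arkime/Moloch configuration docs, `maxFileSizeG` and `maxFileTimeM` only control when a new pcap file is started, while `freeSpaceG` (a size in GB or a percentage) triggers deletion of the oldest pcap files in Moloch's own pcap directory; it does not cap Elasticsearch data or other directories. A minimal sketch of that threshold check, assuming "G" means 1024**3 bytes; `free_space_threshold_bytes` and `would_prune` are hypothetical helpers, not Moloch code.

```python
import shutil


def free_space_threshold_bytes(free_space_g: str, total_bytes: int) -> int:
    """Convert a freeSpaceG value such as "50%" or "100" into a byte threshold."""
    value = free_space_g.strip()
    if value.endswith("%"):
        return int(total_bytes * float(value[:-1]) / 100)
    return int(float(value) * 1024 ** 3)  # assumption: G = GiB


def would_prune(pcap_dir: str, free_space_g: str) -> bool:
    """True if free space on pcap_dir's filesystem is below the freeSpaceG threshold."""
    total, _used, free = shutil.disk_usage(pcap_dir)
    return free < free_space_threshold_bytes(free_space_g, total)
```

With `freeSpaceG = 50%` on an 82%-full disk, the check would already be true, so if usage keeps growing the extra space is presumably being written somewhere pcap pruning does not cover.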

pevma commented 3 years ago

Where did you set those settings? If it was during setup for the pcap retention, that does not apply to ES.

eglyn commented 3 years ago

As described in the wiki, in the file /data/moloch/etc/config.ini.

I don't think it is Elasticsearch that uses all the disk space, but the Moloch directory (1.2 TB).

pevma commented 3 years ago

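Yes, this is Elasticsearch reaching the disk watermark and thus switching to read-only (in ES 7.x the defaults are 85% low, 90% high and 95% flood stage, with the read-only block applied at the flood stage). So maybe ES is writing to a different volume/disk?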
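To answer whether Elasticsearch and Moloch write to the same filesystem, one can resolve each data directory to its mount point. A minimal sketch; `mount_point` is a hypothetical helper, and `/var/lib/elasticsearch` is only the Debian package's default `path.data` (check `elasticsearch.yml` for the real value).

```python
import os


def mount_point(path: str) -> str:
    """Walk up from `path` until reaching the mount point that contains it."""
    path = os.path.realpath(path)
    while not os.path.ismount(path):
        parent = os.path.dirname(path)
        if parent == path:  # reached the filesystem root
            break
        path = parent
    return path
```

Comparing `mount_point("/var/lib/elasticsearch")` with `mount_point("/data/moloch")` shows whether both share a disk; going by the df listing earlier in the thread, where only `/` and `/boot` are disk-backed, both would presumably resolve to `/`.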