ManageIQ / manageiq

ManageIQ Open-Source Management Platform
https://manageiq.org
Apache License 2.0

Memory usage spikes on appliance with 100+ providers #22900

Open jaywcarman opened 7 months ago

jaywcarman commented 7 months ago

Issue Description

Memory usage periodically spikes until out of memory. Eventually something crashes and a bunch of workers are killed. Things start back up and the process repeats.

Memory used (%) since last reboot on 2024-01-31 until 2024-02-15:

[screenshot: memory used (%) graph]

Environment

Logs

I saved off journalctl entries for the following events. The log files are too large to upload to GitHub (also, they may contain sensitive information). I can provide access to them as needed.
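For reference, ranges like these can be captured with journalctl's time filters; the exact invocation used for these files is an assumption, but it would look roughly like:

$ journalctl --since "2024-02-08 10:55:00" --until "2024-02-08 13:05:00" > journalctl_20240208105500-20240208130500.txt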

Memory usage spikes ~10% on 2024-02-08 11:10

[screenshot: memory usage graph]

journalctl save: journalctl_20240208105500-20240208130500.txt

Memory usage drop on 2024-02-14 00:10

[screenshot: memory usage graph]

journalctl save: journalctl_20240214000000-20240214003000.txt

Memory usage spikes ~10% on 2024-02-15 12:55

[screenshot: memory usage graph]

journalctl save: journalctl_20240214125000-20240214130500.txt

top output save: top_output_20240215.log

From the top output it looks like used stays flat while buff/cache increases:


$  grep -n "^MiB Mem" top_output_20240215.log
~~snip~~
488291:MiB Mem : 160910.6 total,  37859.1 free, 104039.2 used,  19012.3 buff/cache
489176:MiB Mem : 160910.6 total,  37666.9 free, 104048.2 used,  19195.5 buff/cache
490071:MiB Mem : 160910.6 total,  34829.9 free, 104047.5 used,  22033.1 buff/cache
490963:MiB Mem : 160910.6 total,  31841.1 free, 104259.1 used,  24810.4 buff/cache
491862:MiB Mem : 160910.6 total,  29232.8 free, 104038.6 used,  27639.2 buff/cache
492757:MiB Mem : 160910.6 total,  26424.3 free, 104029.0 used,  30457.3 buff/cache
493650:MiB Mem : 160910.6 total,  23600.8 free, 104035.0 used,  33274.8 buff/cache
494550:MiB Mem : 160910.6 total,  21883.7 free, 103876.1 used,  35150.8 buff/cache
495446:MiB Mem : 160910.6 total,  21620.7 free, 103817.4 used,  35472.6 buff/cache
~~snip~~
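As a quick sanity check on whether that buff/cache growth is actually reclaimable, /proc/meminfo can be inspected alongside top (a sketch assuming a stock kernel; MemAvailable already accounts for reclaimable cache):

$ grep -E '^(MemTotal|MemFree|MemAvailable|Buffers|Cached|SReclaimable|Shmem):' /proc/meminfo
$ free -m    # the 'available' column should stay high if the cache is freeable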
kbrock commented 7 months ago

I wonder if this is due to supports_feature=true in our API queries. Those force loading all providers into memory.

Typically there are not that many providers, so this slides through. But once there are enough providers, forcing that filtering out of the database and into RAM starts to show up.
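For illustration, the shape of the request in question would be something like the following (the endpoint, attribute name, and filter syntax here are assumptions, not a confirmed reproducer). A filter on a supports_* style attribute can't be pushed down into SQL, so every provider record has to be instantiated just to evaluate the filter in Ruby:

$ # hypothetical query shape; the real requests may differ
$ curl -sk -u admin:'******' "https://<appliance>/api/providers?filter[]=supports_feature=true"

A filter on a plain database column, by contrast, ends up in the SQL WHERE clause and only loads the matching rows.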

Question:

Do we know what the server was doing when it hit an out of memory?

jaywcarman commented 7 months ago

Do we know what the server was doing when it hit an out of memory?

@kbrock I just shared an internal Box folder that contains a journalctl save covering the "crash": journalctl_20240214000000-20240214003000.txt, at around Feb 14 00:10:00.

The first OOM messages I see are related to the DB connection:

$  grep -n OutOfMemory journalctl_20240214000000-20240214003000.txt
1100:Feb 14 00:02:30 rdr-cloud-usage.com evm[2580]:  WARN -- evm: MIQ(VmdbDatabaseConnection.log_statistics) Unable to log stats, 'PG::OutOfMemory: ERROR:  out of memory
1184:Feb 14 00:04:30 rdr-cloud-usage.com production[1853442]: ERROR -- production: There was an exception - ActiveRecord::StatementInvalid(PG::OutOfMemory: ERROR:  out of memory
1246:Feb 14 00:04:30 rdr-cloud-usage.com api[3789]: ERROR -- api: MIQ(Api::AuthController.api_error) ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR:  out of memory
1248:Feb 14 00:04:30 rdr-cloud-usage.com production[3789]: FATAL -- production: Error caught: [ActiveRecord::StatementInvalid] PG::OutOfMemory: ERROR:  out of memory
1979:Feb 14 00:07:41 rdr-cloud-usage.com api[807264]: ERROR -- api: MIQ(Api::AuthController.api_error) ActiveRecord::StatementInvalid: PG::OutOfMemory: ERROR:  out of memory
1986:Feb 14 00:07:41 rdr-cloud-usage.com production[807264]: FATAL -- production: Error caught: [ActiveRecord::StatementInvalid] PG::OutOfMemory: ERROR:  out of memory
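Since these are PG::OutOfMemory errors, the allocation appears to have failed inside the PostgreSQL backend itself rather than via the kernel OOM killer, so the database's memory settings are worth a look. A minimal check, assuming the appliance's standard vmdb_production database:

$ sudo -u postgres psql -d vmdb_production -c "SELECT name, setting, unit FROM pg_settings WHERE name IN ('shared_buffers','work_mem','maintenance_work_mem','max_connections');"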
agrare commented 7 months ago

From the top output it looks like used stays flat while buff/cache increases:

This is extremely odd; memory used by the buffer cache shouldn't cause an OOM, since it is supposed to be freeable at any time. If there was an OOM while there was still available memory, that is really weird.

One thing that comes to mind: what is the filesystem usage looking like here? If you're using mmap'd memory pages backed by files and you run out of space on the filesystem, it will actually trigger a memory fault (SIGBUS) rather than an ENOSPC error.

jaywcarman commented 7 months ago

From the top output it looks like used stays flat while buff/cache increases:

This is extremely odd; memory used by the buffer cache shouldn't cause an OOM, since it is supposed to be freeable at any time. If there was an OOM while there was still available memory, that is really weird.

I might have confused things a bit here - the top_output.log doesn't include data from when the OOM occurred, just from a subsequent memory usage jump.

what is the filesystem usage looking like here?

[root@rdr-cloud-usage ~]# df -h
Filesystem                              Size  Used Avail Use% Mounted on
devtmpfs                                 79G     0   79G   0% /dev
tmpfs                                    79G   41M   79G   1% /dev/shm
tmpfs                                    79G  1.9M   79G   1% /run
tmpfs                                    79G     0   79G   0% /sys/fs/cgroup
/dev/mapper/vg_system-lv_os              11G  6.5G  4.1G  62% /
/dev/vda1                              1014M  216M  799M  22% /boot
/dev/mapper/vg_system-lv_home          1014M   47M  968M   5% /home
/dev/mapper/vg_system-lv_var             12G  2.4G  9.7G  20% /var
/dev/mapper/vg_pg-lv_pg                 500G   26G  475G   6% /var/lib/pgsql
/dev/mapper/vg_system-lv_var_log         21G   14G  7.1G  67% /var/log
/dev/mapper/vg_system-lv_var_log_audit  507M   66M  441M  13% /var/log/audit
/dev/mapper/vg_system-lv_tmp           1014M   40M  975M   4% /tmp
tmpfs                                    16G     0   16G   0% /run/user/0
agrare commented 7 months ago

One other thing that comes to mind: I've heard of some issues on Linux (specifically with ZFS on Linux, since ZFS makes heavy use of the in-memory ARC cache) where cached memory can't be freed "fast enough" to prevent an OOM.

This is assuming there was still available memory when the OOM hit, which we don't know.
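If slow cache reclaim is the suspect, the relevant kernel knobs and the kernel's own view of reclaimable memory can at least be captured around the next spike (a sketch of what to look at, not tuning advice):

$ sysctl vm.min_free_kbytes vm.vfs_cache_pressure vm.swappiness
$ grep -E '^(MemAvailable|Dirty|Writeback):' /proc/meminfo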

miq-bot commented 3 months ago

This issue has been automatically marked as stale because it has not been updated for at least 3 months.

If you can still reproduce this issue on the current release or on master, please reply with all of the information you have about it in order to keep the issue open.