erlang / otp

`system_memory_high_watermark` alarm set when there is more than enough memory available. #8759

Open mmzeeman opened 2 weeks ago

mmzeeman commented 2 weeks ago

It is useful to have good alarms. However, on a system with 64 GB of RAM, 60 GB of it available but 59 GB holding cached files, os_mon will set the system_memory_high_watermark alarm. This is not useful.

In fact, this alarm is raised in almost all normal usage situations, mostly because the underlying OS uses otherwise free memory for useful things. This memory is made available again when applications need it.

Steps to reproduce: Start the os_mon application on a system which has cached files. macOS and Linux usually use all available RAM for their file caches, which will almost always raise this alarm when you start the os_mon application.

~> erl
Erlang/OTP 26 [erts-14.2.5] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Eshell V14.2.5 (press Ctrl+G to abort, type help(). for help)
1> application:ensure_all_started(os_mon).
{ok,[sasl,os_mon]}
=INFO REPORT==== 28-Aug-2024::11:08:28.213780 ===
    alarm_handler: {set,{system_memory_high_watermark,[]}}
=INFO REPORT==== 28-Aug-2024::11:08:28.214697 ===
    alarm_handler: {set,{{disk_almost_full,"/System/Volumes/Data"},[]}}
=INFO REPORT==== 28-Aug-2024::11:08:28.214781 ===
    alarm_handler: {set,
                       {{disk_almost_full,
                            "/Library/Developer/CoreSimulator/Volumes/iOS_21C62"},
                        []}}
2> alarm_handler:get_alarms().
[{{disk_almost_full,"/Library/Developer/CoreSimulator/Volumes/iOS_21C62"},
  []},
 {{disk_almost_full,"/System/Volumes/Data"},[]},
 {system_memory_high_watermark,[]}]

Expected behavior: This alarm should only be raised when the available memory is low.

In this case the alarm was set on a system with the following memory statistics:

~> vm_stat
Mach Virtual Memory Statistics: (page size of 16384 bytes)
Pages free:                                5834.
Pages active:                            306439.
Pages inactive:                          302581.
Pages speculative:                          267.
Pages throttled:                              0.
Pages wired down:                        126856.
Pages purgeable:                          24446.
"Translation faults":                 212752452.
Pages copy-on-write:                    4316050.
Pages zero filled:                    116854805.
Pages reactivated:                     15066135.
Pages purged:                           6652516.
File-backed pages:                       196405.
Anonymous pages:                         412882.
Pages stored in compressor:              713012.
Pages occupied by compressor:            270528.
Decompressions:                        16682994.
Compressions:                          21721440.
Pageins:                                4452712.
Pageouts:                                111494.
Swapins:                                      0.
Swapouts:                                     0.

That is 302581 + 5834 = 308415 pages of available memory (inactive + free).
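To put the page counts in more familiar units, here is a minimal sketch of the conversion to bytes. The module and function names are hypothetical and only illustrate the arithmetic; they are not part of os_mon.

```erlang
-module(vm_stat_sketch).
-export([available_bytes/3]).

%% Hypothetical helper: treat free + inactive pages as available
%% and convert the page count to bytes using the reported page size.
available_bytes(FreePages, InactivePages, PageSize) ->
    (FreePages + InactivePages) * PageSize.
```

With the vm_stat output above, `vm_stat_sketch:available_bytes(5834, 302581, 16384)` gives 5053071360 bytes, roughly 4.7 GiB.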

The Linux system I got this on had these memory usage stats: [output not preserved]

In this case this alarm was raised because the system uses almost all available memory for its file cache.

Instead of looking at the allocated RAM, memsup should look at the available system RAM.

Affected versions: All versions, I think.

Additional context: This can be platform dependent, because all platforms report memory usage differently. The C code that reports the memory usage might need some changes to report more useful values.

PS: I'm willing to help fix this bug.

mmzeeman commented 2 weeks ago

It might be an idea to use the available-memory metric from a cross-platform monitoring tool. For instance, Python's psutil has something which will work. See: https://psutil.readthedocs.io/en/latest/#memory

garazdawi commented 1 week ago

memsup:get_system_memory_data/0 has a value called available_memory that I think would make a great default for this alarm (falling back to free_memory if not available). A PR changing that would be welcome.

If available_memory is not what we want, then we should either improve it to be what we want or add another key getting the data that we want.
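A rough sketch of what that suggestion could look like, assuming the check works on the tagged list returned by memsup:get_system_memory_data/0. The module and function names are hypothetical, and this is not the actual memsup implementation. It prefers available_memory and falls back to free_memory, then compares the used fraction against a watermark.

```erlang
-module(memsup_watermark_sketch).
-export([usable_memory/1, used_fraction/1, above_watermark/2]).

%% Memory that applications could still obtain: prefer available_memory,
%% fall back to free_memory when the platform does not report it.
%% (Assumes at least one of the two keys is present.)
usable_memory(MemData) ->
    case proplists:get_value(available_memory, MemData) of
        undefined -> proplists:get_value(free_memory, MemData);
        Available -> Available
    end.

%% Total memory of the machine: prefer system_total_memory,
%% fall back to total_memory.
total_memory(MemData) ->
    case proplists:get_value(system_total_memory, MemData) of
        undefined -> proplists:get_value(total_memory, MemData);
        Total -> Total
    end.

%% Fraction of total memory that is not usable by applications.
used_fraction(MemData) ->
    Total = total_memory(MemData),
    (Total - usable_memory(MemData)) / Total.

%% True when the used fraction exceeds the watermark (for example 0.80).
above_watermark(MemData, Watermark) ->
    used_fraction(MemData) > Watermark.
```

With os_mon running, this could be tried as `memsup_watermark_sketch:above_watermark(memsup:get_system_memory_data(), 0.80)`. Using illustrative numbers in GB that mirror the report (64 total, 60 available, and a hypothetical 1 free), the list `[{system_total_memory, 64}, {free_memory, 1}, {available_memory, 60}]` gives a used fraction of 4/64 and no alarm, whereas the same list without available_memory gives 63/64 and would trip an 0.80 watermark.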

mmzeeman commented 1 week ago

Indeed available_memory is what we want to use here. Working on a PR.