fermi-ad / controls

Central repo for reporting bugs, making feature requests, managing RFCs, and requesting seminar topics.
https://www-bd.fnal.gov/controls/

Erlang front-ends memory leak #16

Closed · awattsFNAL closed this issue 9 months ago

awattsFNAL commented 9 months ago

Tracking apparent memory leak in Erlang front ends following Erlang 26 and Alma Linux 9 upgrades.

rneswold commented 9 months ago

Do we know the typical lifetime before these front-ends experience the error? 10 minutes? Hours? Days?

awattsFNAL commented 9 months ago

From Keith on Slack:

CLX58E shows, according to D31, that the node has been up 5+ days. However, run_erl_fe is missing and the last entry in the Erlang log states that Erlang shut down shortly after 29-Sept 12:12:24. Why? I'm going to leave clx58E in the state it is in, in case someone with more Erlang knowledge can debug. Thanks.

    eheap_alloc: Cannot allocate 441485240 bytes of memory (of type "heap").
    Crash dump is being written to: erl_crash.dump...done
    [os_mon] memory supervisor port (memsup): Erlang has closed
    [os_mon] cpu supervisor port (cpu_sup): Erlang has closed

    clx58$ acnet show
    PID      NI %CPU    VSZ   RSS STAT START COMMAND
    1274554   0  0.1  27556 19512 Ss   Sep27 acnetd
    1274671   0  0.0 118424 25752 S    Sep27 clockpeek
    1274571   0  0.0 157540 27540 Sl   Sep27 dpm
    1274661   0  0.0 138316 23536 Ssl  Sep27 pld
    1274685   0  0.0 126184 31456 Ssl  Sep27 alarms_daemon
    1274647   0  0.0   4324  3260 Ss   Sep27 service_watcher

Missing run_erl_fe (edited)
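
For reference, the erl_crash.dump mentioned above can be inspected offline with OTP's crash-dump viewer; a minimal sketch, assuming the observer application (and wx support) is available on whatever machine does the inspection:

```erlang
%% Open the crash dump in OTP's GUI viewer; it breaks out per-process
%% heap sizes, message queues, and which allocator failed.
%% (Path is the default dump name from the log above; adjust as needed.)
1> crashdump_viewer:start("erl_crash.dump").
```

The same viewer can also be launched from the OS shell with the cdv script that ships with OTP.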

awattsFNAL commented 9 months ago

From Keith on Slack (12 minutes ago): clx58 is using the llrf_driver and llrf_wave libraries. Very small configuration, only half a dozen OIDs. I have previously received email from frontend@clx58.fnal.gov stating that the system_memory_high_watermark alarm had been raised; the last email like that was 26-Sept.
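
For context, system_memory_high_watermark is the alarm that os_mon's memsup sets when system memory use crosses its configured threshold (80% by default). A minimal sketch of querying memsup on the node, assuming os_mon is running (the crash output above suggests it is):

```erlang
%% Run in the front-end node's shell; all calls are part of os_mon/memsup.
1> memsup:get_sysmem_high_watermark().   % threshold (%) that triggers the alarm
2> memsup:get_memory_data().             % {Total, Allocated, {WorstPid, PidBytes}}
3> memsup:get_system_memory_data().      % detailed OS-level memory breakdown
```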

awattsFNAL commented 9 months ago

From Keith on Slack (3 minutes ago): MCR called. clx46e is also down; it looks to be the same issue.

    eheap_alloc: Cannot allocate 367904368 bytes of memory (of type "heap").

rneswold commented 9 months ago

I found this. It's a bug in the JIT when handling binaries. Our framework does a lot of that, so we may be hitting this bug.

They said it'll be fixed in a future version. We have OTP/26.0 installed and the official, latest release is OTP/26.1.1.

Looks like we should install the latest version.
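
Before and after the upgrade it is worth confirming what each running VM actually reports; a minimal check from a node's shell, using standard erlang:system_info calls:

```erlang
%% Major OTP release and ERTS version of the running VM; the ERTS string
%% changes between 26.0 and 26.1.x, so it distinguishes the two installs.
1> erlang:system_info(otp_release).   % e.g. "26"
2> erlang:system_info(version).       % ERTS version string
3> erlang:system_info(system_version). % full banner, e.g. "Erlang/OTP 26 [erts-...]"
```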

nicklausdj commented 9 months ago

I upgraded 46 yesterday (10/2), along with 39 & 42.

rneswold commented 9 months ago

Should I install 26.1.1? Or should I ask Larry to do it?

nicklausdj commented 9 months ago

One thing that may be worth noting: 46 and 58, the two that died, are currently very prolific in writing output to their logs. Clx42, for instance, which I also upgraded yesterday, isn't writing much to its log. Several Erlang versions ago, I think we had a similar issue where log output leaked memory, didn't we? Clx 39 & 42 are upgraded and not very chatty to the log; their memory footprint is 58-76 MB according to top. Clx41 was also upgraded by Keith last week and it's starting to use lots of memory (& CPU). I predict it'll die soon. It has wrapped through at least 10 instances of log files.

nicklausdj commented 9 months ago

> Should I install 26.1.1? Or should I ask Larry to do it?

Ask Larry, on 58 for now. I changed 46's startup and restarted it to leave off the beamtrail driver that was outputting all the text, 'cuz nobody uses beamtrail anymore, I guess.

kengell commented 9 months ago

As for the prolific logging done by clx58: What I find in the log more than anything is something like the following

    _Process <0.32835.0> get_attrsetup undefined Dvr. DI=278520, SSDN=<<21,2,0,0,0,

FWIW, DI=278520 (T:1LCVPH) is a device on ACNET node CMLRFA, not clx58.

rneswold commented 9 months ago

Interesting observation, Dennis! It'll be interesting to see if 42 stays up longer (or forever 🤞🏻).

rneswold commented 9 months ago

Looks like CMLRFA is CLX58. I think Dennis created several ACNET aliases on one of the nodes (probably CLX58) so he could have many unrelated front-ends running.

nicklausdj commented 9 months ago

> As for the prolific logging done by clx58: What I find in the log more than anything is something like the following
>
>     _Process <0.32835.0> get_attrsetup undefined Dvr. DI=278520, SSDN=<<21,2,0,0,0,
>
> FWIW, DI=278520 (T:1LCVPH) is a device on ACNET node CMLRFA, not clx58.

Yeah, I'm working on obsoleting those. But yes, CMLRFA, B, C, & D are aliases of clx58e.

nicklausdj commented 9 months ago

[Screenshot 2023-10-03 at 4:58:45 PM]

nicklausdj commented 9 months ago

That's a plot of the Erlang memory device for clx58e, showing the leak over a couple of hours. The total process memory footprint is about that plus 40MB.
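
For reference, a throwaway sketch of sampling the same quantity directly on the node. The plotted ACNET device presumably reports something equivalent to erlang:memory(total), but that mapping is an assumption, and mem_sample here is a hypothetical helper, not part of the front-end framework:

```erlang
-module(mem_sample).
-export([start/1]).

%% Hypothetical helper (not framework code): print the VM's total
%% allocated bytes every IntervalMs milliseconds.
start(IntervalMs) ->
    spawn(fun() -> loop(IntervalMs) end).

loop(IntervalMs) ->
    io:format("~p s: ~p bytes total~n",
              [erlang:system_time(second), erlang:memory(total)]),
    timer:sleep(IntervalMs),
    loop(IntervalMs).
```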

nicklausdj commented 9 months ago

> Do we know the typical lifetime before these front-ends experience the error? 10 minutes? Hours? Days?

For 58 and 46 it was less than a day. 40 was converted last Thursday or so, and it's on the verge. So it depends on usage.

rneswold commented 9 months ago

Strange that it levels out for a while before stepping higher.

rneswold commented 9 months ago

If you run i() on it, do any processes have a message queue with lots of messages? That function will also show each process's heap and stack usage.
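
A minimal sketch of pulling the same figures with plain process_info calls (nothing framework-specific), listing the worst offenders by memory along with their message-queue lengths:

```erlang
%% Paste into the node's shell: top N processes by memory footprint,
%% with message-queue length, registered name, and current function.
1> TopMem = fun(N) ->
       L = [case erlang:process_info(P, [memory, message_queue_len,
                                         registered_name, current_function]) of
                undefined -> [];      % process exited while we were looking
                I -> [{proplists:get_value(memory, I), P,
                       proplists:get_value(message_queue_len, I),
                       proplists:get_value(registered_name, I),
                       proplists:get_value(current_function, I)}]
            end || P <- erlang:processes()],
       lists:sublist(lists:reverse(lists:sort(lists:append(L))), N)
   end.
2> TopMem(5).
```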

nicklausdj commented 9 months ago

On clx58, currently running away, there's a huge heap on the user_drv:init process:

    <0.70.0> supervisor_bridge:user_sup/1     987     1287 0                 gen_server:loop/7         11
    <0.71.0> user_drv:init/1              2072833 72225518 0 user_drv        gen_statem:loop_receive/3 10
    <0.73.0> prim_tty:writer/1               1598   461463 0 user_drv_writer prim_tty:writer_loop/2     8

But on clx42, which is under control but still OTP 26, user_drv:init is big, but much smaller than that:

    <0.70.0> supervisor_bridge:user_sup/1     987     2801 0                 gen_server:loop/7         11
    <0.71.0> user_drv:init/1                46422   991781 0 user_drv        gen_statem:loop_receive/3 10
    <0.73.0> prim_tty:writer/1               2586    11316 0 user_drv_writer prim_tty:writer_loop/2     8

In between is clx46, with memory growing, but not as fast as clx58:

    <0.71.0> user_drv:init/1              1199557 33405980 0 user_drv        gen_statem:loop_receive/3 10

Memory footprint from top: clx42: 76MB, clx46: 146MB, clx58: 219MB.
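
One low-risk thing to try (a suggestion, not something the framework already does): force a garbage collection on the suspect process and compare its heap before and after. If the heap doesn't shrink, the data is live and genuinely accumulating rather than just waiting to be collected:

```erlang
%% In the node's shell; pid/3 is the shell helper for building <0.71.0>,
%% the user_drv process from the listing above.
1> Before = erlang:process_info(pid(0,71,0), [heap_size, total_heap_size]).
2> erlang:garbage_collect(pid(0,71,0)).
3> After = erlang:process_info(pid(0,71,0), [heap_size, total_heap_size]).
```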

nicklausdj commented 9 months ago

No non-zero message queues.

kengell commented 9 months ago

Parameter page of memory usage and process count for the CLX nodes where I am primary keeper. It looks like this, from worst to best for memory usage:

[Image: parameter page display]

kengell commented 9 months ago
1. I obsoleted devices on ACNET nodes CLX41E and CLX58E (including CMLRFA, B, C, D) that were returning invalid OID errors.
2. I restarted the run_erl_fe task on both CLX41E and CLX58E.
3. I added all the devices (e.g. U:EMEE28) listed in my previous post to the Controls data logger at 1 Hz.
nicklausdj commented 9 months ago

Shoot, clx58e is now running 26.1. No signs of leaking after 20 minutes, but Keith removed most of what was causing it to log. Don't touch 40! I'll upgrade it to 26.1 next.

nicklausdj commented 9 months ago

I'm running the 26.1 version on 40 but it is still leaking badly, easily observable with top. However, it only has 26.1 in the /usr/local/frontend/sync area, not installed globally. I think it all runs out of that local area, but I'm not 100% certain it doesn't pick up something from the global path.
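
One way to settle which installation a running node actually picked up is to ask the VM directly; a minimal check using standard calls (if the local sync copy is in use, code:root_dir/0 should point somewhere under the /usr/local/frontend/sync area mentioned above):

```erlang
%% Where the running OTP was started from, plus its exact patch level
%% (the OTP_VERSION file under the release directory, e.g. "26.1" vs "26.1.1").
1> code:root_dir().
2> file:read_file(filename:join([code:root_dir(), "releases",
                                 erlang:system_info(otp_release), "OTP_VERSION"])).
```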

nicklausdj commented 9 months ago

I now have clx40, 46, 58, and 67 running 26.1. 46 has logged about 3MB worth of logs since restart, with no sign of memory leakage. 58 & 67 seem fine, but aren't logging much either since they've been cleaned up. 40 leaks badly, but isn't logging much at all, so there is apparently something else going on there. Here's the last hour or so of memory usage of 58 (green) & 46 (red):

[Screenshot 2023-10-04 at 4:33:58 PM]

BTW: clx_sync_master (which fills update_fe pushes) now has 26.1 compiled code (& runtime).

kengell commented 9 months ago

Overnight plot of CLX40, 41, 44, 56, 58 and 68.

All six seem stable in their memory usage - no large growth. Devices were added to the Controls data logger yesterday.

[Image: overnight memory-usage plot with legend]

awattsFNAL commented 9 months ago

The move to Erlang 26.1 has fixed this issue.