ElvishArtisan / rivendell

A full-featured radio automation system targeted for use in professional broadcast and media environments

RDLogManager / RDLogEdit getting hung #865

Open alexolivan opened 1 year ago

alexolivan commented 1 year ago

Hi!

I've noticed, roughly once per week (my guess is that it may depend on system load or on particular environment/circumstances), that some backend service gets into a state that renders both the RDLogEdit GUI and the rdlogmanager command useless.

An early symptom is that, while RDAirPlay, RDCatch, etc. keep working (so there is no reason to suspect the rivendell service crashed), the RDLogEdit GUI module launches as if it were completely unable to grab data/info from the backend: Screenshot at 2023-01-16 17-06-58

Finding RDAirPlay stopped in the morning, with no log loaded, led me to discover that the 'cronified' rdlogmanager commands I use to generate the daily logs had hung... The processes stayed up forever and had to be killed. Any attempt to use the rdlogmanager command leads to the same permanently hung state.

Restarting the whole rivendell.service fixes the issue... so my bet is that, much like I found that rdcatchd could crash on its own while everything else keeps working, something similar may be happening here. I don't know, but, as I did with rdcatchd, maybe I can run the associated daemon under gdb and try to get a backtrace when the issue happens.

Any ideas? Cheers

ElvishArtisan commented 1 year ago

This one may be tough to run down. It could be anything, including stuff in your larger environment that has no direct relation to Rivendell.

I don't know, but, as I did with rdcatchd, maybe I can run the associated daemon under gdb and try to get a backtrace when the issue happens.

Backtraces are always helpful! Also, when you find the system in this state, check to see if the primary Rivendell daemons are still running. You can do this by running the following commands:

ps ax | grep caed
ps ax | grep ripcd
ps ax | grep rdcatchd

For each of these, you should get back something like:

28657 ?        SLl    1:25 /usr/sbin/caed

N.B. You may also get back a line from the grep(1) command itself, like:

29688 pts/3    S+     0:00 grep --color=auto caed

If it has the word 'grep' in it, it's a false positive and should be ignored.
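As a small convenience (assuming pgrep(1) is installed, which isn't part of Rivendell itself), the three checks can be collapsed into one line that also avoids the grep false positive, since pgrep never matches its own process:

for d in caed ripcd rdcatchd; do pgrep -l "$d" || echo "$d is NOT running"; done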

alexolivan commented 1 year ago

OK... this morning I found Rivendell without logs again. Also, an RDCatch upload was hung.

In order to reduce the chance of the issue being related to excessive system load, I scattered the upload events for the weekend recordings across the week nights, so the load is distributed as evenly as possible. But I'm not at all sure system load is the root cause, since it is not that high and, after all, this mockup VM has been running for more than a year and log generation has been rock solid even under the hardest conditions. Here is the available info:

Apparently, the rdlogmanager commands froze yesterday and it went unnoticed, with tonight's runs getting stuck as well:

root     3532233  0.0  0.0   2484   516 ?        Ss   Jan23   0:00 /bin/sh -c /usr/bin/rdlogmanager -g -s Traffic 2>&1 | /usr/bin/logger -t rdlog-generate
root     3532234  0.3  0.6 194164 32868 ?        S    Jan23   6:12 /usr/bin/rdlogmanager -g -s Traffic
root     3532235  0.0  0.0   8772  1048 ?        S    Jan23   0:00 /usr/bin/logger -t rdlog-generate
root     3532253  0.0  0.0   2484   580 ?        Ss   Jan23   0:00 /bin/sh -c /usr/bin/rdlogmanager -g -s Musica 2>&1 | /usr/bin/logger -t rdlog-generate
root     3532256  0.3  0.6 194164 33360 ?        S    Jan23   5:57 /usr/bin/rdlogmanager -g -s Musica
root     3532257  0.0  0.0   8772  1116 ?        S    Jan23   0:00 /usr/bin/logger -t rdlog-generate
root     3667014  0.0  0.0   2484   516 ?        Ss   03:10   0:00 /bin/sh -c /usr/bin/rdlogmanager -g -s Traffic 2>&1 | /usr/bin/logger -t rdlog-generate
root     3667017  0.3  0.6 194164 33248 ?        S    03:10   1:02 /usr/bin/rdlogmanager -g -s Traffic
root     3667018  0.0  0.0   8772  1048 ?        S    03:10   0:00 /usr/bin/logger -t rdlog-generate
root     3667890  0.0  0.0   2484   516 ?        Ss   03:15   0:00 /bin/sh -c /usr/bin/rdlogmanager -g -s Musica 2>&1 | /usr/bin/logger -t rdlog-generate
root     3667892  0.3  0.6 194164 32932 ?        S    03:15   1:00 /usr/bin/rdlogmanager -g -s Musica
root     3667893  0.0  0.0   8772  1048 ?        S    03:15   0:00 /usr/bin/logger -t rdlog-generate
root     3697149  0.0  0.0   6244   644 pts/0    S+   08:47   0:00 grep rdlog

Time for checking the services:

Manually killing those rdlogmanager tasks, restarting rivendell.service and manually generating the missing logs put everything back in business; however, I expect this to happen again within roughly a week.
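For the record, the manual recovery boils down to something like this (the service names are the ones from my crontab above; treat it as a sketch of what I do rather than an official procedure):

pkill -f '/usr/bin/rdlogmanager -g'     # kill the hung log generators
systemctl restart rivendell.service     # restart the Rivendell daemons
/usr/bin/rdlogmanager -g -s Traffic     # regenerate the missing logs
/usr/bin/rdlogmanager -g -s Musica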

Could I run ripcd under gdb as I did with rdcatchd? Or what else could I do to help?

Cheers.

ElvishArtisan commented 1 year ago

Could I run ripcd under gdb as I did with rdcatchd? Or what else could I do to help?

Yes. A backtrace from ripcd(8) should allow us to get to the bottom of things.
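For reference, one way to capture that (a rough sketch; adjust to your setup, and nothing here is specific to Rivendell beyond the daemon name) is to attach gdb to the running daemon and dump all threads once it wedges:

gdb -p $(pidof ripcd)      # attach to the running ripcd, as root
(gdb) continue             # let it run until the problem shows up
^C                         # interrupt it once it hangs
(gdb) thread apply all bt  # backtrace of every thread; paste that here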

alexolivan commented 1 year ago

I'm on it... yesterday the issue happened again, and I started ripcd under gdb, but the rest of the Rivendell daemon ecosystem felt quite unstable. When I upgraded to the latest commit, Rivendell restarted and everything went back to normal, but after setting up closer monitoring of the system (Munin graphs) I have a clear picture of the root cause (or at least I think I do): memory.

During the week, memory usage stays nice and flat as the VM is basically just 'airing' and generating logs... but when the weekend comes and more and more recordings/uploads are done, memory usage goes up and the swap area grows until the system becomes unstable.

Screenshot at 2023-01-30 13-43-46

I tend to use cron jobs on VMs to periodically clear caches, inodes, etc... but here it's the applications' memory usage that grows bigger and bigger. I have reduced swappiness on the VM and increased the cache pressure, but I have to keep an eye on the memory usage of each running process over time to discover what is happening here.
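For reference, the VM tuning I mean is just the standard sysctls; the values below are illustrative rather than a recommendation, and of course they don't address the underlying growth, they only delay the point where swap gets involved:

sysctl -w vm.swappiness=10            # prefer keeping application pages in RAM
sysctl -w vm.vfs_cache_pressure=200   # reclaim dentry/inode caches more aggressively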

Cheers.

ElvishArtisan commented 1 year ago

Ah. Classic resource leak.

So, the big question: which process is hoarding all that memory? To see, run 'top'. The 'VIRT' column will show you how much total memory (physical + swap) each process is using. I suspect that you'll find one line that has a dramatically larger value there than any of the others.
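If a one-shot listing is easier to paste than interactive top output, something like this (plain procps, nothing Rivendell-specific) sorts processes by virtual size, largest first; VSZ and RSS are reported in kilobytes:

ps axo pid,comm,vsz,rss --sort=-vsz | head -n 10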

alexolivan commented 1 year ago

Hi... I'm on it... I'm trying to set up a per-process Munin plugin to track the memory usage of the most relevant processes (the 'usual suspects', so to speak) and have them nicely graphed from now on. I will also manually keep an eye on top/htop on the console to figure out what's going on... I think I'm going to get it :-)

Will report!
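In case anyone wants to reproduce the monitoring without Munin, a crude sampler along these lines gives much the same picture (the log path and interval are arbitrary):

while true; do
  date '+%F %T'
  ps -C caed,ripcd,rdcatchd -o comm=,vsz=,rss=   # VSZ/RSS in kB, one line per daemon
  sleep 300
done >> /var/log/rd-memwatch.log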

EDIT: Preliminary graphing shows caed memory usage ramping up. It's still early to draw conclusions, but it is encouraging to see that tonight's upload of a weekend recording briefly spiked rdcatchd's memory usage, which returned to idle levels afterwards; caed's memory usage, however, seems to grow constantly... Screenshot at 2023-01-31 11-36-51

My bet is that it will climb during recordings and never free the memory... will report.

alexolivan commented 1 year ago

OK, there it goes... Thursday-to-Friday night, the first recordings and uploads, and there it goes: Screenshot at 2023-02-03 09-16-26

ripcd memory usage skyrockets to 1.3 GB... Sure, I'm far from knowing the details of its actual tasks, but for an inter-process communication daemon that sounds as if something is wrong.

caed, meanwhile, keeps ramping up, slowly but surely... recordings and uploads do not alter its growth, but it seems to me that if the trend continues, although not as severe, in the long run it would also end up using too much memory.

EDIT: As I upgraded packages to deploy the latest nightly builds, the rivendell service restarted, and it's quite nice to see the effect reflected in the memory usage graphs as every daemon drops back to its initial memory usage: Screenshot at 2023-02-03 14-41-48

Cheers.

ElvishArtisan commented 1 year ago

Hmm ... unable to reproduce these results here.

This is using JACK, I take it? Can you tell me a bit about the "recordings and uploads" that trigger this? (A couple of screenshots of their parameters from rdcatch(1) would be great.)

dzkahn commented 1 year ago

I have experienced the blank rdlogedit and rdcatch a couple of times as well, but I cannot reproduce it at the moment; I don't know what the cause is.

alexolivan commented 1 year ago

This is using JACK, I take it? Can you tell me a bit about the "recordings and uploads" that trigger this? (A couple of screenshots of their parameters from rdcatch(1) would be great.)

Hi Fred... yes, it's ALL JACK... no ASI cards, no ALSA cards, no hardware at all. It's the very same VM I've been using for testing as thoroughly as I could (I'm unable to do hardware-based testing). The recordings and uploads are the very same ones where, through trial and error, I realized that normalization led rdcatchd to crash, and that anything beyond 3 hours in length led to empty/silent podcast uploads.

This is the first time in months that the VM 'survives' the Sunday-to-Monday night (thanks to the Friday restart). Today I've upgraded again to the latest commit, so the restart will return memory usage back to idle. Here's how the recordings during the weekend 'shows' lead to the memory usage ramp: Screenshot at 2023-02-06 09-19-31

A noticeable event happened after the Saturday-to-Sunday night recordings (after the last 'show' ended and before the uploads started), where it seems ripcd's memory was 'moved' to swap: comparing the per-process memory usage graph with the RAM usage graph, swap usage and ripcd RAM usage trade places: Screenshot at 2023-02-06 09-20-06

Regarding caed, it appears to behave quite nicely once it has enough memory committed, and it 'flattens' out with time. Also, uploads do not seem to be a cause of problems at all. Yes, they're resource intensive while the transcoding happens, but with enough CPU power their RAM usage footprint is barely noticeable.

So my bet is that there's some kind of leak that develops in ripcd under the conditions of an RDCatch scheduled recording (possibly specific to a JACK-only environment).

Cheers.

ElvishArtisan commented 1 year ago

The recordings and uploads are the very same ones where, through trial and error, I realized that normalization led rdcatchd to crash, and that anything beyond 3 hours in length led to empty/silent podcast uploads.

What is the longest-duration record event on your system? What are its parameters?

alexolivan commented 1 year ago

Hi!

The longest record is just under 3 hours. I have 1-, 2- and 3-hour recordings (roughly, since recordings do not start exactly on the hour). Here's the rdcatch event for one of the 3-hour ones: Screenshot at 2023-02-06 15-26-43

I initially had a weekly 5-hour record, until I found that the corresponding uploads never went beyond roughly 90 minutes. I tried shorter durations until I found that 3 hours is the 'limit' for recording + upload to work (I opened an issue, but we concluded this is some kind of internal limitation and it's closed). Anyway, I guess you also need the RDLibrary settings to know more about the parameters used for storing audio during the recording: Screenshot at 2023-02-06 15-33-23

Let me know if you need further configuration details.

Cheers.

ElvishArtisan commented 1 year ago

Perfect. Thank you.

alexolivan commented 1 year ago

Hi again.

Before the scheduled recordings that start tonight, I've been watching the per-process memory usage graphs during the weekdays, and even without any recordings to exacerbate the RAM usage problem, I see a trend of ripcd 'leaking' (if we can call it leaking) around 10 MB of RAM per day: Screenshot at 2023-02-09 08-54-43

That's not much on today's systems; at roughly 10 MB per day (around 300 MB per month), it would take months before the abnormal RAM usage became noticeable, and most systems would probably get restarted for some other reason before then, so it would pass unnoticed... However, I'm thinking that if such a leak does exist in a playback-only scenario, it might be an opportunity to get a clue as to why the problem gets so bad during recordings.

Cheers.

ElvishArtisan commented 1 year ago

Any new info on this? This is one of the very last issues left before we're ready for a production release. I've still not been able to goose it here at all.

alexolivan commented 1 year ago

Hi Fred.

No news... I guess this is a very tough one and the production release will have to ship with it still in.

Graphs do repeat week after week: Screenshot at 2023-03-01 09-27-59

The behaviour of ripcd is consistent, showing that, given the right circumstances, there's a memory leak somewhere that is especially acute during recordings. I've checked those situations manually from the CLI, using top/htop before restarting Rivendell, just to confirm that the data gathered by the munin-node plugin was accurate... and there was no discrepancy. I've watched the memory usage grow bigger and bigger. And I can't think of anything at the system, memory or virtualization level (I had ballooning in mind, for instance) that would make just one particular process, always the same one, behave like that.

Probably most users, with totally different use cases, hardware, media configuration, capture configuration... would never notice any of this. At most, a very slight climb in ripcd memory usage would become troublesome only after a very long time on a modern 16-32 GB RAM system.

However, I wouldn't close this one; I'd keep it in mind, since as more systems eventually upgrade to RD4, the chances of people experiencing problems or a gradual loss of smoothness increase, and remembering this issue could make it easier to troubleshoot just by keeping an eye on RAM usage.

Cheers!

ElvishArtisan commented 1 year ago

Yes, we'll keep this open. We'll just have to wait for more data to come in.

I appreciate your sticking with this!

ElvishArtisan commented 1 year ago

Any news to report on this?

alexolivan commented 1 year ago

Hi Fred!

For the sake of consistency, I keep monitoring the system, and sure enough, the issue is still there: I need to restart Rivendell every Friday. As time goes on, the graphs accumulate more data, and you can see the behaviour repeating since February, when I set up the monitoring:

Screenshot at 2023-06-22 09-02-13

Last weekend I did a hardware upgrade on the hypervisor, so that's the gap in the graph. Also, on some weekends I had no time to restart on Friday, so I did it on Saturday, and that's why ripcd memory usage goes higher than usual. The Linux kernel reacts by using swap and does its best to keep RAM free, but eventually processes get OOM-killed. This is the RAM usage graph, where you can see that committed memory starts skyrocketing somewhere around mid December... it took me a few weekends to realize something was wrong and diagnose the root cause, and that's why I only started graphing per-process memory about a month later:

Screenshot at 2023-06-22 09-08-02
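In case it helps anyone hitting the same thing, the Friday workaround is just a root crontab entry along these lines (the time is arbitrary; pick a quiet slot in your schedule). It only hides the leak, of course, it doesn't fix anything:

30 6 * * fri systemctl restart rivendell.service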

I have not tested a similar recording setup on a hardware-based install... so the suspicion that a JACK-only virtual system is part of the problem is definitely there. However, my bet would have been that, in that case, caed, the actual sound engine (or that's how I understand it), would have been the problem, not ripcd, which I understand to be an inter-process communication daemon. Right now, my speculation is that, during recordings on this all-software system, a huge amount of messaging and/or data is being handled by ripcd and somehow accumulates in RAM, or maybe leaks... maybe there's a ton of error messages, maybe a ton of info messages, maybe even pieces of the actual recorded audio pass through ripcd between daemons... I don't know... but chances are that, now with 4.0.1 in production, all these pesky issues with RDCatch recordings and uploads that I've been experiencing will show up more clearly.

This is definitely quite a pesky one!

Thank you. Cheers