AllStarLink / app_rpt

Refactoring and upgrade of AllStarLink's app_rpt, etc.
8 stars 6 forks source link

SCRN Coredump 2023-07-17 #196

Open tsawyer opened 1 year ago

tsawyer commented 1 year ago

From the log at time of crash. The last message before the restart is Echolink related.

[2023-07-17 14:27:58.775] VERBOSE[1933005] chan_voter.c: Voter client 502-Ecuador disconnect (timeout)
[2023-07-17 14:28:07.775] VERBOSE[1933005] chan_voter.c: Voter client 502-Ecuador disconnect (timeout)
[2023-07-17 14:28:22.792] VERBOSE[1933005] chan_voter.c: Voter client 502-Ecuador disconnect (timeout)
[2023-07-17 14:28:24.425] WARNING[2110925] app_rpt.c: statpost to URL 'http://stats.allstarlink.org/uhandler?node=42781&
time=1689629141&seqno=19620&keyed=0&keytime=328149' failed with code 0
[2023-07-17 14:28:33.312] VERBOSE[1933005] chan_voter.c: Voter client 502-Ecuador disconnect (timeout)
[2023-07-17 14:28:36.052] VERBOSE[2110242] manager.c: Manager 'admin' logged off from 3.19.110.106
[2023-07-17 14:28:39.312] VERBOSE[1933005] chan_voter.c: Voter client 502-Ecuador disconnect (timeout)
[2023-07-17 14:28:40.153] VERBOSE[2110244] manager.c: Manager 'admin' logged off from 3.19.110.106
[2023-07-17 14:28:42.830] ERROR[1933003] chan_echolink.c: Unable to connect to directory server server1.echolink.org.
[2023-07-17 14:29:13.619] Asterisk 20.1.0 built by root @ SCRN-ASL3 on a x86_64 running Linux on 2023-02-25 14:55:02 UTC
[2023-07-17 14:29:13.619] VERBOSE[2111025] message.c: Message handler 'dialplan' registered.

core-asterisk-2023-07-17T21-28-43Z-full.txt

InterLinked1 commented 1 year ago

Possible regression introduced by commit 29db2edcde09a27e92029820631b2089f8ae14e7, since that's where the code crashed.

It looks like a possible double free or free of memory that Asterisk did not allocate. @KB4MDD any thoughts since you likely understand this the best?

KB4MDD commented 1 year ago

I am testing code for this fix now. I have identified one issue the prevented a node from being deleted from one of the internal trees. I also combined two routines into one.

InterLinked1 commented 1 year ago

@KB4MDD Thanks Danny. Curious, were you able to reliably reproduce the segfault Tim was seeing?

I don't know how easy it would be to do, but maybe we should start thinking about adding tests to the extent possible as we go. There is already an issue open for this, but rather than waiting until later, if we can develop tests now to continue to prevent regressions in the future then that will probably play a lot of divdends.

Could chan_echolink run in a VM without any hardware connected, in some kind of way that could be tested here? (Probably using the Asterisk test suite, not a unit test).

KB4MDD commented 1 year ago

@InterLinked1 I have not been able to reproduce the fault. I however did discover some issues in the code that prevented one of the internal binary trees to not be deleted properly. I have reworked that code and it works properly. I ran that code for 4 days with no problem.

The issue occurred when the echolink server sent a full list. It is unknown to me how often that occurs. The double free occurred because the database entry had been freed, but not removed from the binary tree.

chan_echolink can run without any hardware. The loading and maintenance of the internal tree can work if you have a valid account with echolink.

Changes will be pushed soon.

tsawyer commented 1 year ago

When first testing this fix I thought someone replying to my Echolink transmissions, but there was no audio. Then I realized that after each of my transmissions there would be a second or two delay follow by a brief carrier on the repeater's output.

When I disconnected Echolink the server core dumped and restarted. It has not shown the behavior described or core dumped since restarting. I did restart the server with core restart after updating. Perhaps some old bit of Echolink was still around after a restart? Is it safer to core stop now? My home node 2509 has also been updated but has not had this problem.

I don't seen anything in the log mentioning Echolink. I've included the log lines since the update in case maybe you can see something I missed.

messages.log core-asterisk-2023-08-03T18-02-53Z-full.txt

tsawyer commented 1 year ago

By the way, I changed the hostname of this server. It is now Angel-ASL3. We moved SCRN to a new server to separate the two groups and lessen the exposure to Echolink for SCRN who is not using it. This happened three days ago.

KB4MDD commented 1 year ago

According to the original crash information, it was echolink related. It happened when it attempted to get the directory.

The latest dump has to do with freeing an ast_frame in ast_write_stream, when called from app_rpt.

KB4MDD commented 1 year ago

For review: The problem originated at ast_write in this code in app_rpt

        if (who == myrpt->pchannel) {   /* if it was a read from pseudo */
            f = ast_read(myrpt->pchannel);
            if (!f) {
                ast_debug(1, "@@@@ rpt:Hung Up\n");
                break;
            }
            if (f->frametype == AST_FRAME_VOICE) {
                if (!myrpt->localoverride) {
                    ast_write(myrpt->txpchannel, f);
                }