crash in stream_write callback

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Run latest FreeSWITCH build with mod_unimrcp
2. Using MRCPv1 with Nuance Speech Server 5.0
3. Run about 500-700 TTS requests over a few hours
4. stream_write() callback is made with corrupted stream->obj data

Please use labels and text to provide additional information.

I have FSMOD-39 (http://jira.freeswitch.org/browse/FSMOD-39) opened to
track this in FreeSWITCH.  Still not sure whether the module or UniMRCP is
causing the problem.  I suspect it is mod_unimrcp's problem since it is
stream->obj that is corrupted, but I opened this ticket in case anyone has
any ideas of what might cause the problem.

Here's a backtrace:
(gdb) bt
#0  0x0000003e99008950 in pthread_mutex_lock () from /lib64/libpthread.so.0
#1  0x00002aaab55dfdcb in speech_channel_write (schannel=0x2aaab01a2ea0,
data=0x2aaaac1a5568, len=0x416b6010) at mod_unimrcp.c:1379
#2  0x00002aaab55dfede in synth_stream_write (stream=<value optimized out>,
frame=<value optimized out>) at mod_unimrcp.c:1882
#3  0x00002aaab5626fec in mpf_bridge_process (object=<value optimized out>)
at src/mpf_bridge.c:52
#4  0x00002aaab55f591e in mpf_context_process (context=0x2aaaac1a4330) at
src/mpf_context.c:432
#5  0x00002aaab55f5959 in mpf_context_factory_process (factory=0x101cf950)
at src/mpf_context.c:103
#6  0x00002aaab55f5610 in timer_thread_proc (thread=0x101cc1d0, data=<value
optimized out>) at src/mpf_scheduler.c:207
#7  0x0000003e990064a7 in start_thread () from /lib64/libpthread.so.0
#8  0x0000003e984d3c2d in clone () from /lib64/libc.so.6

(gdb) f 1
#1  0x00002aaab55dfdcb in speech_channel_write (schannel=0x2aaab01a2ea0,
data=0x2aaaac1a5568, len=0x416b6010) at mod_unimrcp.c:1379
1379    mod_unimrcp.c: No such file or directory.
        in mod_unimrcp.c
(gdb) p *len
$1 = 160
(gdb) p *((char *)data)
$2 = 0 '\0'
(gdb) p *schannel
$3 = {name = 0x6f732e646f72702e <Address 0x6f732e646f72702e out of bounds>,
profile = 0x2e65746962646e75, type = 980250467,
  application = 0x736568746e79732f, unimrcp_session = 0x5354522072657a69,
unimrcp_channel = 0x430a0d302e312f50,
  memory_pool = 0xa0d31203a716553, mutex = 0x726f70736e617254, cond =
0x412f505452203a74, state = 1966821462,
  audio_queue = 0x6e65696c633b7473, channel_opened = 1869635444, rate =
29810, silence = 758395445, params = 0x6e65746e6f430a0d,
  data = 0x203a657079542d74}
(gdb) p *schannel->unimrcp_channel
Cannot access memory at address 0x430a0d302e312f50
(gdb) p *schannel->unimrcp_session
Cannot access memory at address 0x5354522072657a69

Original issue reported on code.google.com by cmrie...@gmail.com on 4 Mar 2010 at 4:04

GoogleCodeExporter commented 9 years ago

FYI this is with UniMRCP r1401.

Original comment by cmrie...@gmail.com on 4 Mar 2010 at 4:11

GoogleCodeExporter commented 9 years ago

I noticed the app is much more stable when I disabled RTCP.  It has processed 
1600
calls and 986 TTS requests with no crashes.

This reminds me of the first crash.  I noticed some mpf_timer activity going on 
after
a session was terminated.  Perhaps something didn't get cleaned up completely?

Original comment by cmrie...@gmail.com on 4 Mar 2010 at 7:44

GoogleCodeExporter commented 9 years ago

Chris, thanks for the detailed analyzes.

I've not got a chance to look into this closer yet, but just wanted to suggest 
the
same to disable RTCP. If it helps, then either the problem is in processing of 
RTCP
messages or in timers. May I suggest to enable RTCP back, but disable RTCP 
receiver
by setting rx-resolution to 0
    <param name="rtcp-rx-resolution" value="0"/>
If the problem comes back, then the issue is in timers. 
I'm about to finalize my changes in trunk soon. Then, I'm planning to perform 
several
stress tests and will try with FS either.

Original comment by achalo...@gmail.com on 4 Mar 2010 at 8:25

Added labels: Component-FreeSWITCH, OpSys-All, Priority-Medium, Type-Defect

GoogleCodeExporter commented 9 years ago

So, it just crashed again without the RTCP.  I guess I try again with valgrind 
running.

Original comment by cmrie...@gmail.com on 4 Mar 2010 at 8:46

GoogleCodeExporter commented 9 years ago

Then, the problem could be in the same timeouts discussed in Issue-72.

I've noticed your comment on Jira
> 1. The callback is not expected as the MRCP session was destroyed

If callback is being called, it means session isn't destroyed yet internally in
client stack.

Original comment by achalo...@gmail.com on 4 Mar 2010 at 8:54

GoogleCodeExporter commented 9 years ago

I found something fishy on the FS side... looks like a call terminates without
cleaning up the session.  I believe this is the cause of the problem since my
stream->obj was allocated off of the call's pool.

Original comment by cmrie...@gmail.com on 5 Mar 2010 at 1:04

GoogleCodeExporter commented 9 years ago

The same with me. I've managed to reproduce it 2 times today. I made about 4000 
calls
with SIPp -> FreeSWITCH -> UniMRCPServer and got the same backtrace. Analyzing 
logs,
I found a few suspicious things. See below filtered out, TTS-9973 related part.

 4098010  2010-03-05 16:47:48.144169 [DEBUG] mod_unimrcp.c:622 (TTS-9973) audio queue
created
 4098011  2010-03-05 16:47:48.144169 [NOTICE] mrcp_client.c:549 Create MRCP Handle
0x84a82b0 [unimrcpserver-mrcp1]
 4098012  2010-03-05 16:47:48.144169 [INFO] mrcp_client_session.c:142 Create Channel
0x84a82b0 <new>
 4098015  2010-03-05 16:47:48.144169 [INFO] mrcp_client_session.c:398 Receive App
Request 0x84a82b0 <new> [2]
 4098017  2010-03-05 16:47:48.144169 [INFO] mrcp_client.c:901 Add MRCP Handle 0x84a82b0
 4098018  2010-03-05 16:47:48.144169 [DEBUG] mrcp_client_session.c:1203 Dispatch
Application Request 0x84a82b0 <new> [2]
 4098019  2010-03-05 16:47:48.144169 [NOTICE] mrcp_client_session.c:718 Add Control
Channel 0x84a82b0 <new@speechsynth>
 4098020  2010-03-05 16:47:48.144169 [DEBUG] mrcp_client_session.c:762 Add RTP
Termination 0x84a82b0 <new>
 4098221  2010-03-05 16:47:48.184411 [DEBUG] mrcp_client_session.c:1039 On
Termination Add 0x84a82b0 <new>
 4098222  2010-03-05 16:47:48.184411 [DEBUG] mrcp_client_session.c:1039 On
Termination Add 0x84a82b0 <new>
 4098223  2010-03-05 16:47:48.184411 [INFO] mrcp_client_session.c:420 Send Offer
0x84a82b0 <new> [c:0 a:1 v:0]
 4098364  2010-03-05 16:47:48.208263 [INFO] mrcp_client_session.c:158 Receive Answer
0x84a82b0 <0532c32ad6abe740> [c:0 a:1 v:0]
 4098365  2010-03-05 16:47:48.208263 [DEBUG] mrcp_client_session.c:1160 Modify
Termination 0x84a82b0 <0532c32ad6abe740>

Pay attention to the delay here. You may see the same timed out request, but 
the big
question is what was the reason. 

 4115033  2010-03-05 16:47:58.152993 [ERR] mod_unimrcp.c:963 (TTS-9973) Timed out
waiting for channel to be ready
 4116285  2010-03-05 16:47:58.328997 [DEBUG] mrcp_client_session.c:1043 On
Termination Modify 0x84a82b0 <0532c32ad6abe740>

The expected "On Termination Modify" was eventually received. I suspect the MPF
callback was blocked from mod_unimrcp context, at least I see that all the other
threads of client stack were busy processing other messages.

 4116286  2010-03-05 16:47:58.328997 [INFO] mrcp_client_session.c:459 Raise App
Response 0x84a82b0 <0532c32ad6abe740> [2] SUCCESS [0]
 4116287  2010-03-05 16:47:58.328997 [DEBUG] mod_unimrcp.c:1752 (TTS-9973)
SYNTHESIZER channel is ready, codec = LPCM, sample rate = 8000
 4116288  2010-03-05 16:47:58.328997 [DEBUG] mod_unimrcp.c:1469 (TTS-9973) CLOSED ==>
READY

BTW, I have been doing a few other tests in parallel now and haven't seen this 
issue
on other products yet (not tried Asterisk though). So I suppose it's FS related.
Hope this helps a bit.

Original comment by achalo...@gmail.com on 5 Mar 2010 at 1:39

GoogleCodeExporter commented 9 years ago

I have a fix to the FS core that I'm going to try today.  Found some cases 
where a FS
call can terminate without cleaning up its mod_unimrcp session.

Original comment by cmrie...@gmail.com on 5 Mar 2010 at 2:05

GoogleCodeExporter commented 9 years ago

I'm going to make a change to mod_unimrcp so that it does not timeout when 
destroying
the session.  It will be better to let it stay stuck.  At least I can attach to 
the
process then and see what is going on.

Plus, we are now seeing how bad things can get if I let freeswitch destroy its
session without cleaning up the MRCP session.

Original comment by cmrie...@gmail.com on 5 Mar 2010 at 3:52

GoogleCodeExporter commented 9 years ago

Agreed, as it's indeed a significant error condition.

Original comment by achalo...@gmail.com on 5 Mar 2010 at 4:27

GoogleCodeExporter commented 9 years ago

Hello Arsen,

Tony checked in a fix to the FS core to prevent my crash and I checked in a fix 
to
mod_unimrcp to no longer allow timeout when terminating the session.

Could you send me a copy of your test setup?  SIPp script + command line args, 
FS
dialplan with TTS request, and UniMRCP server config.  I'll try to reproduce in 
the lab.

Original comment by cmrie...@gmail.com on 9 Mar 2010 at 7:00

GoogleCodeExporter commented 9 years ago

Hello Chris,

I'm glad the issue has been fixed and will try to re-test it on my end during 
the days.
At the moment, I have no access to the testbed to provide you with the exact 
setup.
Perhaps, there was nothing extraordinary. As far as I remember, I tried 5 calls 
per
sec with 3 sec call duration and typical FS TTS dialplan.

Original comment by achalo...@gmail.com on 9 Mar 2010 at 8:36

GoogleCodeExporter commented 9 years ago

Just to let you know that I have also implemented Chris' fix into app_unimrcp.

Original comment by thirion...@gmail.com on 10 Mar 2010 at 8:11

GoogleCodeExporter commented 9 years ago

FYI,
I have made 50000 calls with FreeSWITCH trunk (r16966) today ... no more 
crashes! It
crashed on about 4000 calls before.

./sipp -r 5 -rp 1000 -d 2500

<extension name="unispeak">
   <condition field="destination_number" expression="^7004$">
        <action application="answer"/>
        <action application="set" data="tts_engine=unimrcp"/>
        <action application="speak" data="Hello world from FreeSWITCH"/>
   </condition>
</extension>

The same is true for Asterisk. I have tried the latest app_unimrcp (r1579) and
reached the same 50000 calls without any noticeable issues using MRCPSynth.

I'll try to do more tests and document setup when possible.

Original comment by achalo...@gmail.com on 11 Mar 2010 at 2:17

GoogleCodeExporter commented 9 years ago

Chris, I have added you as a committer to the project. 
http://code.google.com/p/unimrcp/people/list

So you have privileged access to svn, wiki, tracker, etc. This would allow us to
collaborate more efficiency, I guess.

Glad to work with you!
Arsen

Original comment by achalo...@gmail.com on 11 Mar 2010 at 2:26

GoogleCodeExporter commented 9 years ago

Thanks for the vote of confidence, Arsen. :)

I am currently ramping up the load on my production server and I haven't seen 
any
issues since the last round of fixes, either.  Once FreeSWITCH finishes their 
1.0.5
release, I'll resume work on getting mod_unimrcp up to date with the latest rev 
of
UniMRCP.

Original comment by cmrie...@gmail.com on 12 Mar 2010 at 2:01

GoogleCodeExporter commented 9 years ago

Well and thanks for the update, Chris.

So I consider the issue has been fixed.

Original comment by achalo...@gmail.com on 12 Mar 2010 at 12:39

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

Original comment by thirion...@gmail.com on 13 Dec 2010 at 12:29

Changed state: Verified

andrenatal / unimrcp

crash in stream_write callback #75