drachtio / drachtio-freeswitch-modules

A collection of open-sourced freeswitch modules that I use in various drachtio applications
MIT License
169 stars 114 forks source link

mod_aws_transcribe locks when calls is destroyed before transcription is stopped #136

Closed byoungdale closed 9 months ago

byoungdale commented 10 months ago

This issue was reported here in drachtio-fsmrf library. But, I am creating an isssue in this repo for it to get fixed.

When a channel is destroyed before the transcription thread is properly closed, the channel locks on switch_thread_join here.

Once onResponseCallback runs, there is no session available.

Example logs showing that we never see read thread complete log because switch_thread_join(&retval, cb->thread); gets locked waiting to join a thread that I think is already gone:

2023-09-24 16:09:51.520807 98.10% [INFO] aws_transcribe_glue.cpp:519 aws_transcribe_session_stop: finish..aws_transcribe
2023-09-24 16:09:51.520807 98.10% [DEBUG] aws_transcribe_glue.cpp:221 GStreamer::finish 0xffff20000f80
2023-09-24 16:09:51.520807 98.10% [DEBUG] aws_transcribe_glue.cpp:524 aws_transcribe_session_stop: waiting for read thread to complete aws_transcribe
2023-09-24 16:09:51.520807 98.10% [DEBUG] aws_transcribe_glue.cpp:275 GStreamer::writing disconnect event 0xffff20000f80
2023-09-24 16:09:51.601117 98.10% [DEBUG] aws_transcribe_glue.cpp:153 GStreamer 0xffff20000f80 stream got final response

fs_cli zombie call (this call is hungup and gone, but the thread is still stuck):

freeswitch@e295c381c2c5> show calls
uuid,direction,created,created_epoch,name,state,cid_name,cid_num,ip_addr,dest,presence_id,presence_data,accountcode,callstate,callee_name,callee_num,callee_direction,call_uuid,hostname,sent_callee_name,sent_callee_num,b_uuid,b_direction,b_created,b_created_epoch,b_name,b_state,b_cid_name,b_cid_num,b_ip_addr,b_dest,b_presence_id,b_presence_data,b_accountcode,b_callstate,b_callee_name,b_callee_num,b_callee_direction,b_sent_callee_name,b_sent_callee_num,call_created_epoch
a0445fab-1797-4771-a6ea-7007deca2116,inbound,2023-09-24 16:09:40,1695571780,sofia/drachtio_mrf/nobody@172.17.0.3:5060,CS_EXECUTE,unknown,,172.17.0.3,drachtio,,,,ACTIVE,,,,,e295c381c2c5,,,,,,,,,,,,,,,,,,,,,,

1 total.

freeswitch@e295c381c2c5> uuid_kill a0445fab-1797-4771-a6ea-7007deca2116
-ERR No such channel!

I haven't gotten to a fix yet. But, I think checking if the channel for the media bug is still available before trying to join that thread might work.

davehorton commented 10 months ago

The issue seems to be that the processData() function, which is running in the thread created to process responses from aws for that call, does not exit. I think to troubleshoot I suggest adding some more logging there.

Specifically, I wonder about this if condition

If that is not true -- if switch_core_session_locate(m_sessionId.c_str()); returns null, meaning the session is gone, I think we should add a log message so we can see that.

This log message:

stream got final response

Is the event that should cause the processData function to exit, and the thread to therefore end.

So if that is happening (I'm not sure it is) then we won't hit this code below in that if condition

                std::lock_guard<std::mutex> lk(m_mutex);
                m_finished = true;
                m_cond.notify_one();

                switch_core_session_rwunlock(psession);

That setting of the condition is what "wakes up" the processData() here:

        while (true) {
            std::unique_lock<std::mutex> lk(m_mutex);
            m_cond.wait(lk, [&, this] { 
                return (!m_deqAudio.empty() && !m_finishing)  || m_transcript.TranscriptHasBeenSet() || m_finished  || (m_finishing && !shutdownInitiated);
            });

Therefore, that thread may block there, waiting for a condition that never happens.

Just an idea, but try adding that logging to get further.

So...to summarize this theory:

  1. The session ends, as part of that we call m_pStream->close() at line 272
  2. This in turn causes AWS to complete transcribing and invoke OnResponseCallback
  3. In OnResponseCallback the call to switch_core_session_locate(m_sessionId.c_str()); fails because the session is gone
  4. Therefore no condition is raised and processData() blocks forever in its while loop

Please investigate when you have time @byoungdale and let me know what you think

byoungdale commented 10 months ago

Yes, thank you. Just saw this. The psession is gone, I had checked that. So, I added this else to the if (psession) { block:

                } else {
                    switch_log_printf(SWITCH_CHANNEL_LOG, SWITCH_LOG_DEBUG, "GStreamer %p session is closed/hungup. Need to unblock thread.\n", this);
                    std::lock_guard<std::mutex> lk(m_mutex);
                    m_finished = true;
                    m_cond.notify_one();
                }

and it unblocks processData's loop, I see the await ep.destroy() finishing, and I don't see the zombie call anymore.

But, does onResponseCallback only run when the transcription stream is finished? I'm having trouble finding AWS C++ SDK documentation that references onResponseCallback.

byoungdale commented 10 months ago

Ah, I see onResponseCallback is passed to StartStreamTranscriptionAsync.