Tuner "Stuck", not released by fHDHR?

arrmo commented 2 years ago

Hi,

OK, I finally caught the case I have been chasing 😆. No solution yet, but to capture it here, by all means I'm open to suggestions!

The issue is that I see in the ceton API (status) shows tuner "1" as external - no stream args, but appears to be in use (file attached). Checking the fHDHR tuner status (also attached) ... it seems to be in use, for 12 hours?!?! But no stream args passed to the same ceton tuner object?

Have you seen this type of thing before? I'll keep digging, but it is a bit odd.

FYI, I can kill ffmpeg from the web UI, but when I try to stop the Ceton tuner ... it keeps restarting. And I don't see Plex using it - I can't seem to see where / if it's really used. Hmmm.

Thanks! ceton.json.txt tuners.json.txt

DanAustinGH commented 2 years ago

I've seen that pattern once or twice. Restarting fhdhr cleared it. Could be something is resetting the retry counter, but I have not figured out how/why yet either.

arrmo commented 2 years ago

FYI, it seems there is a correlation between this "stuck" state and the output stream "stuttering" (choppy output). So it may be a bit more of an issue than just notification / information => actual performance impact of some sort.

Restarting fHDHR is the fix ... but I also find, when in this state - the restart is VERY slow (i.e. a minute or more).

Thanks!

arrmo commented 2 years ago

OK, a bit more on this - have added some debug output, to try to find when this is happening. Very odd, but in the log I see this occurring ... thoughts? Really does seem to be an fHDHR related issue.

All within ~ 0.5 sec?

[2022-05-26 20:50:54,531] NOOB - [ceton] get_channel_stream ... 
> stream_args: {"channel": "508", "channel_name": "WFAA ABC HD", "channel_callsign": "WFAA ABC HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/hdhr/ceton/auto/v508", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "82ab713d-900c-4648-be75-94230cfb05b1", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}
[2022-05-26 20:50:54,537] NOOB - [ceton] get_channel_stream ... 
> stream_args: {"channel": "574", "channel_name": "ESPN 2 HD", "channel_callsign": "ESPN 2 HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/hdhr/ceton/auto/v574", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "2f69610c-ffaf-4491-974c-7154b3aae3f7", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}
[2022-05-26 20:50:54,634] NOOB - [ceton] get_channel_stream ... 
> stream_args: {"channel": "508", "channel_name": "WFAA ABC HD", "channel_callsign": "WFAA ABC HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/hdhr/ceton/auto/v508", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "cbe1d45e-2b69-4a2a-9743-78d8a66294fe", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}
[2022-05-26 20:50:54,730] NOOB - [ceton] get_channel_stream ... 
> stream_args: {"channel": "508", "channel_name": "WFAA ABC HD", "channel_callsign": "WFAA ABC HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/hdhr/ceton/auto/v508", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "555bdc50-d4f6-42e4-88bd-4ada9e753447", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}
[2022-05-26 20:50:54,890] NOOB - [ceton] get_channel_stream ... 
> stream_args: {"channel": "574", "channel_name": "ESPN 2 HD", "channel_callsign": "ESPN 2 HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/hdhr/ceton/auto/v574", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "e85463c0-e8b6-431a-9135-f5c546efb37b", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}

Some sort of stuck loop? This goes on hundreds of times!

[2022-05-26 20:51:31,973] NOOB - [ceton] Initiate PCIe direct streaming, channel 508 from Ceton tuner#: 1 
[2022-05-26 20:51:31,978] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:31,979] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:31,980] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:32,302] NOOB - [ceton] Preparing Ceton tuner 1 on port: 8001
[2022-05-26 20:51:33,146] NOOB - [ceton] Initiate PCIe direct streaming, channel 554 from Ceton tuner#: 1 
[2022-05-26 20:51:33,151] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,152] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,152] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,217] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,218] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,219] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,289] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,290] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,291] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,361] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,362] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,363] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,437] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,438] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,439] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,507] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,508] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,509] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,579] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-05-26 20:51:33,580] NOOB - [ffmpeg] Client requested a active transcode for stream.
[2022-05-26 20:51:33,580] NOOB - [ffmpeg] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_1', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-05-26 20:51:33,652] NOOB - [ffmpeg] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json

Will keep going through it!

DanAustinGH commented 2 years ago

I wonder if the 'stream_retry_attempts' can be a clue. I see two possibilities, we are not returning a failure in get_stream properly so the code is retrying when it does not need to, or we are but the retry logic is not ceton 'friendly'. That assumes that stream_retry_attempts works as expected on other sources.

The 2nd log is interesting. It has a pattern 1ms,2ms,3ms,70ms,1ms,2ms,3ms,70ms and so on. Will need to look through the code again, but I would not expect that bit of code to be inside any kind of look.

arrmo commented 2 years ago

Will dig here as well! Almost seems like a race condition, but I may be wrong. It's why I also detect a connection as "External" (I think 😆) ... seems like there may need to be a wait / check for the tuner state to properly change before proceeding forward?

Thanks!

arrmo commented 2 years ago

Well, interestingly enough - I am trying to create a test module => runs ffmpeg as a client, streaming directly from fHDHR / Ceton (e.g. http://192.168.2.64:5006/api/tuners?method=stream&channel=593&origin=ceton), to be able to debug (i.e. automated start and stop of streams, random timing). I am getting it going, and quite quickly created a soft lockup ... LOL. A bunch of shell messages like,

 kernel:[23568.826764] watchdog: BUG: soft lockup - CPU#6 stuck for 250s! [ffmpeg:316493]

And yep - that process number is ffmpeg (from fHDHR),

0 S root      316493       1 99  80   0 - 587238 -     42804   6 19:47 ?        00:04:23 /usr/bin/ffmpeg -hwaccel cuda -deint 1 -hwaccel_output_format cuda -i /dev/ceton/ctn91xx_mpeg0_1 -reconnect 1 -reconnect_at_eof 1 -reconnect_streamed 1 -reconnect_delay_max 2 -vf scale_cuda=1920:1080 -c:v hevc_nvenc -preset p6 -tune hq -b:v 3M -c:a ac3 -b:a 196k -f mpegts -nostats -hide_banner -loglevel info pipe:stdout

But - I couldn't even stop (kill) ffmpeg! To me that seems odd - but it does seem that this instance of ffmpeg is what is getting stuck. Any thoughts? I had to reboot the entire machine, no fix it seems?

Thanks!

DanAustinGH commented 2 years ago

Were you logging the duration time between the start and stop? How were you stopping the process? Part of me thinks this happens when a terminate is sent too soon after start-up, but I have no details to back that up.

I also came across this bit that seems to make sense, that passing a quit command to the ffmpeg and allow it to exit might make sense. https://github.com/kkroening/ffmpeg-python/issues/162

arrmo commented 2 years ago

Were you logging the duration time between the start and stop? How were you stopping the process? Part of me thinks this happens when a terminate is sent too soon after start-up, but I have no details to back that up.

So funny - I say that because I was starting to think timing is driving this also. But not necessarily time from start to stop, but rather multiple starts too close to each other. I say that because when I run my scheduler / test routine with one job (tuner), no issue. But if I start a couple at the same time ... boom. Or at least I think so 😜. Let me test more, get back to you!

I also came across this bit that seems to make sense, that passing a quit command to the ffmpeg and allow it to exit might make sense

I like it! But how to pass q to a running process? Thanks!

DanAustinGH commented 2 years ago

I am fairly certain we are spawning ffmpeg as a process, so this seems like it could fit- process.communicate(str.encode("q")) #Equivalent to send a Q

arrmo commented 2 years ago

That makes sense, but fHDHR seems to be non-responsive once this happens => trying to find another way to kill ffmpeg 😞.

Let me see if I can make it happen more reliably, then it's easier to debug.

DanAustinGH commented 2 years ago

So this would be hard to confirm, but how do we know that the process is not responsive until we try to stop it? Once we have a stuck ffmpeg, it is well and truly stuck. I have not found anyway to kill it short of a reboot.

arrmo commented 2 years ago

So this would be hard to confirm, but how do we know that the process is not responsive until we try to stop it?

Good question ... LOL. FYI, I am running fHDHR as a service => when things go south (soft lockup), then killing the service takes a LOOONG time (if it works at all), and it doesn't kill ffmpeg. I tried to manually kill (kill -9) ffmpeg ... no luck, and after a period of time, the full (Linux) OS hangs.

So this would be hard to confirm, but how do we know that the process is not responsive until we try to stop it? Once we have a stuck ffmpeg, it is well and truly stuck. I have not found anyway to kill it short of a reboot.

Same here! That's the issue. And - I think I may have found how to trigger it (still not 100% sure, but perhaps try it there also?). Start at the "same" time ... two clients (ffmpeg, VLC, whatever). Then I get the pattern like above (point 2). Make sense?

Thanks!

DanAustinGH commented 2 years ago

A race condition seems possible-

We fire multiple get_streams and make the API calls to start the tuners, overwhelming the tuner, not likely, but possible. Could mitigate this by submitting API calls to a thread to serialize them and insert a brief pause between then.
We are starting the ffmpeg process too soon in some cases against a tuner that is not feeding the hardware interface. Not sure how sensitive ffmpeg would be to not receiving input, this one also seems unlikely. In the outside chance this is happening again a brief pause in get stream before returning the streamurl to the core

And to ask my 'how do you know question' another way. I have only see the soft-lockup AFTER trying to stop the process inside of fHDHR by clicking a close button. I had an orphaned ffmpeg process, but it's state appeared normal before trying to stop it. So if you launch a bunch of sessions as fast as possible, do you see one or more go into soft-lockup, or is it only when you try to stop one of those sessions? In my case I think one of the beta Emby releases was retrying to tune too aggressively, as I have not re-encounterd this issue recently, and my only change has been to update as each beta server version is released...

arrmo commented 2 years ago

Yep, all makes sense. Pulling my hair out here - it seems I can make this happen somewhat consistently (sadly ... LOL!). And my "External" issue may be similar. Let me try to explain.

Ran a case, single tuner, just monitoring the state of the tuner. The output from some debug I added, below,

[2022-06-10 21:23:09,880] NOOB - [ceton][get_channel_stream:305] get_channel_stream ... 
> stream_args: {"channel": "505", "channel_name": "KXAS NBC HD", "channel_callsign": "KXAS NBC HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/api/tuners?method=stream&channel=505&origin=ceton", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "bf6c3da8-76c5-48c0-bb42-2c727483458d", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}
[2022-06-10 21:23:10,422] NOOB - [ceton][get_ceton_tuner_status:156] Inactive, not scan. Called from [get_channel_stream:306][/mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_origin_ceton/origin/__init__.py]
[2022-06-10 21:23:10,423] NOOB - [ceton][get_ceton_tuner_status:159] Selected Ceton tuner#: 0
[2022-06-10 21:23:10,424] NOOB - [ceton][get_ceton_tuner_status:168] > scan: False
[2022-06-10 21:23:10,424] NOOB - [ceton][get_ceton_tuner_status:169] > hwinuse: 0
[2022-06-10 21:23:10,425] NOOB - [ceton][get_ceton_tuner_status:170] > transport: STOPPED
[2022-06-10 21:23:10,426] NOOB - [ceton][get_ceton_tuner_status:171] > status: Inactive
[2022-06-10 21:23:10,427] NOOB - [ceton][get_ceton_tuner_status:172] > ceton_pcie: True
[2022-06-10 21:23:10,427] NOOB - [ceton][startstop_ceton_tuner:200] Ceton tuner 0 to be started
[2022-06-10 21:23:10,428] NOOB - [ceton][startstop_ceton_tuner:201] Setting tuner status to "Active"
[2022-06-10 21:23:10,519] NOOB - [ceton][get_channel_stream:312] > port: 8000
[2022-06-10 21:23:10,520] NOOB - [ceton][get_channel_stream:313] > status: Inactive
[2022-06-10 21:23:10,731] NOOB - [ceton][get_channel_stream:320] Preparing Ceton tuner 0 on port: 8000
[2022-06-10 21:23:11,029] NOOB - [ceton][get_channel_stream:334] Initiate PCIe direct streaming, channel 505 from Ceton tuner#: 0 
[2022-06-10 21:23:11,036] NOOB - [ffmpeg][get:81] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-06-10 21:23:11,036] NOOB - [ffmpeg][ffmpeg_command_assemble:113] Client requested a active transcode for stream.
[2022-06-10 21:23:11,037] NOOB - [ffmpeg][get:88] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_0', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-06-10 21:23:13,154] NOOB - [ceton][get_ceton_tuner_status:176] Not Inactive. Called from [get:26][/mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_origin_ceton/web/ceton_api.py]
[2022-06-10 21:23:13,155] NOOB - [ceton][get_ceton_tuner_status:179] > instance: 0
[2022-06-10 21:23:13,156] NOOB - [ceton][get_ceton_tuner_status:180] > scan: True
[2022-06-10 21:23:13,156] NOOB - [ceton][get_ceton_tuner_status:181] > hwinuse: 1
[2022-06-10 21:23:13,157] NOOB - [ceton][get_ceton_tuner_status:182] > transport: PLAYING
[2022-06-10 21:23:13,158] NOOB - [ceton][get_ceton_tuner_status:183] > status: Inactive
[2022-06-10 21:23:13,158] NOOB - [ceton][get_ceton_tuner_status:184] > ceton_pcie: True

What is really screwy here - these two lines (extracted out),

[2022-06-10 21:23:10,428] NOOB - [ceton][startstop_ceton_tuner:201] Setting tuner status to "Active"
[2022-06-10 21:23:10,520] NOOB - [ceton][get_channel_stream:313] > status: Inactive

I set the value of status to Active, then print it out right after ... and it's Inactive. Huh? It's like there is a delay setting the value?!?!

Similar, I then ran my test program, starting 2 clients right after each other (almost the same time, but can't be of course). This is the log output - sorry, it's a pic ... this cause a soft lockup, couldn't copy and paste the text any more. Grabbed this before the view crashed (I had to reboot, the machine locked up) Interestingly, both requests are being assigned to the same tuner - that can't be good. LOL! Will keep digging, but I think this may be the issue.

Thoughts?

Thanks!

DanAustinGH commented 2 years ago

And we have a winner. I had no evidence to prove it, but your logging shows what I thought would happen. Two or more requests hit at almost the same time, close enough that the 1st has not gotten far enough along to mark the tuner active and unavailable, the second starts grabbing it. Now I would normally say the worst we would see is the 2nd client sets the channel and the 1st does not get what they asked for.

But we saw that FFMPEG reading from a device can be blocking, so if both ffmpeg processes are launched against the same hardware device, one may well be starved and never get to the point that it can be killed.

We might be able to set a tuner to a new state 'PENDING' as soon as a request to tune comes in, I mean like the very first step, that way a second request looking for INACTIVE will move on to the next tuner. If any easy code change like that does not work, we will need to look into what we can do in Python with locking and other methods to ensure that data structures can only be read/updated by one process at a time.

Good job on the debugging effort...

arrmo commented 2 years ago

Yes, what you say makes sense - and was thinking something similar. Of course with multiple clients, we can (and will) hit the (test) case above, and also I'm not fully convinced that Plex doesn't rapidly do this when changing channels (no proof ... yet, LOL. Just looking at logs it seems it may be the case).

But to your comments, my worry is 1. above. It seems like Python is slow to set a variable ... I set it, then read it just after - and it's still the old value? That worries me in terms of using some sort of lockout mechanism. More to noodle on there.

Thanks!

DanAustinGH commented 2 years ago

This might be even easier. In get_ceton_tuner_status we check for transport and hwinuse. We could also check for status 'Inactive', and mark the status as 'Pending' there. That might, and I mean might be enough to have the 2nd client move to the next tuner. As it stands we appear to be using status for logging and the UI and not for choosing which tuner to select. I think if we have two requests that hit very close the exact same time this might still trigger, but I think using a transitional state and checking it might narrow the window. If it does not we'd need to add a lock as the only way to ensure that only one process can search/update the tunerstatus dict at a time...

arrmo commented 2 years ago

What you say makes sense, and I think this may be relatively easy to do (fingers crossed ... 😆). But - to get there, I need to figure this out - and it's wracking (and wrecking) my brain. Let me try to explain.

Some info from the logs (and some debug info I have added to get to the bottom of this),

[2022-06-11 09:46:36,321] NOOB - [ceton][get_channel_stream:304] get_channel_stream ... 
> stream_args: {"channel": "505", "channel_name": "KXAS NBC HD", "channel_callsign": "KXAS NBC HD", "origin_name": "ceton", "method": "ffmpeg", "duration": 0, "transcode_quality": "active", "tuner_number": null, "accessed": "http://192.168.2.64:5006/api/tuners?method=stream&channel=505&origin=ceton", "base_url": "http://192.168.2.64:5006", "client": "192.168.2.64", "client_id": "b2e3a03a-3b3f-4ed2-b2dc-1a1a8a7b06ad", "origin_quality": null, "bytes_per_read": 1152000, "buffer_size": 3, "stream_restore_attempts": 3}
[2022-06-11 09:46:36,627] NOOB - [ceton][get_ceton_tuner_status:156] Inactive, not scan. Called from [get_channel_stream:305][/mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_origin_ceton/origin/__init__.py]
[2022-06-11 09:46:36,627] NOOB - [ceton][get_ceton_tuner_status:159] Selected Ceton tuner#: 0
[2022-06-11 09:46:36,628] NOOB - [ceton][get_ceton_tuner_status:168] > scan: False, hwinuse: 0, transport: STOPPED, status: Inactive, ceton_pcie: True
[2022-06-11 09:46:36,628] NOOB - [ceton][startstop_ceton_tuner:198] Ceton tuner 0 to be started
[2022-06-11 09:46:36,628] NOOB - [ceton][startstop_ceton_tuner:199] Setting tuner status to "Active"
[2022-06-11 09:46:36,628] NOOB - [ceton][startstop_ceton_tuner:201] Checking, status: tuner 0, status Active
[2022-06-11 09:46:36,786] NOOB - [ceton][get_channel_stream:311] > port: 8000
[2022-06-11 09:46:36,787] NOOB - [ceton][get_channel_stream:312] > tuner: 0
[2022-06-11 09:46:36,788] NOOB - [ceton][get_channel_stream:313] > status: Active
[2022-06-11 09:46:36,949] NOOB - [ceton][get_channel_stream:320] Preparing Ceton tuner 0 on port: 8000
[2022-06-11 09:46:37,313] NOOB - [ceton][get_channel_stream:334] Initiate PCIe direct streaming, channel 505 from Ceton tuner#: 0 
[2022-06-11 09:46:37,318] NOOB - [ffmpeg][get:81] Loaded transcoding configurations file, /mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_stream_ffmpeg/transcode.json
[2022-06-11 09:46:37,319] NOOB - [ffmpeg][ffmpeg_command_assemble:113] Client requested a active transcode for stream.
[2022-06-11 09:46:37,319] NOOB - [ffmpeg][get:88] ffmpeg command: ['/usr/bin/ffmpeg', '-hwaccel', 'cuda', '-deint', '1', '-hwaccel_output_format', 'cuda', '-i', '/dev/ceton/ctn91xx_mpeg0_0', '-reconnect', '1', '-reconnect_at_eof', '1', '-reconnect_streamed', '1', '-reconnect_delay_max', '2', '-vf', 'scale_cuda=1920:1080', '-c:v', 'hevc_nvenc', '-preset', 'p6', '-tune', 'hq', '-b:v', '3M', '-c:a', 'ac3', '-b:a', '196k', '-f', 'mpegts', '-nostats', '-hide_banner', '-loglevel', 'info', 'pipe:stdout']
[2022-06-11 09:46:39,587] NOOB - [ceton][get_ceton_tuner_status:175] Not Inactive. Called from [get:26][/mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_origin_ceton/web/ceton_api.py]
[2022-06-11 09:46:39,588] NOOB - [ceton][get_ceton_tuner_status:178] > instance: 0, scan: True, hwinuse: 1, transport: PLAYING, status: Inactive, ceton_pcie: True
[2022-06-11 09:46:39,588] NOOB - [ceton][get_ceton_tuner_status:184] Not Active. Called from [get:26][/mnt/raid/ProgRAID/fHDHR/plugins/fHDHR_plugin_origin_ceton/web/ceton_api.py]
[2022-06-11 09:46:39,589] NOOB - [ceton][get_ceton_tuner_status:187] Ceton tuner, status "fail": Instance = 0, Status = Inactive

A request comes in (that I have initiated), through get_channel_stream, and this in turn calls get_ceton_tuner_status - all makes sense so far. It finds that tuner #0 is available (that's correct), and startstop_ceton_tuner is called - also correct. In startstop_ceton_tuner, the status is set, through self.tunerstatus[str(instance)]['status'] = "Active". I added a line right here to check, and yep ... Checking, status: tuner 0, status Active => all OK. And, back in get_channel_stream, yep, still OK ... > status: Active. But then it gets weird 😜. The API (from a call I have) checks the status, in get_ceton_tuner_status, but now status: Inactive. Huh?!?! Need to figure this out, then I can use Active and Inactive for dealing with the issue above - but first need them to be reliable.

Any suggestions welcome!

Thanks!

DanAustinGH commented 2 years ago

I found a few good articles on thread-safe dict access. The app is using threads with Flask, and so it is possible there is a context switch and the 2nd tread reads the status before it is updated, but then context switches back to the 1st thread before logging.

The more digging I did, the more convinced we'll need to have a mutex lock around updates to status. We should not need it for reads, only updates.

arrmo commented 2 years ago

That makes sense - though I have never used them before 🤣. Hoping you have!

If we can count on the status being reliable, I do think we can use it in a pretty easy way to handle the dual request issue.

Thanks!

arrmo commented 2 years ago

BTW, just yell if you want my test code. One small (minor) item in it I'm still working on, but in general it lets you trigger the issue.

DanAustinGH commented 2 years ago

I have not had a hung session in awhile, so best I can do is confirm it does not break something...

On the subject of locking, I have read about it in various projects out of general interest, and while I get the concept, I also have never implemented a solution using it. I may see if I can code something up in the next few days...

arrmo commented 2 years ago

Or - another thought on this. In parallel, I'll try to make use of the scan argument (passed to get_ceton_tuner_status). If this is only a scan (like the API uses), then don't set / change status - just use it as a check. It will "catch up", and no "harm" is done as it's not changing anything.

Make sense?

Thanks!

arrmo commented 2 years ago

OK, you hit the nail on the head! And ... I found a very easy solution 😄. Tested, and works. I did this,

    def get_channel_stream(self, chandict, stream_args):
        self.lock.acquire()
        self.plugin_utils.logger.noob('Locked ...')

=> Basically, as soon as a channel stream is requested ... lock it (first thing). Process this one, and then before returning,

        self.lock.release()
        self.plugin_utils.logger.noob('Unocked.')
        return stream_info

Checked it out, and - yep, it works! The first one is cleanly handled, then unlocks ... and let's the second one proceed. I tested even with 3 starting at the same time, and they sequentially take tuners 0, 1, 2.

Make sense? Seem like a reasonable approach?

Thanks!

deathbybandaid commented 2 years ago

Catching up, locks are pretty common in fhdhr, I look forward to a PR

DanAustinGH commented 2 years ago

Sweet! Now for the load and traffic volume the app will likely see the location of the lock will not matter (much), but ideally it would right before the point of contention inside the loop in ceton_get_tuner_status right before changing the state. But that is such a micro-optimization that it is not worth chasing, and having it that high up in the stack will cover other corner case that we might not have encountered.

I really wish we had a way to get metrics on how many people are using the plugin. I am aware of three, and I am biased, but this is a really excellent setup at this point, and solving this rare crash is icing on the cake...

arrmo commented 2 years ago

Catching up, locks are pretty common in fhdhr, I look forward to a PR

Yep! Need to fix the one other niggling thing that is "ticking" me off ... LOL! It's about the API, and status - but won't let that hold this fix up too much.

But that is such a micro-optimization that it is not worth chasing, and having it that high up in the stack will cover other corner case that we might not have encountered.

Agreed! I'll clean some other things up there, get the PR - and of course you can comment on it (please do!). My thought is the same as yours ... make it robust first, worry about further optimization once we make sure to avoid the "critical" (reboot causing) case.

I really wish we had a way to get metrics on how many people are using the plugin. I am aware of three, and I am biased, but this is a really excellent setup at this point, and solving this rare crash is icing on the cake...

So much agree! And I do have to say, with this 4-tuner card now, it really is an awesome setup. The whole project is awesome as far as I'm concerned. Thanks!

arrmo commented 2 years ago

OK, I seem to have this fully working here - PR coming right away. It handles the locking (like we talked about), also logic updates as the state was getting incorrectly set ... that was causing part of my confusion. By all means, check it out and comment. Thanks!

arrmo commented 2 years ago

PR #44

arrmo commented 2 years ago

Closing this now, PR merged.

fHDHR / fHDHR_plugin_origin_ceton

Tuner "Stuck", not released by fHDHR? #42