Transcode tests sometimes fail, leaving the server unresponsive

andrey-yantsen / plex-api.rs

Early alpha of library for communication with Plex API written in Rust

Apache License 2.0


Transcode tests sometimes fail, leaving the server unresponsive #541

Closed Mossop closed 1 year ago

Mossop commented 1 year ago

Sometimes requesting a transcode seems to go wrong, leaving the server unresponsive for a time. This happens during tests, for example in https://github.com/andrey-yantsen/plex-api.rs/actions/runs/4664337327/jobs/8256725420 and https://github.com/andrey-yantsen/plex-api.rs/pull/537#issuecomment-1506566691. In both cases the server seems to stop responding to HTTP requests and then multiple tests fail. The differences in the specific errors in those two cases might be explained by my errors coming from macOS and CI running Linux, but it's also possible there are two different issues.

Filing an issue for this as both @andrey-yantsen and I have said we want to look at it, so it might be useful to dump anything we find here!

Mossop commented 1 year ago

While working on my own project I've found that requesting transcodes can cause the server to crash, so my guess is that that is what is going on here. It's not clear whether the transcode requests are faulty or it is just a server bug (I would argue that if a bad request can crash the server then there is a server bug, but since we have no control over the server...).

Mossop commented 1 year ago

I've been able to reproduce my own server crashing when cancelling a transcode session by writing a small app that loops infinitely creating a few offline transcode sessions then cancelling them in quick succession. I have a crash dump from the process but without debug symbols I'm not sure it is of much use.
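
For anyone following along, the shape of that loop is roughly the following. This is a minimal sketch rather than the actual app: `stress`, `Outcome`, and the `start`/`cancel` closures are hypothetical names standing in for whatever client calls create and cancel an offline transcode session, and the loop is bounded instead of infinite so that it terminates.

```rust
use std::fmt::Debug;

/// Result of a stress run. Hypothetical type, for illustration only.
#[derive(Debug, PartialEq)]
enum Outcome {
    /// Every iteration completed without an error.
    Survived { iterations: usize },
    /// A request failed on the given zero-based iteration.
    Failed { iteration: usize, reason: String },
}

/// Repeatedly creates `batch` sessions and then cancels them straight away.
///
/// `start` and `cancel` are assumptions: they stand in for the real client
/// calls that create and cancel an offline transcode session.
fn stress<S, E: Debug>(
    mut start: impl FnMut() -> Result<S, E>,
    mut cancel: impl FnMut(S) -> Result<(), E>,
    batch: usize,
    max_iterations: usize,
) -> Outcome {
    for iteration in 0..max_iterations {
        // Create a few sessions back to back.
        let mut sessions = Vec::with_capacity(batch);
        for _ in 0..batch {
            match start() {
                Ok(session) => sessions.push(session),
                Err(e) => {
                    let reason = format!("start: {e:?}");
                    return Outcome::Failed { iteration, reason };
                }
            }
        }
        // Cancel them in quick succession, with no delay in between.
        for session in sessions {
            if let Err(e) = cancel(session) {
                let reason = format!("cancel: {e:?}");
                return Outcome::Failed { iteration, reason };
            }
        }
    }
    Outcome::Survived { iterations: max_iterations }
}
```

The first failed request is treated as a hint that the server may have gone away; telling a crash apart from an ordinary error would still need a look at the server process itself.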

andrey-yantsen commented 1 year ago

I wasn't expecting the issue to be caused by cancelling the session 😅 I thought we did something wrong when starting one.

Might it be because we cancel it too quickly after it started? Can you try terminating one only if progress is above 0?

Mossop commented 1 year ago

I've tested some more. Sometimes starting the transcode crashes the server. Sometimes cancelling it does (it doesn't matter if I wait until it's even 5% complete). Even adding delays between all the calls doesn't help. Nothing I've tried helps: if you attempt to transcode enough times, the server crashes 😢

The fact that it's totally random still suggests to me that it's not strictly something we're doing wrong, unless maybe something about our randomly generated session identifiers is off. But even then it doesn't make sense to me why sessions will sometimes start fine and then cancelling them crashes the server.

andrey-yantsen commented 1 year ago

Thank you for digging into this!

Would you mind sharing the code you used for testing? I'll try poking my server later this week.

Mossop commented 1 year ago

This is the code: https://gist.github.com/Mossop/984191c342460adbde96d91d11c87aa3

On a whim I tried installing the media server on my MacBook (previously I was testing with my main server, which is the Plex Docker container). That actually ran for a lot longer, and I only ended up killing it because I needed to do something else, so this may be a Linux-only or Docker-only issue.

andrey-yantsen commented 1 year ago

I did some digging in the last few days and have not found anything besides what you already discovered.

The server crashes somewhat randomly, as long as there's no noticeable pause between starting and cancelling a transcoding session, and no pause after cancelling either. Both a test server running inside a container on an M1 Mac and my main server hosted on Linux were affected. Yet the native macOS version, in addition to working faster, does not crash.

In a few minutes, I'll add a warning in the docs for TranscodeSession::cancel() about this issue so that the end-user will be aware. Quickly starting and cancelling transcodes is unusual behaviour, so I'm not sure whether the Plex team would react at all...

I'll also add some safeguarding to the tests, so that they wait for the server to be alive before proceeding.
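
Roughly what I have in mind for the safeguard, as a minimal sketch using only the standard library. The names `wait_until` and `accepts_connections` are hypothetical and not part of plex-api.rs, and the probe only checks that the port accepts TCP connections; the real tests would more likely need to make an actual HTTP request to the server.

```rust
use std::net::{SocketAddr, TcpStream};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `probe` until it returns true or `timeout` has elapsed.
/// Returns whether the probe ever succeeded.
fn wait_until(
    mut probe: impl FnMut() -> bool,
    timeout: Duration,
    interval: Duration,
) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if probe() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        // Back off briefly so a recovering server is not hammered.
        sleep(interval);
    }
}

/// Hypothetical probe: can a TCP connection to `addr` be opened?
/// An open port does not prove that the server is fully healthy.
fn accepts_connections(addr: SocketAddr) -> bool {
    TcpStream::connect_timeout(&addr, Duration::from_secs(1)).is_ok()
}
```

A test could then start with something like `assert!(wait_until(|| accepts_connections(addr), Duration::from_secs(60), Duration::from_millis(500)))`, where `addr` and both durations are placeholders to tune.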

After that, I think we can close this issue and open a new one if we meet the problem during real-life use.