elan-ev / tobira

Video portal for Opencast
https://elan-ev.github.io/tobira/
Apache License 2.0

Tobira seems to freeze randomly, not responding to HTTP requests at all anymore #1129

Closed LukasKalbertodt closed 6 months ago

LukasKalbertodt commented 7 months ago

We got reports from ETH and Bern that Tobira randomly stops working. No more log output is produced, and HTTP requests are no longer answered at all. The process still runs, but it does not seem to be doing anything. The log does not contain any relevant entries.

Bern is running v2.5 (where it first occurred AFAIK), while ETH is running v2.6 (and they apparently did not see the problem on v2.5). Both run it behind nginx. From ETH I heard that it can happen immediately, or after hours or days. Bern reported it failing at roughly 26.02. 17:27, 22.02. 21:10 and 10.02. 16:53.

Playing around on the ETH system right now, I could somewhat reliably trigger the problem by refreshing the page immediately after restarting Tobira. I couldn't yet identify the relevant HTTP request, but I assume it's a GraphQL request. The index.html was almost (I think?) delivered correctly. The problem also occurred with the -musl build of v2.6. ETH observed that the memory usage reported by systemctl status was only 784 KB for the frozen Tobira, versus 50 MB for a normally running one. I did not see those numbers in my experiments, but it might still be useful information. The frozen process does not consume any noticeable CPU time (for example, it does not saturate a full core).
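To tell "the process is gone" apart from "the process still accepts connections but never answers", a small external probe might help while debugging. The following is only a minimal sketch using the Rust standard library. The `Probe` enum, the `probe` function and the default address `127.0.0.1:3080` are assumptions made for illustration and are not part of Tobira. It also assumes the probe talks to Tobira directly rather than going through nginx, and that a frozen Tobira still has its listening socket open, which has not been confirmed here.

```rust
use std::io::{Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Outcome of a single probe (hypothetical type, not part of Tobira).
#[derive(Debug, PartialEq)]
enum Probe {
    /// The server sent at least one byte back within the timeout.
    Responsive,
    /// The TCP connection was established, but no response bytes arrived.
    NoResponse,
    /// The TCP connection could not be established (or set up) at all.
    Unreachable,
}

/// Sends a minimal HTTP/1.1 request to `addr` and waits up to `timeout`
/// for the first byte of a reply.
fn probe(addr: SocketAddr, timeout: Duration) -> Probe {
    let mut stream = match TcpStream::connect_timeout(&addr, timeout) {
        Ok(s) => s,
        Err(_) => return Probe::Unreachable,
    };
    // Without these timeouts, the calls below could block forever on a
    // frozen server.
    if stream.set_read_timeout(Some(timeout)).is_err()
        || stream.set_write_timeout(Some(timeout)).is_err()
    {
        return Probe::Unreachable;
    }

    let request = b"GET / HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n";
    if stream.write_all(request).is_err() {
        return Probe::NoResponse;
    }

    // A read timeout, a read error, or the peer closing the connection
    // without sending anything all count as "no response".
    let mut buf = [0u8; 1];
    match stream.read(&mut buf) {
        Ok(n) if n > 0 => Probe::Responsive,
        _ => Probe::NoResponse,
    }
}

fn main() {
    // Address to probe; the default is an assumption, adjust it to your setup.
    let arg = std::env::args().nth(1).unwrap_or_else(|| "127.0.0.1:3080".to_string());
    let addr: SocketAddr = arg.parse().expect("expected an address like 127.0.0.1:3080");
    println!("{:?}", probe(addr, Duration::from_secs(5)));
}
```

Run periodically (e.g. from a systemd timer or cron), something like this could record the time a freeze starts more precisely than user reports do. A `NoResponse` result would match the symptom described above, assuming the listening socket is indeed still open.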

I have now deployed a musl build of 97d28b28cecaac6c5d16622877b9e5434aad2896, which is just v2.6 + #1120. #1120 updates Tokio and hyper, two prime candidates for causing problems like this. So far I have not been able to reproduce the problem (I think). I will see whether it keeps working for a few days.

I have not yet gone through the changelogs of Tokio and hyper, but will do so soon. Other things I have planned to debug this:

LukasKalbertodt commented 7 months ago

I skimmed the changelogs for Tokio and hyper (starting from the versions used in Tobira v2.5). The following issues are potentially relevant, but I haven't looked into any of them in detail:

oas777 commented 7 months ago

> ETH is running v2.6 (and they apparently did not see that problem on v2.5).

FWIW: I saw that problem once or twice before we upgraded, at least if it is the problem we saw at our meeting last week and discussed with Piri and Waldemar.

LukasKalbertodt commented 6 months ago

See the linked PR for an in-depth explanation of the cause. To add some more useful detail: