Video frames dropped and not recorded after IceConnectionState: FAILED for a viewer

kputyra commented 3 years ago

Short description

A list broadcast is created with the REST API and used to publish a screen (1920x1080) captured by a browser via WebRTC. After the stream is successfully published, recording is turned on using the REST API. During the first ~30min viewers come and go; at some moments there is no viewer. Then all viewers are notified of a target bitrate of 30000, while the stream bitrate varies between ~8000 (still image) and ~1M (new slide). Refetching the stream does not improve the target bitrate. A few minutes later the last viewer stops playing the stream and the server logs IceConnectionState: FAILED. From that moment all frames of the stream are dropped and new viewers cannot fetch the stream (the frames are still dropped). All of them receive the same bitrate measurements:

current video bitrate: 1688668
audio bitrate: 96000 (the stream has no audio)
webrtc client target bitrate: 172000

After 2h the stream is stopped, but only the first 37min are saved in a recording (till the frames got dropped).

Environment

Operating system and version: Ubuntu 18.04.5 LTS
Java version: 11.0.11
Ant Media Server version: Enterprise Edition 2.4.0 20210824_1123
Browser name and version: Chrome on Ubuntu

Expected behavior

The stream should be recorded in whole and the frames should not be dropped. If there is a problem on the publishing side that prevents recording, then either the stream should stop or vodReady should be sent to the webhook.

Further notes

We are not using the provided WebRTCadaptor, but our own, which follows the WebRTC WebSocket Messaging Reference. Publishing and playing streams works smoothly in normal circumstances.
The VoD entry in the database should have a field startDate that holds the timestamp of the first recorded frame. In a situation like the above we cannot synchronize the recording with other streams, because creationDate is the timestamp when the stream stopped, which does not match the timestamp of the last recorded frame (because all frames were dropped from some point).
I don't know why the last viewer disconnected with state FAILED or whether it is relevant to the issue. I haven't found any issue with the publisher in the logs.
Unfortunately, I don't know how to reproduce the issue. It happened during a lecture that was streamed for students.

Logs

An excerpt from the log file related to the stream (I can send more when needed): https://www.dropbox.com/s/s645wuzgiggoy0a/screen-share-problem.log?dl=0

mekya commented 2 years ago

Thank you @kputyra for the issue. Moving to the backlog

Mohit-3196 commented 2 years ago

Hi @kputyra, thank you for the issue. I have been reading it and trying to understand it, and I have a few questions. You say that there is no such issue when you use the sample pages to publish and play, and that the issue occurred when using the WebRTC WebSocket Messaging Reference. Is that the case? Also, as mentioned, there is no scenario to reproduce it, so is this issue recurring or did it happen just once?

Thanks

kputyra commented 2 years ago

Hi @Mohit-3196

I don't claim the issue cannot arise when using the sample page, only that we're using a customized adapter. The messages are sent in the same order and we tried to follow both the WebSocket messaging reference and the way the adaptor handles the connection. It might be, though, that we have missed some notifications/error messages that occur only rarely.

The issue has happened only once so far, and we have been using the server for several months already, quite intensively since September. It completely desynchronized recordings and we have seen no notification about that. There was no action from the publishing user at that point (or none that we are aware of).

What would definitely help us is to understand in which situations frames are dropped and how to prevent it. As I wrote, I would not expect frames to be dropped when a stream is recorded. If you need more details, I'll be happy to discuss the issue, for instance over Zoom.

Best, Kris

Mohit-3196 commented 2 years ago

Hi @kputyra, thank you again for writing back. So the stream recording is turned on through the REST API. If the publisher's ICE connection state becomes failed and the publisher reconnects, the REST API should be called again if the stream is a zombi stream. Zombi streams are streams that are not in the database and are created on the fly. Also, we can schedule a call for this Wednesday, December 1st at 18:00 (GMT+3) and discuss this further if you are available. Thanks

Mohit-3196 commented 2 years ago

Hi @kputyra, looking forward to hearing from you. Please let me know your availability and we can proceed accordingly. Thanks

kputyra commented 2 years ago

Hi @Mohit-3196

I'm sorry for my late response, I've been overwhelmed with other projects last month. If you have time, we can have a call this Wednesday (I'm available for most of the day) or during the first two weeks of January.

Yes, the recording is turned on through REST API, but the streams are also created using REST API before the publishing starts (we have turned off accepting unknown stream ids). If I understand you correctly, this means that they are not zombi streams. Anyway, the vodReady notification was fired only after publishing has stopped completely, which suggests that the recording was still going on.

Thanks

Mohit-3196 commented 2 years ago

Hi @kputyra No worries. Thank you for your response. Yes we can schedule a call in January first week. We can do it on Wednesday, 5th January at either 11:00 or 16:00 (GMT+3). So let me know what time suits you.

Thanks

kputyra commented 2 years ago

Hi @Mohit-3196 Do I understand correctly that 11:00 for you means 8:00 London time? Then both time slots are fine for me.

Thanks

Mohit-3196 commented 2 years ago

Hi @kputyra, sure. Can you please share your email address? Thanks

kputyra commented 2 years ago

Hi @Mohit-3196 Sure, it is visible now in my profile.

Mohit-3196 commented 2 years ago

Hi @kputyra, I have scheduled a meeting for Wednesday, 5th January. See you there.

Thanks

mekya commented 2 years ago

Thank you @kputyra for the update below. We'll schedule it again this week to study the logs and the scenario.

I'm sending all the log lines with the ID of the problematic stream. This was actually a video stream (screen share) not an audio stream.

The setup

We use two streaming apps: SemLive as SFU (webcams, mic, screen) and SemLiveHQ with adaptive bitrate (4k ceiling cameras)

The speaker uses room equipment, both streamed to Ant Media via RTMP:

- ceiling camera (ApLPZjENVzpFyRHzTBhvWzhndWkAwwnO @ SemLiveHQ)
- ceiling microphone (kxIAzxTbkNOYjcRuWHhPJDkqSmIZMOoj @ SemLive)

In addition he shared slides from his laptop via WebRTC. This creates two streams:

- the original stream (vlRumxugjFKSbonTPctrhDxQUJexJvzZ)
- a preview (160x120, FPS 5) generated in a browser with JS by drawing the video scaled on a canvas and capturing the stream (JRzTFKRNmxnZDfLsjyZqYwkQnyUtGYBA)

Here are direct links to the recordings:

The preview stream is published immediately after the speaker connects to the meeting, whereas the full quality screen stream is published only after requested by a viewer.

What has happened
At 8:01 the speaker initiated the room equipment from a dedicated Pi
interface in the room:
- ceiling camera
- ceiling microphone
At 8:03 the speaker connected to the meeting from a laptop and shared
his screen, which published the screen preview.
At 8:04 the screen stream was requested for the first time,
which triggered publishing of the full stream.

After 18min the preview stream (JRzTFKRNmxnZDfLsjyZqYwkQnyUtGYBA)
stalled and was no longer recorded.

The full quality screen stream (vlRumxugjFKSbonTPctrhDxQUJexJvzZ)
stalled and was no longer recorded after 38min.

All four streams were stopped around 10:01am.

Notes
According to the log, the bitrate of preview stream (JRzTFKRNmxnZDfLsjyZqYwkQnyUtGYBA, Publish Stats line) varies between 0 and 2. I'm a bit surprised by such a small number. The video bitrate for client stats is much higher (~4000).

Attachments
Access logs:
        access-vlRumxugjFKSbonTPctrhDxQUJexJvzZ.log
        access-JRzTFKRNmxnZDfLsjyZqYwkQnyUtGYBA.log
Ant Media server logs:
        ant-media-server-vlRumxugjFKSbonTPctrhDxQUJexJvzZ.log
        ant-media-server-JRzTFKRNmxnZDfLsjyZqYwkQnyUtGYBA.log

Let me know if you need more data, like more extracts from the log
files. The range of timestamps is enough.

The error message at 2021-11-09 10:01:04,058
At that time we were deleting a broadcast after receiving the liveStreamEnded notification. We have since stopped doing this, because in some cases it prevented Ant Media from sending vodReady:

- everything was fine during tests
- vodReady was not sent for recorded streams with adaptive bitrate on the production server

The only difference is that our production server is on the same machine as Ant Media server, while the development server is on another one. I suppose the small extra latency was enough for vodReady to be triggered in the test environment, but on the production server the DELETE request arrives earlier and AMS no longer sends vodReady. The stream is recorded in both cases.
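One way to avoid this race, instead of never deleting, would be to delete a broadcast only after both notifications have arrived. The TypeScript sketch below is an illustration under assumptions, not code from our server: liveStreamEnded and vodReady are the webhook actions mentioned above, while DeferredDeleteTracker and record are hypothetical names.

```typescript
// Webhook action names as used in this thread; everything else is hypothetical.
type WebhookAction = "liveStreamEnded" | "vodReady";

class DeferredDeleteTracker {
  private readonly ended = new Set<string>();
  private readonly ready = new Set<string>();

  // Records one webhook for a stream. Returns true once both notifications
  // have been seen, i.e. when a DELETE can no longer pre-empt vodReady.
  record(action: WebhookAction, streamId: string): boolean {
    const seen = action === "liveStreamEnded" ? this.ended : this.ready;
    seen.add(streamId);

    const safeToDelete = this.ended.has(streamId) && this.ready.has(streamId);
    if (safeToDelete) {
      // Forget the stream so the tracker does not grow without bound.
      this.ended.delete(streamId);
      this.ready.delete(streamId);
    }
    return safeToDelete;
  }
}
```

The webhook handler would call record for every notification and issue the DELETE request only when it returns true. Streams that are not recorded never produce vodReady, so a real handler would also need a timeout or a separate rule for them.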

logs.zip.zip

SelimEmre commented 2 years ago

Hi guys!

This issue has been assigned to me. I tried to reproduce it. It seems that there was a network fluctuation on the publishing side in your first issue. You can see it in the log below:

2021-11-09 08:40:26,691 [vert.x-eventloop-thread-26] INFO  i.a.enterprise.webrtc.WebRTCAdaptor - Client:1382213513 for stream vlRumxugjFKSbonTPctrhDxQUJexJvzZ current video bitrate: 8184 audio bitrate: 96000 webrtc client target bitrate: 30000
2021-11-09 08:41:36,691 [vert.x-eventloop-thread-26] INFO  i.a.enterprise.webrtc.WebRTCAdaptor - Client:1382213513 for stream vlRumxugjFKSbonTPctrhDxQUJexJvzZ current video bitrate: 660488 audio bitrate: 96000 webrtc client target bitrate: 30000

As I understand it, this network fluctuation caused the IceConnectionState: FAILED error. But I couldn't figure out @mekya's logs yet. I'm still trying to understand why it's happening.
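To compare the stream bitrate with the client target bitrate over a whole log file, lines like the two above can be parsed mechanically. A minimal TypeScript sketch, assuming every relevant line has the same layout as the two quoted lines; BitrateSample, parseBitrateLine and exceedsTarget are illustrative names, not part of Ant Media Server.

```typescript
// One parsed bitrate line. Field names are illustrative only.
interface BitrateSample {
  timestamp: string;
  clientId: string;
  streamId: string;
  videoBitrate: number;
  audioBitrate: number;
  targetBitrate: number;
}

// Pattern derived from the two WebRTCAdaptor log lines quoted above.
const BITRATE_LINE =
  /^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) .*Client:(-?\d+) for stream (\S+) current video bitrate: (\d+) audio bitrate: (\d+) webrtc client target bitrate: (\d+)/;

// Returns the parsed sample, or null when the line has a different layout.
function parseBitrateLine(line: string): BitrateSample | null {
  const m = BITRATE_LINE.exec(line);
  if (m === null) return null;
  return {
    timestamp: m[1],
    clientId: m[2],
    streamId: m[3],
    videoBitrate: Number(m[4]),
    audioBitrate: Number(m[5]),
    targetBitrate: Number(m[6]),
  };
}

// True when the measured video bitrate is above the client's target bitrate.
function exceedsTarget(sample: BitrateSample): boolean {
  return sample.videoBitrate > sample.targetBitrate;
}
```

Applied to the two lines above, the first sample (8184 against a target of 30000) stays below the target, while the second (660488 against 30000) exceeds it.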

SelimEmre commented 2 years ago

Hi @kputyra,

I'm investigating this issue in detail. The logs you sent were filtered by stream ID, and we need to check the full Ant Media Server logs. Could you please share the full logs?

kputyra commented 2 years ago

Hi @SelimEmre

Thank you for taking a look at this issue. I'm not sure what exactly you mean by network fluctuation. The affected streams were generated from a screen share of slides - a still image most of the time.

I'm attaching complete logs from 08:00 till 10:04 on that day; they cover the entire lifetime of the stream.

issue.log.gz

SelimEmre commented 2 years ago

Hi @kputyra,

Thanks for the details. I investigated this issue in depth and have some ideas. Let me explain: I saw that the WebSocket connections of your 2 streams (canvas and original) were disconnecting somehow. I suspect the client's resources (not enough RAM or CPU), because I saw a lot of dropped-frame logs on the client side. But there was no low-bitrate issue.

What I recommend:

- Ant Media Server supports a session_restore callback for minor WebSocket disconnections. It is supported by the latest version (v2.4.2.1), so please upgrade to it. The auto-republish mechanism is used by the default WebRTC publisher page. As far as I know you are using a custom page for publishing, so please integrate the auto-republish mechanism into your setup. Then, if your streams disconnect for any reason, they will continue with the same stream.
- Use a medium FPS (10-15) or resolution (720p) in the original stream. It can mitigate the dropped-frame issue. Please check it for more detail.

I hope this helps.

kputyra commented 2 years ago

Hi @SelimEmre

I don't think it's because of the client machine, it was one of the state-of-the-art laptops. I agree that both streams must have been disconnected as from that time all frames were dropped. What I don't understand is why Ant Media server did not notice that, but kept the stream marked as live. Note that all frames were dropped from the moment of disconnection.

The disconnection could've happened because we have two WiFi routers and operating systems (iOS as well as Windows 10+) sometimes decide to switch from one to another. We have already detected this behavior: when this happens, all RTC connections are closed, while websocket connections are kept (the IP of the client does not change). Currently, we listen to the native event connectionstatechange on an instance of RTCPeerConnection and try to republish the stream when the state is disconnected. It seems that this was not quite the case here.
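A listener of the kind described above could look roughly like the TypeScript sketch below. It is an illustration rather than our actual publisher code: it reacts to failed as well as disconnected, on the assumption that a connection may end up in failed without the handler having acted on disconnected. PeerConnectionLike and republishOnConnectionLoss are hypothetical names; the interface only lists the members the sketch needs, which a browser RTCPeerConnection is expected to provide.

```typescript
// The values of RTCPeerConnection.connectionState defined by the WebRTC spec.
type ConnectionState =
  | "new" | "connecting" | "connected" | "disconnected" | "failed" | "closed";

// Minimal shape needed by the sketch (hypothetical helper type).
interface PeerConnectionLike {
  readonly connectionState: ConnectionState;
  addEventListener(type: "connectionstatechange", listener: () => void): void;
}

// Calls `republish` once, the first time the connection is lost.
function republishOnConnectionLoss(
  pc: PeerConnectionLike,
  republish: (state: ConnectionState) => void,
): void {
  let triggered = false;
  pc.addEventListener("connectionstatechange", () => {
    const state = pc.connectionState;
    const lost = state === "disconnected" || state === "failed";
    if (lost && !triggered) {
      // Guard against republishing twice for disconnected followed by failed.
      triggered = true;
      republish(state);
    }
  });
}
```

The republish callback would close the old connection and publish again with the same stream id; each newly created connection needs its own call to republishOnConnectionLoss.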

The session_restore feature looks interesting and I will definitely integrate it into our publisher. I will consult your publisher on how to use it. Thank you for pointing it out!

SelimEmre commented 2 years ago

Hi @kputyra,

"I don't think it's because of the client machine, it was one of the state-of-the-art laptops"

Thanks for the details. Please also consider the canvas draw interval. It should be 1000/FPS milliseconds. For example: 1000/15.

Please let me know the result of your session_restore test.
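The arithmetic behind this recommendation can be sketched in TypeScript as follows. Both function names are hypothetical, and the second helper assumes the draw loop is driven by a clock that ticks faster than the target rate, so that it needs to skip ticks.

```typescript
// Milliseconds between two canvas draws for a target frame rate.
function drawIntervalMs(fps: number): number {
  if (!(fps > 0)) {
    throw new RangeError("fps must be a positive number");
  }
  return 1000 / fps;
}

// True when enough time has passed since the last draw to paint a new frame.
// Both timestamps are in milliseconds on the same clock.
function isFrameDue(lastDrawMs: number, nowMs: number, fps: number): boolean {
  return nowMs - lastDrawMs >= drawIntervalMs(fps);
}
```

For 15 FPS this gives roughly 66.7 ms between draws, and for the 5 FPS preview stream described earlier in the thread it gives 200 ms.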

Best Regards, Selim

mekya commented 8 months ago

Closing this issue due to inactivity. Please feel free to re-open if there is still a problem.

Cheers
