awslabs / amazon-kinesis-video-streams-producer-sdk-cpp

Amazon Kinesis Video Streams Producer SDK for C++ is for developers to install and customize for their connected camera and other devices to securely stream video, audio, and time-encoded data to Kinesis Video Streams.
Apache License 2.0

Latency and burst observed on GetMedia #490

Closed jlong29 closed 3 years ago

jlong29 commented 4 years ago

Describe the bug: KVS exhibits an odd delay pattern that results in a lot of latency.

To Reproduce: Run the gstreamer plug-in with the following gst-launch-1.0 command (I am using an Nvidia TX1 dev kit with GPU-accelerated h264 encoding):

gst-launch-1.0 nvcamerasrc fpsRange="30.0 30.0" ! 'video/x-raw(memory:NVMM), width=(int)320, height=(int)240, format=(string)I420, framerate=(fraction)30/1' ! omxh264enc control-rate=1 bitrate=250000 peak-bitrate=500000 iframeinterval=7 profile=1 ! tee name=t t. ! h264parse ! video/x-h264,stream-format=avc,alignment=au,profile=baseline, framerate=30/1 ! kvssink stream-name="qwake-test" buffer-duration=5 storage-size=512 access-key="YOUR_PUBLIC" secret-key="YOUR_SECRET" aws-region="YOUR_REGION" t. ! h264parse ! omxh264dec ! nveglglessink -e

On the client side, I call KVS GetMedia using the aws cli. I wrote a custom parser to make the sequence of MKV fragments play nicely with my video player, mpv. This parser also pulls out the tags: Producer Timestamp, Server Timestamp, and Millis From Now.
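
For reference, a rough sketch of the consumer-side calls I'm describing (the stream name and output file are placeholders; the tag names are the MKV tags that GetMedia embeds in each fragment):

```
# Look up the GET_MEDIA data endpoint for the stream
ENDPOINT=$(aws kinesisvideo get-data-endpoint \
    --stream-name qwake-test \
    --api-name GET_MEDIA \
    --query DataEndpoint --output text)

# Pull the live MKV stream; the parser then extracts tags such as
# AWS_KINESISVIDEO_PRODUCER_TIMESTAMP, AWS_KINESISVIDEO_SERVER_TIMESTAMP
# and AWS_KINESISVIDEO_MILLIS_BEHIND_NOW from each fragment
aws kinesis-video-media get-media \
    --endpoint-url "$ENDPOINT" \
    --stream-name qwake-test \
    --start-selector StartSelectorType=NOW \
    getmedia-output.mkv
```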

Expected behavior: I expect to receive each fragment with some average delay that varies as a function of network conditions. Instead I am seeing a single long "From Now" delay, followed by a burst of very short delays. I'm also seeing a single long Server Timestamp minus Producer Timestamp difference followed by a burst of short differences. These long delays set the lower bound on latency.

SDK version number: 3.0.0

Open source building

Screenshots: Here's a spreadsheet I made based upon my observations: Delay-Analysis-15June2020-iframeInt7.pdf

Here's a sketch I made to reflect my understanding of what is going on: KVS-Tags-Visuals


MushMal commented 4 years ago

@jlong29 I took the liberty to edit the title and remove the "bug" tag.

This is an interesting case. Generally speaking, the initial call to GetMedia will result in some delay due to the backend mapping the selector and some other processing that happens. As soon as egress starts, the stream can burst to catch up with the current head. This is expected behavior. However, what you are reporting is a continuous burst-and-pause pattern.

I would suggest including some latency information from the producer. Enable the debug logs and capture a few minutes of logs while you observe the "burstiness" on the consumer (GetMedia) side. Having the metrics printed out will give us some understanding of whether the burstiness is related to the producer.
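
For example, something along these lines (a sketch only; it assumes the kvssink log-config property pointing at a log4cplus configuration such as the sample kvs_log_configuration file with the level set to DEBUG, drops the display branch for brevity, and uses placeholder credentials):

```
gst-launch-1.0 nvcamerasrc fpsRange="30.0 30.0" \
  ! 'video/x-raw(memory:NVMM), width=(int)320, height=(int)240, format=(string)I420, framerate=(fraction)30/1' \
  ! omxh264enc control-rate=1 bitrate=250000 peak-bitrate=500000 iframeinterval=7 profile=1 \
  ! h264parse ! video/x-h264,stream-format=avc,alignment=au,profile=baseline,framerate=30/1 \
  ! kvssink stream-name="qwake-test" buffer-duration=5 storage-size=512 log-config="kvs_log_configuration" \
      access-key="YOUR_ACCESS_KEY" secret-key="YOUR_SECRET_KEY" aws-region="YOUR_REGION" \
  2>&1 | tee producer-debug.log
```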

MushMal commented 4 years ago

@jlong29 do you have any update on this issue?

jlong29 commented 4 years ago

@MushMal I intend to follow up on this issue in the next few days. I am just rather busy at the moment. Please keep this ticket open. If the issue is not due to the Producer and has to do with GetMedia, how should we proceed?

MushMal commented 4 years ago

Sounds good, we will keep the issue open. Though by all means, even if the issue gets closed, we can re-open it without any problems.

If the issue proves to be on the "consumer" side then we will have to analyze where the latency is and what's causing it. It's hard to tell without any data. Generally speaking, I doubt that this could be related to the backend (though never say never) unless there is a bug, as there is plenty of capacity and bandwidth overhead on the backend side.

It is possible that the connection is being dropped on the producer side, with subsequent re-streaming from the rollback window, which would increase the density of the stream and the latency.

We really need more data to understand your scenario.

From the description you gave, where you mentioned that the stream's server-side timestamps are also "bursty", it's a telltale sign of a producer-side issue.

jlong29 commented 4 years ago

@MushMal I was able to collect the logs you asked for. I'll try to provide the relevant details:

Endpoints

Log Files

AWS-KVS-Test.log

mpv.txt

AWS-KVS-diagnostics-01July2020.txt

AWS-KVS-diagnostics-01July2020.pdf

MushMal commented 4 years ago

Hmm, interesting. Thanks for the details. I just started looking at your producer-side logs. I do see that, in general, the network throughput is good. However, I see an interesting pattern here. Somewhere from the middle of the logs (the same throughout):

I see a burst of key-frames:

Overall storage byte size: 536870912
Available storage byte size: 536858356
Allocated storage byte size: 12556
Total view allocation byte size: 6080
Total streams frame rate (fps): 34
Total streams transfer rate (bps): 257328 (251 Kbps)
Current view duration (ms): 67
Overall view duration (ms): 237
Current view byte size: 4579
Overall view byte size: 12172
Current frame rate (fps): 34.2933
Current transfer rate (bps): 257328 (251 Kbps)

2020-07-02 05:28:53 [547767710208] DEBUG - Kinesis Video client and stream metrics
Overall storage byte size: 536870912
Available storage byte size: 536852264
Allocated storage byte size: 18648
Total view allocation byte size: 6080
Total streams frame rate (fps): 33
Total streams transfer rate (bps): 257328 (251 Kbps)
Current view duration (ms): 297
Overall view duration (ms): 467
Current view byte size: 10335
Overall view byte size: 17928
Current frame rate (fps): 33.2116
Current transfer rate (bps): 257328 (251 Kbps)

2020-07-02 05:28:54 [547767710208] DEBUG - Kinesis Video client and stream metrics
Overall storage byte size: 536870912
Available storage byte size: 536844499
Allocated storage byte size: 26413
Total view allocation byte size: 6080
Total streams frame rate (fps): 32
Total streams transfer rate (bps): 257328 (251 Kbps)
Current view duration (ms): 531
Overall view duration (ms): 701
Current view byte size: 17764
Overall view byte size: 25357
Current frame rate (fps): 32.3713
Current transfer rate (bps): 257328 (251 Kbps)

2020-07-02 05:28:54 [547767710208] DEBUG - Kinesis Video client and stream metrics
Overall storage byte size: 536870912
Available storage byte size: 536837659
Allocated storage byte size: 33253
Total view allocation byte size: 6080
Total streams frame rate (fps): 31
Total streams transfer rate (bps): 257328 (251 Kbps)
Current view duration (ms): 768
Overall view duration (ms): 938
Current view byte size: 24268
Overall view byte size: 31861
Current frame rate (fps): 31.4173
Current transfer rate (bps): 257328 (251 Kbps)

2020-07-02 05:28:54 [547767710208] ERROR - getKinesisVideoStreamData(): operation returned status code: 0x5200002f
2020-07-02 05:28:54 [547466240512] DEBUG - postReadCallback(): Wrote 26598 bytes to Kinesis Video. Upload stream handle: 0
2020-07-02 05:28:54 [547466240512] ERROR - getKinesisVideoStreamData(): operation returned status code: 0x5200002f
2020-07-02 05:28:54 [547466240512] DEBUG - postReadCallback(): Pausing CURL read for upload handle: 0
2020-07-02 05:28:54 [547466240512] DEBUG - Kinesis Video client and stream metrics
Overall storage byte size: 536870912
Available storage byte size: 536831340
Allocated storage byte size: 39572
Total view allocation byte size: 6080
Total streams frame rate (fps): 30
Total streams transfer rate (bps): 256248 (250 Kbps)
Current view duration (ms): 46
Overall view duration (ms): 1183
Current view byte size: 3653
Overall view byte size: 37844
Current frame rate (fps): 30.6614
Current transfer rate (bps): 256248 (250 Kbps)

As you can see, due to the burst, the networking can't immediately keep up so it takes some time to clear the buffer. For example, initially you have 67ms of duration in this sequence. By the time the 4th key-frame is produced, you have 768ms.

I am not familiar with this media pipeline/platform, so you might want to experiment a little to ensure the encoder doesn't produce a "corrugated pipe" effect. Some encoders accumulate a few frame buffers, which allows them to do a better job encoding the video and to compute better motion compensation vectors. This does result in "burstiness".

jlong29 commented 4 years ago

@MushMal Thank you for helping me understand the producer logs better. I had been a little unclear about how the "Current view duration (ms)" parameter relates to fragments. That said, I don't agree with your analysis, for a few reasons. Let me explain my GStreamer media pipeline a bit. With the command:

gst-launch-1.0 nvcamerasrc fpsRange="30.0 30.0" ! 'video/x-raw(memory:NVMM), width=(int)320, height=(int)240, format=(string)I420, framerate=(fraction)30/1' ! omxh264enc control-rate=1 bitrate=250000 peak-bitrate=500000 iframeinterval=7 profile=1 ! tee name=t t. ! h264parse ! video/x-h264,stream-format=avc,alignment=au,profile=baseline, framerate=30/1 ! kvssink stream-name="MY-STREAM" buffer-duration=5 storage-size=512 avg-bandwidth-bps=31250 access-key="MY-KEY" secret-key="MY-SECRET" aws-region="us-east-1" t. ! h264parse ! omxh264dec ! nveglglessink -e

The first part is the camera:

gst-launch-1.0 nvcamerasrc fpsRange="30.0 30.0" ! 'video/x-raw(memory:NVMM), width=(int)320, height=(int)240, format=(string)I420, framerate=(fraction)30/1'

For testing, I used the camera provided by Nvidia in the TX1 dev kit and set the output parameters to match a camera I am using on my hardware platform. I then use the GPU accelerated encoder, omxh264enc:

omxh264enc control-rate=1 bitrate=250000 peak-bitrate=500000 iframeinterval=7 profile=1

The omxh264enc element is very well documented and optimized for low latency; I have worked with it extensively and have never heard anyone express concern about this kind of burst behavior. The "bitrate" and "peak-bitrate" parameters set the average and peak bitrate. The "control-rate" parameter enables adaptive encoding, and "iframeinterval" inserts an I-frame every 7th frame (so at 30 fps, once every ~233 ms here, a bit above the KVS threshold of 5 calls per second, i.e. 200 ms). For a fairly rigorous characterization of this h264 encoder, check out this work by RidgeRun

Notably, I then tee the encoded video stream to the kvssink element provided by AWS as well as to a display branch, which decodes the video using the GPU-accelerated h264 decoder, omxh264dec, and renders to a display attached to the TX1 dev kit. It is worth noting that I see display playback in under 200 ms.

I do not believe the problem is with the h264 encoder I am using. This points to either the kvssink element / Producer SDK or the KVS service side (PutMedia/GetMedia).

You should be able to recreate my observations using a RaspPI optimized for low-latency, which is a use case detailed extensively in the AWS KVS Producer SDK examples. Is it possible for you to recreate my observations on your side in parallel? Does my logic seem sound?

MushMal commented 4 years ago

@jlong29 the "burstiness" might also be an artifact of thread starvation in the GStreamer pipeline itself. Here are a few steps I would recommend:

1) Ensure you have GoP sizes of no less than 1 second - this is important due to potential throttling from the backend.
2) Use a GStreamer queue element between the encoder and the downstream elements, like the kvs sink (see the sketch right after this list).
3) To eliminate the KVS sink, replace kvssink with autovideosink.
4) Use videotestsrc with kvssink to eliminate the upstream dependency.
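
For illustration, a sketch of how items 1 and 2 might look applied to the pipeline above (the display branch is dropped for brevity; iframeinterval=30 gives a ~1 second GoP at 30 fps, the queue element decouples the encoder from kvssink, and the stream name, keys, and region are placeholders):

```
gst-launch-1.0 nvcamerasrc fpsRange="30.0 30.0" \
  ! 'video/x-raw(memory:NVMM), width=(int)320, height=(int)240, format=(string)I420, framerate=(fraction)30/1' \
  ! omxh264enc control-rate=1 bitrate=250000 peak-bitrate=500000 iframeinterval=30 profile=1 \
  ! h264parse ! video/x-h264,stream-format=avc,alignment=au,profile=baseline,framerate=30/1 \
  ! queue \
  ! kvssink stream-name="qwake-test" buffer-duration=5 storage-size=512 \
      access-key="YOUR_ACCESS_KEY" secret-key="YOUR_SECRET_KEY" aws-region="YOUR_REGION"
```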

We have not observed this behavior on the KVS sink side. The only possibility I can think of is the pause/un-pause of libCurl that we do in order to not spend CPU cycles busy-idling when there is nothing in the buffer to send. The unpause is controlled by libCurl, which can at times take a few hundred millis to complete.

chehefen commented 4 years ago

@jlong29 we can confirm that the burst is coming from the producer (see the attached graph). The graph was generated by measuring the e2e latency of each frame when running KvsVideoOnlyStreamingSample in the C producer repo. We will look into fixing this issue. Let's keep this issue open to track it.

jlong29 commented 4 years ago

@chehefen Thank you very much for following up with me on this issue. Our observations are consistent. Standing by.

jlong29 commented 4 years ago

@chehefen @MushMal Any updates on the status of this issue? I ask because improving this behavior would greatly improve the service we are building on KVS.

jlong29 commented 4 years ago

@chehefen @MushMal Any updates on this issue?

MushMal commented 4 years ago

We are still evaluating and prioritizing. Could you please clarify your usage scenario?

jlong29 commented 4 years ago

@MushMal Thanks for the update. We are building a low-latency streaming video service that natively stores all video streams in the cloud. Our streams are video only, no sound, and are intended for real-time observation and offline review. Kinesis Video Streams is a good candidate according to these points in the specs:

KVS seems like a good candidate. We had assumed that the 5 fragments per second would be distributed uniformly in time, which would allow for latencies in the area of 250-450 ms. This is very good for our application. Does this answer your question? I'm happy to follow up with more details via email.

MushMal commented 4 years ago

TPS for PutMedia doesn't affect the frame e2e latency.

KVS has a WebRTC solution which is specifically designed to handle P2P and low-latency real-time scenarios. KVS itself is low latency, but due to the underlying libCurl EASY interface limitations it is not exactly a "real-time" solution. It can be very fast indeed, but most real-time solutions have direct peer feedback on congestion, etc., whereas with KVS streams the producer is entirely isolated from the consumer. Basing the Producer library on curl's MULTI interface (or another client that handles pause/unpause properly) would make it very fast, but the producer and consumer sides would still be independent.

jlong29 commented 4 years ago

The thing that is limiting the e2e latency is that long (>1000 ms) periodic latency from the producer to KVS. Is that a problem with libCurl? We went with KVS instead of WebRTC because KVS stores the data natively, and WebRTC would have required more engineering to support this function (stream duplication, using the TURN server to send data for storage, etc.). The current KVS behavior is really close to what we want. If there is an opportunity to shave 300-400 ms off, we would be very happy.

jlong29 commented 4 years ago

@MushMal I see that @sharadrj has been assigned to this issue. Any updates on the status of this issue?

MushMal commented 4 years ago

@jlong29 Unfortunately, as I mentioned in my earlier correspondence, libCurl does not expose any means of resuming the event loop. We need to modify the SDK code to use curl's MULTI interface instead. This is a non-trivial task that will take a while to implement, and most importantly, we need to prioritize and schedule this work properly. I really don't have an ETA for now.

I understand your scenario of needing low latency and persistence. Unfortunately, TURN is designed such that it can't "crack" the DTLS, so persistence will not be possible with TURN.

There are two short-term solutions:

1) Have a dual-stream mode where the low-latency, on-demand streaming is implemented via WebRTC and the persistence is done via KVS streaming.
2) Have a WebRTC peer in the cloud that provides SFU-like functionality and can do dual streaming.

jlong29 commented 4 years ago

@MushMal Thanks for the suggestions. Though, unfortunately, dual streaming is not an option for us at this time. In fact, avoiding dual streaming is the reason we went with KVS in the first place. We will make the best of the longer than expected latency for now. Please keep me posted if there are any updates.

MushMal commented 4 years ago

Setting "enhancement" label. The only way to fix this moving forward is to move to MULTI interface for the curl solution. This needs to be properly evaluated.

rgroleau commented 3 years ago

This "enhancement" would be greatly appreciated for our own products too (qualcomm-based platform, omxh264enc, same behavior).

MushMal commented 3 years ago

@rgroleau could I ask you to email kinesis-video-support@amazon.com with your scenario/use case so we can have a better understanding of your needs and be able to follow-up with you directly?

hassanctech commented 3 years ago

FYI https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/pull/191

Notes are there, but I decided against CURL MULTI for now; a more trivial change yields the same results as MULTI. When we change the threading model in the future so that we don't create a new thread for each of the callback handlers, then we may switch to MULTI.

hassanctech commented 3 years ago

@jlong29 Can you please try this: https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/pull/204

jlong29 commented 3 years ago

@hassanctech Thanks for following up on this issue. I'm happy to evaluate this commit, though I will need a few days; I am currently working toward an important deadline. I am using the cpp SDK gstreamer plugin and would like to test against that, so the quickest test of the commit seems to be the following (lmk if anything more expedient comes to mind):

hassanctech commented 3 years ago

Thanks @jlong29 it would be helpful for a few people to verify before I merge the changes.

If you need to use kvssink then yeah that's probably the quickest way to test.

FYI For my tests I modify the video only streaming sample from Producer C SDK to overwrite the pts/dts with current time immediately before calling putFrame. Then on the receive side I diff the frame pts/dts with current time.

disa6302 commented 3 years ago

@jlong29 ,

FYI, our top of master today uses the CMake dependencies file to pin the producer C commit and does not use the open-source build anymore. You can modify the producer C commit under CMake/Dependencies.

jlong29 commented 3 years ago

@hassanctech Thanks for following up. Perhaps I'll stay with your suggestion to test just the new commit prior to validating interoperability.

@disa6302 Thanks for the heads up regarding the SDK changes

rejeangroleau commented 3 years ago

I tried pull request 204 on my side, on my relatively slow (CPU-wise) arm7ahf-neon device with a Qualcomm-based DSP (omx) and gst queues (threads) for the video and audio. The end result is acceptable: the CPU usage is not drastically increased, and the visible live-streaming burstiness is slightly reduced from hard-to-notice to almost unnoticeable on a good connection.

I know this is all very subjective (and specific to our slow-clock, IoT, DSP-based, LTE-cellular-modem device), so I guess this is not a generic comment about the other platforms/devices mentioned in this thread. I guess the 30 ms value right now (and the max of 150 ms) is highly dependent on various factors (connection speed/latency? and/or FPS? I really don't know, just speculating here). In our case this seems to match quite well with our already-very-bursty cellular-modem connectivity. So it seems good to me (for our very specific platform use case).

hassanctech commented 3 years ago

@rejeangroleau Thank you for trying it out! Yes it will be impacted by those factors but right now it's not adaptive but it could become so in the future.

Would you mind sharing your fps range? From what I saw in my tests with ~20fps I don't see any bursty behavior (i.e. curl pause is not invoked). We can make those sleeps or iterations longer to better suit your use case if you expect lower than 8fps for example.

A 30ms sleep is a small price to pay vs 1s added delay due to curl pause, so I'd rather be overly optimistic and unnecessarily sleep a tad bit longer than sleep and have to wake up in a few milliseconds. Ideally we could keep iterating up to 500ms which would mean ~ 17 iterations, not sure if that's necessary at the moment.

rejeangroleau commented 3 years ago

@hassanctech I think you are not seeing any pause because you have too much bandwidth available in your test scenario. You could use command line tools such as throttle (and, if you want to limit CPU usage too, cpulimit). We typically stream anywhere between 15 fps and 30 fps right now, but we have a slow device and a slow connection.
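
For what it's worth, here is one way to simulate a constrained link and CPU when reproducing this (a sketch using tc and cpulimit rather than the throttle tool mentioned above; the interface name, rate, and PID are placeholders):

```
# Shape the device's uplink to ~1 Mbit/s (interface and rate are placeholders)
sudo tc qdisc add dev eth0 root tbf rate 1mbit burst 32kbit latency 400ms

# Cap the streaming process to ~50% of one core (replace <PID> with the gst-launch pid)
cpulimit --limit 50 --pid <PID>

# Remove the shaping when done
sudo tc qdisc del dev eth0 root
```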

utsavrai commented 3 years ago

I am experiencing the same issue even though I have both a powerful machine and a good network. For me, the rendered frames on the parser side get a periodic delay every 1-2 seconds. I am using http://wiki.ros.org/kinesis_video_streamer to put media onto the Kinesis video stream.

hassanctech commented 3 years ago

@rejeangroleau @utsavrai Can you please try the latest here: https://github.com/awslabs/amazon-kinesis-video-streams-producer-c/pull/204 I added some exponential back-off with the latest commit, and I think there's enough room in there now to keep checking for data for up to ~600 ms in case of a slow network.

rejeangroleau commented 3 years ago

@hassanctech It still works well (as in, better than the head of master) on my target platform, although I don't think I am hitting the exponential back-off. Looks good to me!

hassanctech commented 3 years ago

@rejeangroleau I can walk you through adding some debug logging to see if you're entering curl pause (and how often); then perhaps we just need to increase the number of iterations. Let me get this change pushed into master, hopefully sometime today, and then we can go from there.

hassanctech commented 3 years ago

The change is in master so I'm closing the issue. @jlong29 @rejeangroleau @utsavrai please feel free to open a new issue if this problem persists.

jlong29 commented 3 years ago

@hassanctech Thanks for all the solid work on this issue. I'm excited to try it out. Unfortunately, after filing the issue, I've been pretty slammed. Recently, I've been lead engineer on a federal contract that has me very busy. This is all to say that I appreciate your work on this issue, and I hope you don't find my inability to participate at this stage discouraging.

hassanctech commented 3 years ago

@jlong29 No worries! I just wanted to make sure you were aware that there is an improvement available! Feel free to update whenever you do get a chance to try it out.

jlong29 commented 2 years ago

@hassanctech Hello from the future! I have been able to test out the improvement, and it is great. Buffers are getting pushed to the cloud much more quickly. I see an average end-to-end latency of ~400-500 ms now. This is great. I need to handle buffer overflow a little more rigorously now, but this is OK.