streams taking much time with startselector type as continuationToken

aws / amazon-kinesis-video-streams-parser-library

The Amazon Kinesis Video Streams parser library is for developers to include in their applications. It makes it easy to work with the output of video streams, such as retrieving frame-level objects, metadata for fragments, and more.

Apache License 2.0

streams taking much time with startselector type as continuationToken #94

Closed varun-tangoit closed 4 years ago

varun-tangoit commented 4 years ago

Hi,

We are trying to poll a Kinesis video stream with the start selector type set to CONTINUATION_TOKEN. When fetching with the token, polling each frame takes a long time. For example, I started polling the stream at 00:00, and it took 25 minutes to receive 100 frames. Is there any configuration that needs to change, or anything I missed? Kindly suggest how to improve the poll time when using a continuation token.

MushMal commented 4 years ago

@varun-tangoit could you please give a little more details about your scenario? Can you include the logs?

varun-tangoit commented 4 years ago

Thanks @MushMal. Sure. We deployed this parser library on an EC2 instance (t2.large) with just a few modifications: instead of startSelectorType NOW, I start the GetMedia worker with a continuation token. But we are not able to get the stream in real time.

varun-tangoit commented 4 years ago

Hi @MushMal, I have attached a log showing when each frame was created and when the next continuation-token poll happened. new_log.log

MushMal commented 4 years ago

@varun-tangoit it's not possible to tell from your consumer-side application what it's doing or what's going on.

I see the following:

2020-05-14 06:12:48 ERROR GetMediaWorker:96 - Failure in GetMediaWorker for streamName cc14ff03f2fdea02802956d49c2992f8 java.util.NoSuchElementException: No value present

This might be causing some restarts or failures in your job.

This is likely related to a width/height not being present in the fragment.

If that's the case, there are a number of reasons why it's not there.

Could you please try the following:

Include debug logs from your producer side.
Specify what content type you are using. For video it has to be "video/h264" in order to extract these and generate them into the MKV.
Can you tell us how you specify the codec private data (CPD) to the producer SDK? Is it specified directly or auto-extracted?
Can you capture the CPD so we can analyze whether there is a problem with parsing it to extract the width and height?

varun-tangoit commented 4 years ago

Hi @MushMal, thanks for the response. We are checking that "no such element" exception with the producer side, but it is not the problem I asked about. My question is specifically about consuming video streams with the start selector type CONTINUATION_TOKEN: I don't get the stream in real time. Whenever the producer starts producing, I can't consume in real time. The lag we are facing right now is approximately 1 to 2 hours. Kindly suggest how to handle this use case without the delay, or tell me if there is any configuration I missed.

MushMal commented 4 years ago

@varun-tangoit depending on the network conditions, etc., you might have a delay of a few seconds, but certainly not minutes or even hours as you are stating.

Having almost no information provided by you, I can only deduce that you are either not using the start selector properly or you might have issues with the timestamps.

In order to use realtime (aka the head of the stream) you need to specify NOW as the start selector per https://docs.aws.amazon.com/kinesisvideostreams/latest/dg/API_dataplane_StartSelector.html#KinesisVideo-Type-dataplane_StartSelector-StartSelectorType
If you are using a timestamp selector, make sure you specify the right timestamp choice for your application. PRODUCER_TIMESTAMP uses timestamps from within the stream itself, which are generated on the producer device from the producer clock and might have drifted.
Make sure your producer does not fall behind due to network pressure on the producer side.
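
As an illustration of the selector choices above, here is a minimal sketch that builds the StartSelector structure for a GetMedia request. The field names (StartSelectorType, ContinuationToken, StartTimestamp, AfterFragmentNumber) follow the StartSelector documentation linked above. The helper names build_start_selector and selector_for_restart are hypothetical, and nothing here calls AWS.

```python
from datetime import datetime
from typing import Optional


def build_start_selector(selector_type: str,
                         continuation_token: Optional[str] = None,
                         start_timestamp: Optional[datetime] = None,
                         after_fragment_number: Optional[str] = None) -> dict:
    """Build a StartSelector dict (hypothetical helper, no AWS call is made)."""
    selector = {"StartSelectorType": selector_type}
    if selector_type == "CONTINUATION_TOKEN":
        if not continuation_token:
            raise ValueError("CONTINUATION_TOKEN requires a continuation token")
        selector["ContinuationToken"] = continuation_token
    elif selector_type in ("SERVER_TIMESTAMP", "PRODUCER_TIMESTAMP"):
        if start_timestamp is None:
            raise ValueError(f"{selector_type} requires a start timestamp")
        selector["StartTimestamp"] = start_timestamp
    elif selector_type == "FRAGMENT_NUMBER":
        if not after_fragment_number:
            raise ValueError("FRAGMENT_NUMBER requires a fragment number")
        selector["AfterFragmentNumber"] = after_fragment_number
    elif selector_type not in ("NOW", "EARLIEST"):
        raise ValueError(f"unknown selector type: {selector_type}")
    return selector


def selector_for_restart(last_token: Optional[str]) -> dict:
    """Resume from a saved token if there is one, otherwise start at the head."""
    if last_token:
        return build_start_selector("CONTINUATION_TOKEN",
                                    continuation_token=last_token)
    return build_start_selector("NOW")


if __name__ == "__main__":
    print(selector_for_restart(None))
    print(selector_for_restart("example-token"))
```

The resulting dict has the shape expected for the StartSelector parameter of a GetMedia call (for example, the get_media call on boto3's kinesis-video-media client). Resuming with a token means reading from the past, so the consumer only reaches the head of the stream if it processes faster than the producer writes.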

I recommend troubleshooting the issue by specifying the proper selector.

Please update the thread with detailed information on your application integration mode, etc.

varun-tangoit commented 4 years ago

@MushMal thanks for the quick reply. Yes, you are right. But we currently run 25+ streams simultaneously, and that will grow to 100+. In that situation the consumer jobs run in parallel, and whenever any stream drops or stops, we need to resume where we left off. That is why I chose CONTINUATION_TOKEN as the startSelectorType: we don't want to lose any data. But in this scenario, too, consuming the streams takes a long time. What would you suggest for our use case so that we can consume efficiently in real time? Even when I use startSelectorType NOW, we see the 1 to 2 hour lag mentioned earlier.

MushMal commented 4 years ago

Your scenario is certainly one of the mainstream scenarios. You could indeed use other selectors that will give you media from the "past". In this case, the media will be retrieved "faster-than-realtime", based on the bandwidth you have and the ability of your application to process the fragments, so that it can "catch up" to the head. You really need to understand where your bottleneck is: the bandwidth or the consumption rate. Are you running your "consumer" application on a backend EC2 instance, for example, in the same region? If so, the intra-datacenter bandwidth is super fast, so it's not going to be the factor, but the processing speed will be. For example, if you are parsing the GetMedia output MKV and running some sort of vision algorithm, then the processing itself will likely be the bottleneck.
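
The "catch-up" behavior described above comes down to simple arithmetic. The sketch below estimates how long a consumer needs to reach the head of the stream. Rates are in seconds of media per wall-clock second, so a live producer writes at 1.0. The function name catch_up_seconds and the example numbers are illustrative assumptions, not measurements from this thread.

```python
import math


def catch_up_seconds(backlog_seconds: float,
                     consume_rate: float,
                     produce_rate: float = 1.0) -> float:
    """Wall-clock seconds needed to drain a backlog of media.

    backlog_seconds: how much media (in seconds) the consumer is behind.
    consume_rate:    seconds of media the consumer processes per second.
    produce_rate:    seconds of media the producer writes per second.
    """
    if backlog_seconds <= 0:
        return 0.0
    drain_rate = consume_rate - produce_rate
    if drain_rate <= 0:
        # The consumer is no faster than the producer: the lag never shrinks.
        return math.inf
    return backlog_seconds / drain_rate


if __name__ == "__main__":
    # One hour behind, processing three times faster than realtime.
    print(catch_up_seconds(3600, consume_rate=3.0))   # 1800.0
    # One hour behind, processing at half of realtime: never catches up.
    print(catch_up_seconds(3600, consume_rate=0.5))   # inf
```

If the consumer processes media more slowly than the producer writes it, no selector choice removes the lag. Only a faster consumer does.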

varun-tangoit commented 4 years ago

Yes @MushMal. We are currently running on an EC2 t2.xlarge instance (with 25+ streams), and that machine is in the same region, US-EAST-1. We don't run any vision algorithm. We just consume the video, save it as images, and push them to S3 once there are 100 images. There is no intensive processing. By the "past" selector, do you mean the EARLIEST option? We have tried everything you mentioned above (NOW, EARLIEST, CONTINUATION_TOKEN), and I get lag with all of them. Could you suggest what to do in this case, or what we missed?

MushMal commented 4 years ago

Let's start with some clarity.

Try the selectors NOW and SERVER_TIMESTAMP to ensure you are getting the latest. Start your producer and see how far behind your consumer is. Let us know how you determine the "lag": how you detect it and how you quantify it. For example, there is an overall latency that is a constant time behind the producer clock. This is expected, as there is network latency and some minimal processing time in the KVS backend. Then, determine whether the latency or the lag increases as you continue consumption. If it does, that means something in your consumer might be taking time. For example, you've mentioned that you are consuming video. Does that mean you are decoding the h264 video stream? If so, check the CPU load/utilization. Also, you've mentioned that you are persisting frames into S3. If that is done synchronously on, say, a 30 fps stream, you will be hitting S3 synchronously 30 times a second and you will hit latencies.
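
One way to quantify the lag as described above is to compare the wall clock with each fragment's timestamp and check whether the difference grows over time. The sketch below assumes the consumer can read a per-fragment timestamp in seconds (for example, from the AWS_KINESISVIDEO_SERVER_TIMESTAMP tag in the GetMedia output). The LagTracker class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LagTracker:
    """Collects (wall clock, lag) samples and reports the lag trend."""
    samples: List[Tuple[float, float]] = field(default_factory=list)

    def record(self, wall_clock_s: float, fragment_timestamp_s: float) -> float:
        """Store one sample and return the lag in seconds."""
        lag = wall_clock_s - fragment_timestamp_s
        self.samples.append((wall_clock_s, lag))
        return lag

    def lag_growth_per_second(self) -> float:
        """Least-squares slope of lag over wall-clock time.

        Close to 0 means a constant offset (expected network and backend
        latency). A clearly positive value means the consumer falls behind.
        """
        n = len(self.samples)
        if n < 2:
            return 0.0
        mean_t = sum(t for t, _ in self.samples) / n
        mean_lag = sum(lag for _, lag in self.samples) / n
        variance = sum((t - mean_t) ** 2 for t, _ in self.samples)
        if variance == 0:
            return 0.0
        covariance = sum((t - mean_t) * (lag - mean_lag)
                         for t, lag in self.samples)
        return covariance / variance


if __name__ == "__main__":
    tracker = LagTracker()
    # Illustrative values: the consumer advances 5 s of media every 10 s.
    for wall, fragment_ts in [(100.0, 100.0), (110.0, 105.0), (120.0, 110.0)]:
        tracker.record(wall, fragment_ts)
    print(tracker.lag_growth_per_second())
```

A slope near zero with a small constant lag points to normal latency. A steadily positive slope points to the consumer side: decoding, image writing, or uploads. The GetMedia output also carries an AWS_KINESISVIDEO_MILLIS_BEHIND_NOW tag that can serve as a cross-check.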

It's hard for me to debug your application based on a few symptoms you are describing.

varun-tangoit commented 4 years ago

@MushMal Yeah,

If we use NOW as the startSelectorType, we only get the data present at the moment of that particular poll to Kinesis, so we end up losing data (we won't get the full time series). We need continuous data for further processing.
We are currently using a t2.xlarge machine and running 25+ streams. Would you suggest a different instance type?
We don't do any intensive jobs, and we don't hit S3 frequently: we write the images to a local folder and push them to S3 only once there are 100. Kindly help us resolve this for our use case: consuming streams in real time without losing data.

MushMal commented 4 years ago

Have you troubleshot your consumer application as I suggested, to understand where the bottlenecks are? If the consumer is running in the same region as your stream, then the intra-datacenter network is super fast and it won't be the cause of not being able to catch up.

Also, what's your retention period? If the selector falls behind the trim horizon, the earliest fragment will be returned.
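
The retention question can be checked with a small comparison. This sketch assumes you know the stream's retention period in hours (DescribeStream reports it as DataRetentionInHours) and the timestamp of the last fragment your consumer checkpointed. The function name is_behind_trim_horizon is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_behind_trim_horizon(checkpoint_time: datetime,
                           now: datetime,
                           retention_hours: int) -> bool:
    """True if the checkpoint is older than the oldest retained data."""
    trim_horizon = now - timedelta(hours=retention_hours)
    return checkpoint_time < trim_horizon


if __name__ == "__main__":
    now = datetime(2020, 5, 14, 12, 0, tzinfo=timezone.utc)
    checkpoint = datetime(2020, 5, 13, 6, 0, tzinfo=timezone.utc)
    # With 24 hours of retention, a 30-hour-old checkpoint has been trimmed.
    print(is_behind_trim_horizon(checkpoint, now, retention_hours=24))
```

When this returns True, resuming from the checkpoint cannot return the fragment that was requested, so some data between the checkpoint and the trim horizon is already gone.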

MushMal commented 4 years ago

Important: are you seeing this as a regression on a device that is already in production, or are you still developing this application?

MushMal commented 4 years ago

I am going to resolve this issue as by-design, since in your use case you are requesting data from the past.