aws-solutions / video-on-demand-on-aws

An automated reference implementation leveraging AWS Step Functions and AWS Media Services to deploy a scalable, fault-tolerant video-on-demand workflow
https://aws.amazon.com/solutions/video-on-demand-on-aws/
Apache License 2.0

Assistance with investigating a VOD V5.0.0 missing job issue #63

Closed skawirat closed 4 years ago

skawirat commented 4 years ago

Hi @dscpinheiro,

I need your assistance investigating a VOD v5.0.0 workflow issue: sometimes a job submitted to VOD does NOT trigger a MediaConvert job.

To give you some background on the issue:

We have configured our VOD stack to trigger the workflow on MetaDataFile. We used the prior VOD version 4.0 for about 4 months and submitted 2890 VOD jobs. Over that entire period this issue never occurred; in other words, the issue did not exist in the previous version of VOD.

We have been using VOD v5.0.0 for over 2 months and have submitted around 1280 jobs, and this issue has occurred approximately 4 times.

Expected Behaviour: After the media file is added to the source S3 bucket and then the metadata file is added, we expect the VOD workflow to trigger an AWS MediaConvert job.

Problem Behaviour: On some occasions, after the metadata file is added to S3, no corresponding AWS MediaConvert job is created.

Process of elimination

For the most part, VOD 5.0.0 is a black box to me. It would therefore be great if you could guide me through the expected function flow (function call sequence) and share any tips on where to look to identify the potential root cause of the issue.

Thanks in advance

Sam

dscpinheiro commented 4 years ago

Hi @skawirat,

There aren't that many differences in the architecture between v4 and v5, so it definitely shouldn't feel like a black box.

When you upload any file to S3, the first function that runs is step-functions (and this hasn't changed from v4). This function is invoked whenever an event from S3 is received, and the solution has no control over that (if the event is not received, the workflow won't be started and there won't be any logs in CloudWatch).

According to the S3 docs (https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html), "event notifications typically deliver events in seconds but can sometimes take a minute or longer", so we have to consider that there might be a delay.
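One way to confirm whether the S3 event reached the first function at all is to log the bucket and key of every record it receives, then search CloudWatch for the metadata file's key. The following is only a minimal sketch, not the solution's actual handler: the helper name `describe_s3_records` and the sample bucket and key are made up for illustration, while the `Records[].s3.bucket.name` and `Records[].s3.object.key` fields follow the standard S3 notification format.

```python
from urllib.parse import unquote_plus


def describe_s3_records(event):
    """Return (bucket, key) pairs for every record in an S3 notification event.

    Hypothetical helper for illustration; not part of the VOD solution.
    """
    pairs = []
    for record in event.get("Records", []):
        s3 = record.get("s3", {})
        bucket = s3.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 notifications (spaces become "+").
        key = unquote_plus(s3.get("object", {}).get("key", ""))
        pairs.append((bucket, key))
    return pairs


# Made-up sample event, trimmed to the fields used above.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-source-bucket"},
                "object": {"key": "metadata/my+video.json"},
            }
        }
    ]
}

print(describe_s3_records(sample_event))
```

If the key of a "missing" job never shows up in the logs, the event was not delivered (or was delayed); if it does show up, the problem lies further down the workflow.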

These 4 files you mentioned, were they uploaded around the same time or separately?

skawirat commented 4 years ago

Hi @dscpinheiro,

Please see my replies below

This function is invoked whenever an event from S3 is received, and the solution has no control over that (if the event is not received, the workflow won't be started and there won't be any logs in CloudWatch).

As I mentioned in my 4th bullet point, the VOD workflow did start, but it doesn't seem to have completed, in the sense that it did not initiate a corresponding job on AWS MediaConvert. So the workflow seems to have aborted somewhere down the line (note: I also mentioned that no errors were logged in the error handler).

These 4 files you mentioned, were they uploaded around the same time or separately?

They had a time gap of around 1.5 to 2 weeks between them, so there is no time-wise relationship between the missing jobs.

Regards,

Sam

dscpinheiro commented 4 years ago

the VOD workflow did start, but it doesn't seem to have completed, in the sense that it did not initiate a corresponding job on AWS MediaConvert

That doesn't sound right. What do you see when you go to the state machine execution for that particular workflow?

There should be some indication there was a failure. For instance:

[Screenshot: Screen Shot 2020-03-03 at 7 04 34 AM]
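The same check can be scripted instead of clicking through the console. The sketch below is an illustration under assumptions, not part of the solution: `find_failure` is a hypothetical helper, and the sample events, state names, and error text are invented. It expects a list shaped like the `events` array returned by the Step Functions `GetExecutionHistory` API and reports the last state entered before a failure event, together with the recorded error and cause.

```python
# Event types that carry failure details, mapped to their details field.
FAILURE_DETAILS = {
    "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
    "TaskFailed": "taskFailedEventDetails",
    "ExecutionFailed": "executionFailedEventDetails",
}


def find_failure(events):
    """Return (last_state_entered, error, cause) for the first failure event.

    Hypothetical helper; returns (last_state_entered, None, None) when the
    history contains no failure event.
    """
    last_state = None
    for event in events:
        event_type = event.get("type", "")
        if event_type.endswith("StateEntered"):
            # e.g. TaskStateEntered, ChoiceStateEntered, PassStateEntered
            last_state = event["stateEnteredEventDetails"]["name"]
        elif event_type in FAILURE_DETAILS:
            details = event.get(FAILURE_DETAILS[event_type], {})
            return last_state, details.get("error"), details.get("cause")
    return last_state, None, None


# Invented history for illustration. With boto3, a real one could be fetched
# via boto3.client("stepfunctions").get_execution_history(executionArn=...)
# and read from the "events" key of the response.
sample_events = [
    {"type": "ExecutionStarted"},
    {"type": "TaskStateEntered", "stateEnteredEventDetails": {"name": "Input Validate"}},
    {"type": "TaskStateEntered", "stateEnteredEventDetails": {"name": "Mediainfo"}},
    {
        "type": "LambdaFunctionFailed",
        "lambdaFunctionFailedEventDetails": {
            "error": "TypeError",
            "cause": "'NoneType' object is not subscriptable",
        },
    },
]

print(find_failure(sample_events))
```

An execution that started but never reached MediaConvert should show a failure event like this somewhere in its history, which narrows the search to a single state.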

aassadza commented 4 years ago

We are closing this issue, but please feel free to open the issue again if you need any other support.

skawirat commented 4 years ago

Hi,

I would like to re-open this issue, as the problem has occurred again.

This time I have a bit more information: I was able to locate an SNS notification containing the following, related to MediaInfo: "errorMessage": "'NoneType' object is not subscriptable"
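For context, this is the message Python raises when code indexes into a value that turned out to be None. The sketch below only reproduces that class of error and shows one defensive pattern; `read_duration`, `read_duration_safe`, and the `container`/`duration` field names are hypothetical and are not taken from the solution's MediaInfo function.

```python
def read_duration(mediainfo_result):
    """Hypothetical parser that assumes the result is always a nested dict."""
    return mediainfo_result["container"]["duration"]


def read_duration_safe(mediainfo_result):
    """Same lookup, but fails with a descriptive error when data is missing."""
    if not mediainfo_result or not mediainfo_result.get("container"):
        raise ValueError("MediaInfo returned no container data for this file")
    return mediainfo_result["container"]["duration"]


# Indexing into None reproduces the message seen in the SNS notification.
try:
    read_duration(None)
except TypeError as err:
    message = str(err)

print(message)
```

A guard like `read_duration_safe` does not remove the underlying cause, but it would make the notification say which piece of data was missing instead of a generic TypeError.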

Some additional information

I did a Google search for the error message and found the following: https://github.com/awslabs/video-on-demand-on-aws/issues/52

While the error message discussed in that issue seems to be an exact match for what I am getting, the details discussed there don't match the behaviour I observe (I resubmitted the exact same file a 2nd time with zero changes and things worked as expected, whereas in the link above the person had to rename the file).

So it seems that the root cause of this error message is still unresolved.

Any tips on how to tackle it?

Thanks

Sam

[Screenshot: sns]