Old "pre-live" caption ended up in harvested video

awslabs / live-streaming-with-automated-multi-language-subtitling

Live Streaming with Automated Multi-Language Subtitling is a project that automatically generates multi-language subtitles for live streaming web video content. Adding subtitles to your live video content can help improve reach and access, exposing your content to a much larger audience.

Apache License 2.0


Old "pre-live" caption ended up in harvested video #51

Open taschmidt opened 3 years ago

taschmidt commented 3 years ago

We have an automated post-process step that runs a MediaPackage harvest job. The subtitles embedded in the HLS stream of the resulting harvested video included a comment that someone made before we went live (a couple of minutes before, according to one of our editors).

Note the undesired "Don't ask me hard questions" comment. How do we avoid this? I can't even think of an easy way to scrub this comment since it's now embedded in several of the *.TS files in our S3 bucket.

eggoynes commented 3 years ago

Hi @taschmidt

You should be able to easily delete the text in your S3 bucket. The first few VTT caption files in your S3 bucket can be opened as text files, and they will show the caption inside. You can delete the text and re-upload. The S3 console makes you download and re-upload files. If you want it even easier, a program like CyberDuck or Transmit lets you open files in a text editor and uploads them automatically when you save. Let me know if that works.

video_1.vtt

WEBVTT

00:00:00.500 --> 00:00:02.000
Don't Ask me any hard questions.

Then change that to this:

WEBVTT

00:00:00.500 --> 00:00:02.000

taschmidt commented 3 years ago

So yeah, that's exactly what I ended up doing, but it was definitely a hassle. Any idea why that stale text showed up in the harvested stream, and how we can avoid this happening every time?

eggoynes commented 3 years ago

Hi @taschmidt

Currently that is the easiest way. You could have a script blank out the first VTT text file in S3. Also, since GitHub issues are public, I removed your video URL just in case. I have it copied here on my end.

If you make the first VTT file empty after the harvest job, the stream should still work as planned. I will think about this to see if I can find an even easier, automated way.

WEBVTT

00:00:00.500 --> 00:00:02.000
Don't Ask me any hard questions.

To empty
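The "script that blanks out the first VTT file" idea could look roughly like the sketch below. It only handles the text transformation: it keeps the `WEBVTT` header, cue identifiers, and cue timings, and drops the caption text. The function name `blank_vtt_cues` is made up for this example. Downloading and re-uploading the object (for instance with boto3's `get_object` and `put_object`) is left out and would depend on your bucket layout.

```python
import re

# Matches a WebVTT cue timing line such as "00:00:00.500 --> 00:00:02.000".
# The hours component is optional in WebVTT, so it is optional here too.
CUE_TIMING = re.compile(
    r"^\s*(?:\d+:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(?:\d+:)?\d{2}:\d{2}\.\d{3}"
)


def blank_vtt_cues(vtt_text: str) -> str:
    """Return the WebVTT text with every cue's caption text removed.

    Header lines, cue identifiers, timing lines, and blank separators are
    kept, so the file still has the same cues, just with no text in them.
    """
    out = []
    in_cue = False  # True while we are on the text lines of a cue
    for line in vtt_text.splitlines():
        if CUE_TIMING.match(line):
            in_cue = True
            out.append(line)
        elif line.strip() == "":
            in_cue = False  # a blank line ends the current cue
            out.append(line)
        elif not in_cue:
            out.append(line)  # header or cue identifier
        # Otherwise this is caption text inside a cue: drop it.
    return "\n".join(out) + "\n"
```

Run against the file shown above, this should leave only the `WEBVTT` header and the timing line, which matches the manual edit described earlier in the thread.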

taschmidt commented 3 years ago

Right, but is there a reason why the stream started with an old caption? When we run a MediaPackage harvest job, what controls which subtitles are used? Will the stream always start with the last thing said, no matter how long ago?

eggoynes commented 3 years ago

Hi @taschmidt. Some background information: everything that gets sent to AWS MediaPackage, including the captions, is saved when a VOD asset is created, and there is no way to edit what has already been ingested into MediaPackage.

So, to remove that initial caption and get the behavior you want, I believe reducing the TTL for items in the DynamoDB table that the solution creates may have the desired effect. You will have to test this. Basically, if someone says something before the stream starts, it gets saved to DynamoDB, and the Lambda@Edge function that inserts the captions keeps that caption on the screen. With a lower TTL in DynamoDB, older captions get cleared out faster, and you would not see them on the screen when watching the AWS MediaPackage HLS stream.

There could be other ways to achieve this in Lambda@Edge, such as removing the caption from DynamoDB after it has been used. Putting this in the backlog for now.
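To make the TTL idea concrete, here is a minimal sketch, not the project's actual code. DynamoDB TTL works off a numeric attribute holding an expiry time in epoch seconds, and the background deletion of expired items is not immediate, so a reader that wants a hard cutoff usually also checks the expiry itself. The attribute name `expires_at`, the 120-second value, and both function names are assumptions made for illustration.

```python
import time
from typing import Optional

# Hypothetical TTL for caption items, in seconds.
CAPTION_TTL_SECONDS = 120


def expiry_timestamp(spoken_at: float, ttl_seconds: int = CAPTION_TTL_SECONDS) -> int:
    """Epoch-seconds value to store in the item's TTL attribute."""
    return int(spoken_at) + ttl_seconds


def is_live(item: dict, now: Optional[float] = None) -> bool:
    """Return True if the caption item has not yet passed its expiry time.

    Checking this on read guards against expired items that DynamoDB
    has not physically deleted yet.
    """
    if now is None:
        now = time.time()
    return item["expires_at"] > now
```

A caller would write `expires_at` when storing each caption and skip any item for which `is_live` returns False before inserting it into a caption segment.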

taschmidt commented 3 years ago

Looks like the current TTL is 10 minutes? (link)

I'm wondering whether this would help, since the comments we would like excluded could have been spoken as little as 10 seconds before the start time. Even if we cut it in half, down to 5 minutes, that comment would still appear, right?

I'm not terribly familiar with the logic in the edge lambda, but is there some logic that could prevent "in process" captions from being sent (i.e. those where the start time is BEFORE the current time)? Or if that's not doable, it looks like our VTTs start at zero. Could that first one be omitted if the start time is prior to that?

WEBVTT

1
00:00:13.440 --> 00:00:17.900
Hello and welcome to a special edition
of community conversations. I'm
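The "omit anything that started before we went live" idea might be sketched as follows. This is an illustration only and not the logic in the project's edge Lambda. It assumes each stored caption carries a wall-clock `spoken_at` time in epoch seconds and that the go-live time is known to the caller. Both the field name and the function name `newest_caption_since` are hypothetical.

```python
from typing import Optional


def newest_caption_since(captions: list, go_live_at: float) -> Optional[dict]:
    """Pick the most recent caption, ignoring anything spoken before go-live.

    Returns None when every stored caption predates the go-live time, in
    which case the first segment would simply carry no caption text.
    """
    eligible = [c for c in captions if c["spoken_at"] >= go_live_at]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["spoken_at"])


# Example: a remark made 10 seconds before going live is skipped,
# regardless of what the TTL is set to.
captions = [
    {"spoken_at": 990.0, "text": "Don't ask me any hard questions."},
    {"spoken_at": 1013.4, "text": "Hello and welcome to a special edition"},
]
first = newest_caption_since(captions, go_live_at=1000.0)
```

Unlike a shorter TTL, a cutoff like this does not depend on how long before the start the stray remark was made.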

eggoynes commented 3 years ago

Hi @taschmidt

Sorry for the late response, I didn't see the reply. The Lambda@Edge function takes the newest captions from DynamoDB, so if the TTL is set to 2 minutes, that should clear out any captions older than 2 minutes, if that makes sense. Would that help? There is also some old logic that can mark old captions as completed, but it is off by default.