AcademySoftwareFoundation / OpenTimelineIO

Open Source API and interchange format for editorial timeline information.
http://opentimeline.io
Apache License 2.0

Subtitles & captions #62

Open jminor opened 7 years ago

jminor commented 7 years ago

We should support subtitles and captions.

jminor commented 7 years ago

There are several existing subtitle formats that could be supported via adapters: WebVTT (https://w3c.github.io/webvtt/), SSA (https://wiki.multimedia.cx/index.php?title=SubStation_Alpha), SRT, etc.

FCP XML supports <title> elements. AAF probably supports titles too. Someone will need to research this...

Looking at how those formats represent titles, we should be able to model them as a subclass of Marker. Formatting information will likely be hard to normalize across all the formats, but in the spirit of OTIO, we can focus on the content and timing of the titles as that is likely to be the most useful part in practice.
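
To illustrate the "content and timing first" idea, here is a minimal Python sketch that reads SRT cues into plain records and ignores styling for now. `Caption`, `parse_srt`, and the millisecond fields are hypothetical names used only for this example; an actual adapter would produce OTIO objects (for example the proposed Marker subclass) with `opentime` values rather than these records.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical record holding only content and timing; a real adapter would
# build OTIO objects with opentime values instead.
@dataclass
class Caption:
    text: str
    start_ms: int
    duration_ms: int
    metadata: dict = field(default_factory=dict)  # format-specific extras

_STAMP = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")

def _to_ms(stamp: str) -> int:
    match = _STAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"bad SRT timestamp: {stamp!r}")
    h, m, s, ms = map(int, match.groups())
    return ((h * 60 + m) * 60 + s) * 1000 + ms

def parse_srt(data: str) -> List[Caption]:
    """Parse SRT cues (index line, timing line, text lines) into Captions."""
    captions = []
    data = data.replace("\r\n", "\n").strip()
    for block in re.split(r"\n\s*\n", data):
        lines = block.split("\n")
        if len(lines) < 2 or "-->" not in lines[1]:
            continue  # skip blocks without a timing line
        left, right = lines[1].split("-->", 1)
        # Only the first token on each side is the timestamp; some SRT files
        # append coordinates after the end time.
        start = _to_ms(left.split()[0])
        end = _to_ms(right.split()[0])
        captions.append(Caption("\n".join(lines[2:]), start, end - start))
    return captions
```

Everything format-specific (positions, tags) can later go into `metadata` without changing the timing model.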

The use cases for subtitles in OTIO seem to be swapping localized subtitles in and out, and/or passing subtitles upstream or downstream in a production pipeline.

jminor commented 7 years ago

Here is some info about how AAF encodes video titles: https://www.amwa.tv/downloads/specifications/AS-05_AAF_Effects_protocol_v1.pdf

I'm not sure if this is the same as a subtitle track. This still needs more research.

jminor commented 6 years ago

Here are some more links for reference, though the licenses of some of these may not be compatible: http://www.nikse.dk/SubtitleEdit/ https://github.com/ubershmekel/pytitle https://github.com/pyahmed/sub2xml

KarthikRIyer commented 4 years ago

I'd like to attempt this. But I'm still trying to understand how different subtitle formats work.

http://www.nikse.dk/SubtitleEdit/ seems helpful. It has a common representation for subtitles: https://github.com/SubtitleEdit/subtitleedit/blob/master/libse/Subtitle.cs

Will the license be a problem if we refer to this implementation (because we won't be lifting code directly)?

Jameclarke commented 4 years ago

Here is a great article referencing how subtitles are used. https://jonnyelwyn.co.uk/film-and-video-editing/adding-captions-in-premiere-pro-fix-common-problems/

apetrynet commented 4 years ago

Some of the subtitle formats include positioning so I think this ties into the coordinate system and annotations as well.

Related: #773, #771

KarthikRIyer commented 4 years ago

What are annotations?

Are the positions given as coordinates, or as something like "align: left" / "align: center"? I think I saw the alignment approach in WebVTT. If we use that approach, we could leave it up to the adapter or the final consumer of the subtitles to decide the actual position, couldn't we?

apetrynet commented 4 years ago

Annotations can be drawings, text, etc. overlaid on the video. There have been discussions on how to implement them, and that got tied to the position coordinate system. Yes, those aligns are the ones I was thinking about. It could be up to the adapters, but if we want a common-ground schema for several subtitle formats, it's worth considering a way of storing positions. I don't think waiting for a coordinate system should hold you back from working on this; we can always revisit positioning at a later stage. Just my two cents (since I looked at this issue myself a few weeks ago :)
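
A minimal sketch of the "store the keyword, let the consumer decide" approach, assuming a position is kept as an alignment keyword plus an optional normalized coordinate. `CaptionPosition`, `resolve_x`, and the keyword-to-anchor table are illustrative assumptions, not taken from WebVTT or any other subtitle format.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative assumption, not taken from any subtitle spec: the horizontal
# anchor each alignment keyword maps to, as a fraction of frame width.
ALIGN_ANCHORS = {"left": 0.0, "center": 0.5, "right": 1.0}

@dataclass
class CaptionPosition:
    """Hypothetical: an alignment keyword, optionally overridden by an
    explicit normalized x coordinate (0.0 = left edge, 1.0 = right edge)."""
    align: str = "center"
    x: Optional[float] = None

def resolve_x(pos: CaptionPosition, frame_width: int) -> int:
    """Done by the adapter or final consumer, not stored in the timeline."""
    fraction = pos.x if pos.x is not None else ALIGN_ANCHORS[pos.align]
    return round(fraction * frame_width)
```

Keeping the keyword means a coordinate system can be added later as the optional override without invalidating existing files.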

meshula commented 4 years ago

@KarthikRIyer Since the license you reference is GPL, one should not reference that code, nor transliterate it.

reinecke commented 4 years ago

Hey @KarthikRIyer, check out TTML2 to get a feeling for what can be expressed in timed text: https://www.w3.org/TR/ttml2/

The Netflix Tech blog has some articles about the complexities of timed text (captions and subtitles). Here are a couple of them:

Hopefully that gives a good jumping off point.

KarthikRIyer commented 4 years ago

Thanks @reinecke ! I'll go through these links.

KarthikRIyer commented 4 years ago

Here's what I thought we could start with, based on what I understood from TTML2 and with how I've used captions in Premiere Pro. Please let me know any suggestions/changes/improvements.

[Image: "subs", the proposed subtitle classes]

meshula commented 3 years ago

Referring to styles is pretty conventional across the board, think CSS. If style is stored per caption, to accommodate the way real systems work, a consumer of a TimedText object would have to perform an initial pass to gather all the styles, and de-duplicate them into a dictionary. It seems to me that an internal referencing scheme for OTIO would be a prerequisite to implementing TimedText in a useful way. As a suggestion, such a referencing scheme could be as simple as saying that there is a concept of named dictionaries on the root object, and the style field in a TimedText refers to an object at e.g. @dict/styles/Japanese_bold. Introducing a root dictionary would open a can of worms of questionable usages, so I think we'd have to be very cautious about unintended consequences. At the same time, style per TimedText seems unworkable IMO.
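
The initial gather-and-de-duplicate pass described above might look roughly like this, assuming each caption carries its style inline as a flat dict of hashable values. The `style_N` names and `dedupe_styles` are a made-up scheme for illustration; the `@dict/styles/...` reference format follows the suggestion in this comment.

```python
from typing import Dict, List, Tuple

def dedupe_styles(captions: List[dict]) -> Tuple[Dict[str, dict], List[dict]]:
    """Gather inline per-caption styles into one named dictionary and replace
    each inline style with a reference string (hypothetical naming scheme)."""
    styles: Dict[str, dict] = {}
    names_by_key: Dict[tuple, str] = {}  # hashable form of a style -> its name
    result = []
    for caption in captions:
        updated = dict(caption)  # leave the input untouched
        style = caption.get("style")
        if style is not None:
            key = tuple(sorted(style.items()))  # same content, same key
            if key not in names_by_key:
                name = f"style_{len(names_by_key)}"
                names_by_key[key] = name
                styles[name] = dict(style)
            updated["style"] = f"@dict/styles/{names_by_key[key]}"
        result.append(updated)
    return styles, result
```

Every consumer would need to repeat this pass if styles were stored per caption, which is the argument for storing the references in the first place.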

KarthikRIyer commented 3 years ago

The root dictionary suggestion seems similar to: "One other thing I thought of was to have a list of styles inside Subtitle and then have a string id in each TimedText", right?

I agree that style per TimedText isn't workable. What could be the issues with a root dictionary? If we have something like map<string/long, TimedTextStyle>, would that still have questionable usages?

meshula commented 3 years ago

Perhaps it would make sense to store a map of named styles to style templates as a metadata entry on the Timeline object, with the rule that such a dictionary stored as metadata anywhere else would be ignored.

KarthikRIyer commented 3 years ago

Yeah, I can try this out

KarthikRIyer commented 3 years ago

Note: Sample SRT files with formatting/styles for testing here

KarthikRIyer commented 3 years ago

I was looking into parsing styles from SRT files. Specifically this sample.

SRT text is formatted using HTML-style tags. For one TimedText, it could look like this:

This should be an E with an accent: È
日本語
<font size=30><b><i><u>This text should be bold, italics and underline</u></i></b></font>
<font size=9 color="00ff00">This text should be small and green</font>
<font color=#ff0000 size=9>This text should be small and red</font>
<font color=brown size=24>This text should be big and brown</font>

So I think I'll need to make some changes to the classes I defined earlier. The current TimedText class has one content string and one linked style object. Instead, it could have an array of content strings, each linked to an optional UID. Each UID would correspond to a style in a map stored as a metadata entry on the Timeline object.
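
A rough sketch of the revised class described above, assuming the Timeline's metadata holds a map from style id to style dict. `TextSegment`, `TimedTextSketch`, `resolve_styles`, and the `timed_text_styles` metadata key are hypothetical names; they are not the schema in the WIP PR.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class TextSegment:
    text: str
    style_id: Optional[str] = None  # key into the Timeline-level style map

@dataclass
class TimedTextSketch:
    """Hypothetical: several content strings, each optionally styled."""
    segments: List[TextSegment] = field(default_factory=list)

    def plain_text(self) -> str:
        return "".join(seg.text for seg in self.segments)

def resolve_styles(item: TimedTextSketch, timeline_metadata: dict,
                   key: str = "timed_text_styles") -> List[Tuple[str, Optional[dict]]]:
    """Pair each segment with its style dict from the Timeline's metadata.
    Segments without a style id, or with an unknown id, get None."""
    style_map: Dict[str, dict] = timeline_metadata.get(key, {})
    return [(seg.text, style_map.get(seg.style_id) if seg.style_id else None)
            for seg in item.segments]
```

With this shape, a line like the small green example becomes one segment pointing at a shared style instead of carrying its own copy.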

For starters, I think handling well-defined HTML tags should be OK? There are many cases in the sample file linked above that can be handled but would require some effort, for example:

>
It would be a good thing to
<invalid_tag>hide invalid html tags that are closed and show the text in them</invalid_tag>
<invalid_tag_unclosed>but show un-closed invalid html tags
Show not opened tags</invalid_tag_not_opened>
<
<font color="#00FF00" size="6">This could be the <font size="35">m<font color="#000000">o</font>st</font> difficult thing to implement</font>
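
One way to handle the nested and invalid tags above is a tag stack, sketched here with Python's standard `html.parser.HTMLParser`. It assumes only `<b>`, `<i>`, `<u>`, and `<font>` carry styling; any other tag is dropped while its text is kept, and a closing tag without a matching opener is ignored. `SrtStyleParser` and `parse_runs` are hypothetical names, and the sketch does not cover every case in the sample file (for instance, it hides unclosed invalid tags rather than showing them as literal text).

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple

KNOWN_TAGS = {"b", "i", "u", "font"}

class SrtStyleParser(HTMLParser):
    """Split SRT-style markup into (text, style) runs using a tag stack."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack: List[Tuple[str, Dict[str, str]]] = []
        self.runs: List[Tuple[str, Dict[str, str]]] = []

    def _current_style(self) -> Dict[str, str]:
        style: Dict[str, str] = {}
        for tag, attrs in self.stack:  # inner tags override outer ones
            if tag == "font":
                style.update(attrs)
            else:
                style[tag] = "true"
        return style

    def handle_starttag(self, tag, attrs):
        if tag in KNOWN_TAGS:  # unknown tags are hidden, their text kept
            self.stack.append((tag, {k: v for k, v in attrs if v is not None}))

    def handle_endtag(self, tag):
        # Pop back to the most recent matching opener; ignore stray closers.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data:
            self.runs.append((data, self._current_style()))

def parse_runs(markup: str) -> List[Tuple[str, Dict[str, str]]]:
    parser = SrtStyleParser()
    parser.feed(markup)
    parser.close()
    return parser.runs
```

For the nested `<font>` line above, the inner "o" comes out with `color="#000000"` and `size="35"`, while "st" falls back to `color="#00FF00"` at the same size.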

Laurian commented 3 years ago

I'm interested in reading titles and captions out of FCPX (and maybe simple effects like generators). I could try adding them to the adapter, but I don't know how they should be represented in OTIO.

KarthikRIyer commented 3 years ago

@Laurian There's a WIP PR (#805) that adds a representation in OTIO. I was working on an SRT adapter, but there's work left to be done on the OTIO representation to support styles.

Laurian commented 3 years ago

Cool, I'll use your TimedText.1 schema to represent things and look into how to do the import/export for it in the fcp_xml.py adapter, as I'm quite familiar with the FCPX format. I can map both <title> and <caption> elements to it; even when a <title> points to a generator, the text data is in there, and I guess I can "hide" all the other extra bits in the metadata field for now. (Is there a policy/format for passing adapter-specific metadata into the timeline?)

meshula commented 3 years ago

I'm curious: what does the other metadata from FCP contain?

jminor commented 3 years ago

@Laurian yes, the general guidance is that an adapter should translate to/from the OTIO schema (in this case @KarthikRIyer's proposed TimedText schema) and put anything else interesting into metadata, nested in a clearly labelled sub-dictionary. This makes that metadata visible and invites discussion about what else could/should be promoted into the official schema. You can see more guidance here: https://opentimelineio.readthedocs.io/en/latest/tutorials/otio-file-format-specification.html?highlight=metadata#metadata and here: https://opentimelineio.readthedocs.io/en/latest/tutorials/write-an-adapter.html?highlight=metadata#metadata

meshula commented 3 years ago

This project

https://github.com/naomiaro/waveform-playlist

led me to this -

https://github.com/readbeyond/aeneas

aeneas looks like a gold mine of reference material.

Laurian commented 3 years ago

@meshula For subtitles, FCPX has placement and styling metadata:

<caption name="People assume that time is a strict progression of cause to effect," lane="1" offset="15024/300s" duration="8600/2500s" start="3600s" role="iTT?captionFormat=ITT.en-GB">
  <text placement="top">
    <text-style ref="ts2">People assume that time is a strict progression of cause to effect,</text-style>
  </text>
  <text-style-def id="ts2">
    <text-style font=".AppleSystemUIFont" fontSize="13.01" fontFace="Regular" fontColor="1 0.999974 0.999991 1" backgroundColor="0 0 0 1"/>
  </text-style-def>
</caption>
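
Following the guidance above about promoting what the schema can hold and namespacing the rest, a caption like this could be split roughly as follows. This is a sketch assuming a single `<text>` element with flat `<text-style>` runs; `caption_to_dict`, `fcpx_time`, and the `fcpx_xml` metadata key are hypothetical. FCPX expresses times as rational seconds (`15024/300s`), which `fractions.Fraction` parses exactly once the trailing `s` is removed.

```python
import xml.etree.ElementTree as ET
from fractions import Fraction

def fcpx_time(value: str) -> Fraction:
    """FCPX times are rational seconds such as "15024/300s" or "3600s"."""
    return Fraction(value.rstrip("s"))

def caption_to_dict(xml_text: str) -> dict:
    el = ET.fromstring(xml_text)
    styles = {}
    for style_def in el.findall("text-style-def"):
        inner = style_def.find("text-style")
        styles[style_def.get("id")] = dict(inner.attrib) if inner is not None else {}
    text_el = el.find("text")
    runs = text_el.findall("text-style") if text_el is not None else []
    promoted = {"offset", "duration"}  # attributes mapped onto schema fields
    return {
        "text": "".join(run.text or "" for run in runs),
        "offset": fcpx_time(el.get("offset", "0s")),
        "duration": fcpx_time(el.get("duration", "0s")),
        "metadata": {
            "fcpx_xml": {  # hypothetical, clearly labelled namespace
                "element": el.tag,
                "attributes": {k: v for k, v in el.attrib.items() if k not in promoted},
                "placement": text_el.get("placement") if text_el is not None else None,
                "styles": styles,
                "style_refs": [run.get("ref") for run in runs],
            }
        },
    }
```

For the caption above, `offset` comes out as `Fraction(1252, 25)` (50.08 s) and `duration` as `Fraction(86, 25)` (3.44 s).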

Other titles you can add look similar:

<title name="Continuous" lane="2" offset="123500/2500s" ref="r6" duration="20100/2500s" start="3600s">
  <text>
    <text-style ref="ts1">Title</text-style>
  </text>
  <text-style-def id="ts1">
    <text-style font="Helvetica" fontSize="72" fontFace="Regular" fontColor="1 0.999974 0.999991 1" strokeColor="0.985948 0 0.0269506 0" strokeWidth="1" alignment="center"/>
  </text-style-def>
</title>

where ref="r6" points to the Apple Motion effect <effect id="r6" name="Continuous" uid=".../Titles.localized/Build In:Out.localized/Continuous.localized/Continuous.moti"/>

Similarly custom titles can have a lot of parameters (here's a BBC News caption one):

<title name="People assume that time is a strict progression of cause to effect, - 02 Subtitle" lane="1" offset="125200/2500s" ref="r7" duration="13209600/3840000s" start="3600s">
    <param name="Layout Method" key="9999/10201/3000298778/10202/2/314" value="1 (Paragraph)"/>
    <param name="Left Margin" key="9999/10201/3000298778/10202/2/323" value="0"/>
    <param name="Right Margin" key="9999/10201/3000298778/10202/2/324" value="0"/>
    <param name="Top Margin" key="9999/10201/3000298778/10202/2/325" value="0"/>
    <param name="Bottom Margin" key="9999/10201/3000298778/10202/2/326" value="-540"/>
    <param name="Alignment" key="9999/10201/3000298778/10202/2/354/10038/401" value="1 (Center)"/>
    <param name="Line Spacing" key="9999/10201/3000298778/10202/2/354/10038/404" value="-14"/>
    <param name="Alignment" key="9999/10201/3000298778/10202/2/373" value="0 (Left) 2 (Bottom)"/>
    <param name="Source Object" key="9999/10201/3000298778/10202/4/3000449521/201" value="3000449347"/>
    <param name="Scale" key="9999/10201/3000298778/10202/4/3000449521/204" value="-0.0925926"/>
    <param name="Apply Mode" key="9999/10201/3000298778/10202/4/3000450074/200" value="1 (Multiply by source)"/>
    <param name="Source Object" key="9999/10201/3000298778/10202/4/3000450074/201" value="3000450130"/>
    <param name="Scale" key="9999/10201/3000298778/10202/4/3000450074/204" value="10"/>
    <param name="Opacity" key="9999/10201/3000298778/10202/4/3001050732/1000/1044" value="0"/>
    <param name="Speed" key="9999/10201/3000298778/10202/4/3001050732/201/208" value="6 (Custom)"/>
    <param name="Custom Speed" key="9999/10201/3000298778/10202/4/3001050732/201/209">
        <keyframeAnimation>
            <keyframe time="0s" value="0"/>
            <keyframe time="454656/153600s" value="0"/>
        </keyframeAnimation>
    </param>
    <param name="Range" key="9999/10201/3000298778/10202/4/3001050732/201/229/230" value="6 (Line)"/>
    <param name="End Index" key="9999/10201/3000298778/10202/4/3001050732/201/229/232" value="3"/>
    <param name="Invert" key="9999/10201/3000298778/10202/4/3001050732/201/229/233" value="1"/>
    <text>
        <text-style ref="ts2">People assume that time is a strict progression of cause to effect,</text-style>
    </text>
    <text-style-def id="ts2">
        <text-style font="BBC Reith Sans" fontSize="58" fontFace="Regular" fontColor="1 0.999974 0.999991 1" alignment="center" lineSpacing="-14"/>
    </text-style-def>
</title>

again the ref="r7" will point to the actual Apple Motion file <effect id="r7" name="02 Subtitle" uid="~/Titles.localized/BBC News/B Info/02 Subtitle/02 Subtitle.moti" src="file:///Users/laurian/Movies/Motion%20Templates.localized/Titles.localized/BBC%20News/B%20%20Info/02%20Subtitle/02%20Subtitle.moti"/>

So I would try to preserve that in metadata just in case I need it.
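
Preserving a title's effect reference and parameters in metadata could look something like this sketch, which assumes the `<effect>` resources and the `<title>` live in the same document. Because parameter names repeat in the BBC example above (`Alignment`, `Scale`, `Source Object`), it keys them by their `key` attribute. Keyframed parameters such as `Custom Speed` come out with a `None` value here; their `<keyframeAnimation>` children would need separate handling. `title_metadata` and the `fcpx_xml` key are hypothetical.

```python
import xml.etree.ElementTree as ET

def title_metadata(fcpxml: str, title_name: str) -> dict:
    """Collect the non-portable parts of one FCPX <title> for OTIO metadata."""
    root = ET.fromstring(fcpxml)
    # <effect> resources are referenced by id from a title's ref attribute.
    effects = {e.get("id"): dict(e.attrib) for e in root.iter("effect")}
    for title in root.iter("title"):
        if title.get("name") != title_name:
            continue
        # Param names repeat, so key by the unique "key" attribute instead.
        params = {
            p.get("key"): {"name": p.get("name"), "value": p.get("value")}
            for p in title.findall("param")
        }
        return {"fcpx_xml": {"effect": effects.get(title.get("ref")),
                             "params": params}}
    raise KeyError(title_name)
```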

meshula commented 3 years ago

Thanks ~ what prompted my question was wondering whether the extra data fell in the category of "extra non-portable stuff that should be preserved" or the category of "falls into the styling category". The examples seem to show styling/layout plus some animation parameters.