Support multiple languages in TTML

NhanNguyen700 commented 1 month ago

Hi,

This is valid in TTML:

<?xml version="1.0" encoding="UTF-8"?>
<tt xml:lang="en" xmlns="http://www.w3.org/2006/10/ttaf1" xmlns:tts="http://www.w3.org/2006/04/ttaf1#styling">
  <head>
    <metadata xmlns:ttm="http://www.w3.org/2006/10/ttaf1#metadata">
      <ttm:copyright>TVB (c)</ttm:copyright>
    </metadata>
    <styling>
      <style id="1" tts:textAlign="center" tts:color="transparent" tts:fontFamily="Verdana" tts:wrapOption="wrap" />
    </styling>
  </head>
  <body>
    <div xml:id="captions" xml:lang="eng">
      <p begin="00:01:58:040" end="00:01:59:920">eng text</p>
    </div>
    <div xml:id="captions" xml:lang="zho">
      <p begin="00:02:09:760" end="00:02:11:280">zho text</p>
    </div>
  </body>
</tt>

After parsing it and writing it back out as TTML, I expected to still have two div tags with different languages, but I get this:

<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:tts="http://www.w3.org/ns/ttml#styling">
    <head>
        <metadata>
            <ttm:copyright>TVB (c)</ttm:copyright>
        </metadata>
        <styling>
            <style xml:id="1" tts:color="transparent" tts:fontFamily="Verdana" tts:textAlign="center" tts:wrapOption="wrap"></style>
        </styling>
        <layout></layout>
    </head>
    <body>
        <div>
            <p begin="00:01:58.000" end="00:01:59.000">
                <span>eng text</span>
            </p>
            <p begin="00:02:09.000" end="00:02:11.000">
                <span>zho text</span>
            </p>
        </div>
    </body>
</tt>

The languages are gone, and the texts are merged into a single div tag.

I am looking for a way to fix this, but with the current structure of the lib, it is hard to do without breaking anything.

asticode commented 1 month ago

I have the feeling that adding a Language attribute to Item would do the trick, but I may be missing something 🤔 On reading the TTML, the Language attribute of each item would be set accordingly. On writing, we could either repeat the xml:lang attribute for each item (which would be simpler in the code), or add separate divs if we detect an item with a language 🤔
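
To make the "separate divs" option concrete, here is a minimal sketch in Go. It does not use the library's real types: `Item`, `Div`, and `GroupByLanguage` are hypothetical stand-ins invented for illustration, and the `Language` field is the proposed addition, not something that exists today. It assumes each item already carries the language read from its enclosing div.

```go
package main

// Item is a hypothetical, trimmed-down stand-in for a subtitle item.
// Language is the proposed new field; an empty value means the item
// inherits the document-level language.
type Item struct {
	Language string
	Text     string
}

// Div is a hypothetical representation of one <div> to write on output.
type Div struct {
	Language string
	Items    []Item
}

// GroupByLanguage collects items into one Div per language.
// Divs are ordered by the first appearance of their language, and
// items keep their relative order inside each Div.
func GroupByLanguage(items []Item) []Div {
	var divs []Div
	index := map[string]int{} // language -> position in divs
	for _, it := range items {
		i, ok := index[it.Language]
		if !ok {
			i = len(divs)
			index[it.Language] = i
			divs = append(divs, Div{Language: it.Language})
		}
		divs[i].Items = append(divs[i].Items, it)
	}
	return divs
}
```

With the two items from the sample file, this would yield two divs ("eng" and "zho"), which a writer could then emit with one xml:lang attribute each. Items without a language would end up together in a div with an empty Language, which the writer could emit without the attribute.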

NhanNguyen700 commented 4 weeks ago

Every parser returns its result as a single Subtitles object, and that object has only one master language, so we cannot know which Items belong to which language. That is why this happens. If we want to fix it, we will break the structure of the Subtitles object and affect users of this library: they will need to change their code to adapt to the new structure. There is a way to fix the issue by storing the language on each Item, just as you said, but it sounds inefficient. Still, it seems to be the only way with the current structure.

That raises another question: when converting a multi-language TTML file to WebVTT (or other formats), should we output multiple WebVTT files? I think yes, and we should append the language to the name of each WebVTT file to tell them apart.
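
As an illustration of the file-naming idea, here is a small sketch in Go. `LanguageFilename` is a hypothetical helper, not part of the library, and the "name.lang.ext" pattern is only one possible convention.

```go
package main

import (
	"path/filepath"
	"strings"
)

// LanguageFilename inserts lang before the file extension, so that each
// language of a multi-language input gets its own output file.
// An empty lang returns the path unchanged.
func LanguageFilename(path, lang string) string {
	if lang == "" {
		return path
	}
	ext := filepath.Ext(path) // e.g. ".vtt", or "" when there is no extension
	return strings.TrimSuffix(path, ext) + "." + lang + ext
}
```

For example, `LanguageFilename("movie.vtt", "zho")` returns `"movie.zho.vtt"`, while a single-language file with no language set keeps its original name.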

asticode commented 4 weeks ago

If we want to fix it, we will break the Subtitles object structures and affect user of this library, they need to change their code to adapt with new structure

Which changes are you thinking about? 🤔

NhanNguyen700 commented 4 weeks ago

Nothing special: just do not store Items directly in Subtitles. We could have some kind of Wrapper that stores the metadata (including the language) together with its Items, and Subtitles would then contain those Wrappers. Another way is to return a list of Subtitles objects, one per language, when parsing the input, instead of returning a single object as we do currently. Those are my thoughts.
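
Both alternatives can be sketched together in Go. The types below are simplified, hypothetical stand-ins and do not match the library's current definitions. `Track` plays the role of the Wrapper, and `SplitByLanguage` shows how the list-of-Subtitles variant could be derived from it.

```go
package main

// Item is a hypothetical, trimmed-down subtitle item.
type Item struct {
	Text string
}

// Metadata is a hypothetical, trimmed-down metadata holder.
type Metadata struct {
	Language string
}

// Track is the hypothetical "Wrapper": the items of one language
// together with the metadata that applies to them.
type Track struct {
	Metadata Metadata
	Items    []Item
}

// Subtitles holds one Track per language instead of a flat item list.
type Subtitles struct {
	Tracks []Track
}

// SplitByLanguage returns one single-track Subtitles per Track, which
// corresponds to the "list of Subtitles objects" alternative.
func (s Subtitles) SplitByLanguage() []Subtitles {
	out := make([]Subtitles, 0, len(s.Tracks))
	for _, t := range s.Tracks {
		out = append(out, Subtitles{Tracks: []Track{t}})
	}
	return out
}
```

Either shape changes how callers reach the items, which is the breaking change mentioned above. A single-language file would simply produce one Track.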

asticode / go-astisub

Support multiple languages in TTML #107