echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

Different subtitle outputs with CLI commands #33

Open GalenMarek14 opened 6 months ago

GalenMarek14 commented 6 months ago

Is there a way to edit subtitle outputs with CLI commands? It would be very good to have formats like this:

1
00:00:00,030 --> 00:00:00,070
<font color="#00ff00">The</font> first sentence.

2
00:00:00,070 --> 00:00:00,080
The first sentence.

3
00:00:00,080 --> 00:00:00,450
The <font color="#00ff00">first</font> sentence.

4
00:00:00,450 --> 00:00:00,530
The first sentence.

5
00:00:00,530 --> 00:00:01,100
The first <font color="#00ff00">sentence</font>.

6
00:00:01,740 --> 00:00:01,780
<font color="#00ff00">The</font> second sentence.

7
00:00:01,780 --> 00:00:01,800
The second sentence.

8
00:00:01,800 --> 00:00:02,250
The <font color="#00ff00">second</font> sentence.

9
00:00:02,250 --> 00:00:02,260
The second sentence.

10
00:00:02,260 --> 00:00:02,800
The second <font color="#00ff00">sentence</font>.

So far, the only method I can think of is converting the JSON files, but that's a bit hard for me as a non-coder.

rotemdan commented 4 months ago

There's no standardized format (that I know of) for word-level subtitles, unfortunately.

YouTube's auto-generated subtitles internally use a custom JSON format like:

    "events": [
        {
            "tStartMs": 0,
            "dDurationMs": 502120,
            "id": 1,
            "wpWinPosId": 1,
            "wsWinStyleId": 1
        },
        {
            "tStartMs": 120,
            "dDurationMs": 7239,
            "wWinId": 1,
            "segs": [
                {
                    "utf8": "great",
                    "acAsrConf": 0
                },
                {
                    "utf8": " paper",
                    "tOffsetMs": 400,
                    "acAsrConf": 0
                },
                {
                    "utf8": " today",
                    "tOffsetMs": 760,
                    "acAsrConf": 0
                },
                {
                    "utf8": " fellow",
                    "tOffsetMs": 1240,
                    "acAsrConf": 0
                },
                {
                    "utf8": " Scholars",
                    "tOffsetMs": 1640,
                    "acAsrConf": 0
                },
                {
                    "utf8": " stable",
                    "tOffsetMs": 2519,
                    "acAsrConf": 0
                }
            ]
        },
        {
            "tStartMs": 3149,
            "dDurationMs": 4210,
            "wWinId": 1,
            "aAppend": 1,
            "segs": [
                {
                    "utf8": "\n"
                }
            ]
        },
        {
            "tStartMs": 3159,
            "dDurationMs": 6841,
            "wWinId": 1,
            "segs": [
                {
                    "utf8": "diffusion",
                    "acAsrConf": 0
                },
                {
                    "utf8": " XL",
                    "tOffsetMs": 800,
                    "acAsrConf": 0
                },
                {
                    "utf8": " turbo",
                    "tOffsetMs": 1761,
                    "acAsrConf": 0
                },
                {
                    "utf8": " why",
                    "tOffsetMs": 2761,
                    "acAsrConf": 0
                },
                {
                    "utf8": " well",
                    "tOffsetMs": 3441,
                    "acAsrConf": 0
                },
                {
                    "utf8": " because",
                    "tOffsetMs": 3881,
                    "acAsrConf": 0
                }
            ]
        },
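As a rough sketch of how the sample above could be read: each word's absolute start time appears to be the event's `tStartMs` plus the segment's `tOffsetMs` (for example, 120 + 400 = 520 ms for " paper"). The field names are copied from the sample; the interpretation of `tOffsetMs`, the skipping of `aAppend` events, and the names `eventsToWords` and `TimedWord` are assumptions made for illustration, not a documented YouTube API.

```typescript
// Field names are taken from the sample above. Treating tOffsetMs as an
// offset from the event's tStartMs is an assumption, not documented behavior.
interface Seg {
  utf8: string;
  tOffsetMs?: number;
}

interface CaptionEvent {
  tStartMs: number;
  dDurationMs?: number;
  aAppend?: number;
  segs?: Seg[];
}

// Hypothetical output shape: one entry per word with an absolute start time.
interface TimedWord {
  text: string;
  startMs: number;
}

function eventsToWords(events: CaptionEvent[]): TimedWord[] {
  const words: TimedWord[] = [];
  for (const event of events) {
    // Skip events without segments and the newline-only "append" events.
    if (!event.segs || event.aAppend === 1) continue;
    for (const seg of event.segs) {
      const text = seg.utf8.trim();
      if (text.length === 0) continue;
      words.push({ text, startMs: event.tStartMs + (seg.tOffsetMs ?? 0) });
    }
  }
  return words;
}

// Usage with the first few segments of the sample above.
const sampleEvents: CaptionEvent[] = [
  {
    tStartMs: 120,
    dDurationMs: 7239,
    segs: [
      { utf8: "great" },
      { utf8: " paper", tOffsetMs: 400 },
      { utf8: " today", tOffsetMs: 760 },
    ],
  },
];
console.log(eventsToWords(sampleEvents)); // start times: 120, 520, 880
```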

They also extend the VTT subtitle format with special word-timestamp tags:

WEBVTT
Kind: captions
Language: en

00:00:00.120 --> 00:00:03.149 align:start position:0%

great<00:00:00.520><c> paper</c><00:00:00.880><c> today</c><00:00:01.360><c> fellow</c><00:00:01.760><c> Scholars</c><00:00:02.639><c> stable</c>

00:00:03.149 --> 00:00:03.159 align:start position:0%
great paper today fellow Scholars stable

00:00:03.159 --> 00:00:07.349 align:start position:0%
great paper today fellow Scholars stable
diffusion<00:00:03.959><c> XL</c><00:00:04.920><c> turbo</c><00:00:05.920><c> why</c><00:00:06.600><c> well</c><00:00:07.040><c> because</c>

00:00:07.349 --> 00:00:07.359 align:start position:0%
diffusion XL turbo why well because
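For anyone who wants to experiment with these files, a cue line with inline timestamp tags could be split into timed words roughly like this. This is a minimal sketch based only on the sample above: it assumes the first word starts at the cue's own start time and that each `<hh:mm:ss.mmm>` tag gives the start time of the text that follows it. The names `parseCueLine` and `CueWord` are made up for this example.

```typescript
// Hypothetical output shape: one entry per word with an absolute start time.
interface CueWord {
  text: string;
  startMs: number;
}

// Splits a line such as
//   great<00:00:00.520><c> paper</c><00:00:00.880><c> today</c>
// into words. cueStartMs is the start time of the enclosing cue.
function parseCueLine(line: string, cueStartMs: number): CueWord[] {
  // Drop the <c>...</c> wrappers; only the timestamp tags carry timing.
  const stripped = line.replace(/<\/?c>/g, "");
  // Splitting on a regex with capture groups keeps the captured parts, so
  // the array looks like: [text, hh, mm, ss, mmm, text, hh, mm, ss, mmm, text, ...]
  const parts = stripped.split(/<(\d{2}):(\d{2}):(\d{2})\.(\d{3})>/);
  const words: CueWord[] = [];
  let currentMs = cueStartMs;
  for (let i = 0; i < parts.length; i += 5) {
    if (i > 0) {
      const [h, m, s, ms] = parts.slice(i - 4, i).map(Number);
      currentMs = ((h * 60 + m) * 60 + s) * 1000 + ms;
    }
    const text = parts[i].trim();
    if (text.length > 0) {
      words.push({ text, startMs: currentMs });
    }
  }
  return words;
}

// Usage with part of the first cue from the sample (cue starts at 120 ms).
console.log(
  parseCueLine("great<00:00:00.520><c> paper</c><00:00:00.880><c> today</c>", 120)
);
```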

These are internal formats, which I fetched using a special downloader like youtube-dl; they are otherwise not publicly accessible.

I don't know of any software that actually supports viewing these formats, so I'm not sure what the benefit of supporting or imitating them would be. Echogarden could support reading and converting them in the future, but since they can only be fetched with special downloaders and not through the official YouTube API, implementing this is currently a low priority.

The JSON format produced by Echogarden contains a lot of extra linguistic information, like phonetic pronunciation and sub-word timing, and also includes word offsets to the original raw text.
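In the meantime, the highlighted SRT from the original question could be generated from that JSON with a small script. The sketch below is not part of Echogarden. It assumes the timeline is an array of sentence entries, each with `text`, `startTime` and `endTime` in seconds and a nested `timeline` of `word` entries; check this against an actual output file before relying on it. The function names (`formatSrtTime`, `sentenceToCues`, `toSrt`) are hypothetical.

```typescript
// Assumed shape of a timeline entry; verify against a real Echogarden JSON file.
interface TimelineEntry {
  type: string;
  text: string;
  startTime: number; // seconds (assumption)
  endTime: number; // seconds (assumption)
  timeline?: TimelineEntry[];
}

interface Cue {
  start: number;
  end: number;
  text: string;
}

// Formats seconds as an SRT timestamp: HH:MM:SS,mmm
function formatSrtTime(seconds: number): string {
  const totalMs = Math.round(seconds * 1000);
  const pad = (n: number, width: number): string => {
    let s = String(n);
    while (s.length < width) s = "0" + s;
    return s;
  };
  const ms = totalMs % 1000;
  const s = Math.floor(totalMs / 1000) % 60;
  const m = Math.floor(totalMs / 60000) % 60;
  const h = Math.floor(totalMs / 3600000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms, 3)}`;
}

// One cue per word with that word highlighted, plus a plain cue for each
// gap between consecutive words, as in the requested output.
function sentenceToCues(sentence: TimelineEntry, color = "#00ff00"): Cue[] {
  const cues: Cue[] = [];
  let cursor = 0; // search position inside the sentence text
  let previousEnd: number | undefined;
  for (const word of sentence.timeline ?? []) {
    if (word.type !== "word") continue;
    const index = sentence.text.indexOf(word.text, cursor);
    if (index < 0) continue; // word text not found in the sentence; skip it
    if (previousEnd !== undefined && word.startTime > previousEnd) {
      cues.push({ start: previousEnd, end: word.startTime, text: sentence.text });
    }
    const before = sentence.text.slice(0, index);
    const after = sentence.text.slice(index + word.text.length);
    cues.push({
      start: word.startTime,
      end: word.endTime,
      text: `${before}<font color="${color}">${word.text}</font>${after}`,
    });
    cursor = index + word.text.length;
    previousEnd = word.endTime;
  }
  return cues;
}

function toSrt(sentences: TimelineEntry[]): string {
  const cues: Cue[] = [];
  for (const sentence of sentences) {
    cues.push(...sentenceToCues(sentence));
  }
  return cues
    .map((cue, i) => `${i + 1}\n${formatSrtTime(cue.start)} --> ${formatSrtTime(cue.end)}\n${cue.text}\n`)
    .join("\n");
}

// Usage with the first sentence from the example in the question.
const exampleSentence: TimelineEntry = {
  type: "sentence",
  text: "The first sentence.",
  startTime: 0.03,
  endTime: 1.1,
  timeline: [
    { type: "word", text: "The", startTime: 0.03, endTime: 0.07 },
    { type: "word", text: "first", startTime: 0.08, endTime: 0.45 },
    { type: "word", text: "sentence", startTime: 0.53, endTime: 1.1 },
  ],
};
console.log(toSrt([exampleSentence]));
```

Searching for each word from a running cursor keeps punctuation such as the final period outside the highlight, which matches the requested format. Repeated words in a sentence are handled in order of appearance.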