echogarden-project / echogarden

Easy-to-use speech toolset. Written in TypeScript. Includes tools for synthesis, recognition, alignment, speech translation, language detection, source separation and more.
GNU General Public License v3.0

Rules for splitting sentences on punctuation characters in subtitles #19

Open rotemdan opened 1 year ago

rotemdan commented 1 year ago

Hi, I have a suggestion. When using the transcript alignment function, could the program try to split sentences at commas and periods as much as possible? Sometimes a sentence doesn't reach the character limit, but it is still divided into two subtitles by the program. BTW, the transcript is Chinese.

Originally posted by @Tendaliu in https://github.com/echogarden-project/echogarden/issues/18#issuecomment-1664935898

rotemdan commented 1 year ago

I'm not 100% sure what you are asking about:

If you are asking for line splitting to occur more frequently when punctuation is encountered, you can use this setting:

subtitles.minWordsInLine: minimum number of words in a line, such that a line break can be added. Defaults to 4

The default setting of 4 means that, given that the next word doesn't cause the line character limit to be exceeded, a line break would only be added if there are at least 4 words in the current line, even if a phrase break character, like a comma or period, is encountered.

If you lower this value to 1, a line break would be added every time a comma or period character is encountered (given there is at least one word before it).

If you increase this value to a large number, say 1000, line breaks would only be added when the next word would exceed the maximum characters for a line (defined in subtitles.maxLineWidth); phrase break characters would be ignored.

(Note: subtitle rules are not only relevant for alignment but also for speech synthesis and recognition.)
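
For example, an alignment invocation along these lines (just a sketch; the file names and option values are placeholders) would break lines much more eagerly at punctuation:

echogarden align audio.mp3 transcript.txt subtitles.srt --subtitles.minWordsInLine=1 --subtitles.maxLineWidth=42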

Tendaliu commented 1 year ago

Forgive my confusing expression, what I meant is the situation like this:

[screenshot: 微信截图_20230804145548]

rotemdan commented 1 year ago

The code that decides whether to break after a word currently uses this logic:

if (isLastWord ||
    lineLengthWithNextWordExceedsMaxLineWidth ||
    (remainingTextExceedsMaxLineWidth &&
    lineLengthExceedsHalfMaxLineWidth &&
    (wordsRemainingAreEqualOrLessToMinimumWordsInLine || followingSubstringIsPhraseSeparator))) {
    // Split..
}
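
To make those flag names a bit more concrete, here is a rough sketch (not the actual implementation, just an approximation of the inputs involved) of how these conditions could be computed while a line is being built up word by word:

// Rough sketch only: approximates the inputs behind the conditions above.
// 'lineText' is the text accumulated so far in the current line,
// 'words' are the sentence's words and 'wordIndex' is the word just added.
function shouldBreakAfterWord(
    lineText: string,
    words: string[],
    wordIndex: number,
    maxLineWidth: number,
    minWordsInLine: number,
    phraseSeparators = [',', '、', ',', ';']) {

    const nextWord = words[wordIndex + 1] ?? ''
    const remainingText = words.slice(wordIndex + 1).join('')

    const isLastWord = wordIndex === words.length - 1
    const lineLengthWithNextWordExceedsMaxLineWidth = lineText.length + nextWord.length > maxLineWidth
    const remainingTextExceedsMaxLineWidth = remainingText.length > maxLineWidth
    const lineLengthExceedsHalfMaxLineWidth = lineText.length > maxLineWidth / 2
    const wordsRemainingAreEqualOrLessToMinimumWordsInLine = words.length - 1 - wordIndex <= minWordsInLine
    const followingSubstringIsPhraseSeparator = phraseSeparators.includes(nextWord)

    return isLastWord ||
        lineLengthWithNextWordExceedsMaxLineWidth ||
        (remainingTextExceedsMaxLineWidth &&
            lineLengthExceedsHalfMaxLineWidth &&
            (wordsRemainingAreEqualOrLessToMinimumWordsInLine || followingSubstringIsPhraseSeparator))
}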

Looking at this again, subtitles.minWordsInLine actually does something a little different from what I thought. It sets the minimum number of words that must remain for the next line or cue in order to allow a break or a new cue to be made, since we don't want lines that have only a few words. I should correct the documentation to reflect that and prevent confusion.

Even if you set subtitles.minWordsInLine to 1, based on these rules the part of the cue you highlighted would not be broken into a new line. The reason is that I added an additional rule saying that the line character count must exceed half the maximum line width in order to break at all (unless there's a sentence break). There are other factors too; overall it's a complex decision process that takes into account the number of remaining words in the sentence and the existence of clause break characters.

Anyway, the goal was to make subtitles that resemble what humans would do (like what you see on human-edited subtitles on YouTube), and to prevent very short cues that would be hard to read.

The subtitle conversion was not originally designed to format alignment results that are meant for a more technical style, whose goal is, say, to simply group several words in a cue. It was meant to produce subtitles for use in videos.

I can add more modes and rules to the subtitle generation, but these rules would produce subtitles that are not that useful for the average user, and are not easy to read on the screen.

I can also add clause entries to the JSON timeline (actually, they already exist, but I remove them to reduce the complexity of the timeline).

If you had clause entries in the timeline, meaning sentence entries would contain a nested timeline of clauses (phrases), would that be useful for your needs?

Edit: I had an error in the text above. I forgot the word "half", which is important:

The reason is that I added an additional rule that says that the line character count must exceed half the maximum line width in order to break at all

Tendaliu commented 1 year ago

I apologize for not being clear again. In Chinese, "他要求打开的就是我的插件里面的," should be considered one cue. That is to say, a reasonable subtitle format should be like this:

11
00:00:25,556 --> 00:00:28,193
他要求打开的就是我的插件里面的,

12
00:00:28,193 --> 00:00:30,474
一个重要程序。

However, "的," has been placed into the next line of subtitles.

rotemdan commented 1 year ago

Thanks for clarifying. If that's the case, that may not be a problem related to subtitle generation at all!

It may have to do with the library and algorithm I use to segment Chinese text to words.

The usual library I use for segmentation, for almost all languages, is cldr-segmentation, but it cannot handle Chinese text. It simply splits it by spaces and break characters.

So, later during development, I added the use of a custom package called jieba-wasm, only for Chinese. (I also added a different package for Japanese)

jieba (in the mode I use) doesn't actually provide a final segmentation, but a set of possible split points. This made it very hard for me to use, since I have no ability to read any Chinese.

So I asked ChatGPT for assistance.

Together, we came up with this algorithm for deriving a final segmentation:

export async function splitChineseTextToWords_Jieba(text: string, fineGrained = true, useHMM = true) {
    const jieba = await getWasmInstance()

    if (!fineGrained) {
        // Coarse-grained mode: use jieba's standard segmentation directly
        return jieba.cut(text, useHMM)
    } else {
        // Fine-grained mode: tokenize in 'search' mode, which returns overlapping
        // candidate words, each with start and end character offsets
        const results = jieba.tokenize(text, "search", useHMM)

        // Collect the distinct start and end offsets of all candidate words, sorted ascending
        const startOffsetsSet = new Set<number>()
        const endOffsetsSet = new Set<number>()

        for (const result of results) {
            startOffsetsSet.add(result.start)
            endOffsetsSet.add(result.end)
        }

        const startOffsets = Array.from(startOffsetsSet)
        startOffsets.sort((a, b) => a - b)

        const endOffsets = Array.from(endOffsetsSet)
        endOffsets.sort((a, b) => a - b)

        const words: string[] = []

        // For each start offset, guess a reasonable end offset for the word that begins there
        for (let i = 0; i < startOffsets.length; i++) {
            const wordStartOffset = startOffsets[i]

            function getWordEndOffset() {
                if (i < startOffsets.length - 1) {
                    const nextWordStartOffset = startOffsets[i + 1]

                    for (let j = 0; j < endOffsets.length - 1; j++) {
                        const currentEndOffset = endOffsets[j]
                        const nextEndOffset = endOffsets[j + 1]

                        if (currentEndOffset >= nextWordStartOffset) {
                            // End the word where the next word starts
                            return nextWordStartOffset
                        } else if (
                            currentEndOffset > wordStartOffset &&
                            currentEndOffset < nextWordStartOffset &&
                            nextEndOffset > nextWordStartOffset) {

                            // End the word at the last candidate end offset
                            // that still precedes the next word's start
                            return currentEndOffset
                        }
                    }
                }

                // Fallback: extend the word to the final candidate end offset
                return endOffsets[endOffsets.length - 1]
            }

            const wordEndOffset = getWordEndOffset()

            words.push(text.substring(wordStartOffset, wordEndOffset))
        }

        return words
    }
}

You can see the original source code in the repository.
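
For reference, calling it looks roughly like this (just a usage sketch of the function shown above):

// Usage sketch for the function above
const fineGrainedWords = await splitChineseTextToWords_Jieba('中华人民共和国武汉市长江大桥')
const coarseWords = await splitChineseTextToWords_Jieba('中华人民共和国武汉市长江大桥', false)

console.log(fineGrainedWords) // words derived from the overlapping 'search' tokenization
console.log(coarseWords)      // plain jieba.cut segmentation, e.g. [ '中华人民共和国', '武汉市', '长江大桥' ]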

I don't really know if this approach does a good job. Based on the feedback I got from ChatGPT, it was mediocre. I compared it to the output of Microsoft's Azure cloud service and it seemed very different.

Despite the mediocre result, I kept this code for segmenting Chinese, since I figured it was better than nothing.

If you have suggestions on how to improve this, that would be great. Maybe there's a better library, but I couldn't find one (maybe there are projects on GitHub for this that are documented only in Chinese, so I didn't notice them).

rotemdan commented 1 year ago

Maybe you feel that using coarse-grained segmentation would be better than what is done now? Here are examples of the different results, from the Jieba repository readme:

This is the coarse-grained mode:

cut("中华人民共和国武汉市长江大桥", true);
//
[ '中华人民共和国', '武汉市', '长江大桥' ]

This is the result of tokenization that I use in the fine-grained mode:

tokenize("中华人民共和国武汉市长江大桥", "search", true);
//
[
  { word: '中华', start: 0, end: 2 },
  { word: '华人', start: 1, end: 3 },
  { word: '人民', start: 2, end: 4 },
  { word: '共和', start: 4, end: 6 },
  { word: '共和国', start: 4, end: 7 },
  { word: '中华人民共和国', start: 0, end: 7 },
  { word: '武汉', start: 7, end: 9 },
  { word: '武汉市', start: 7, end: 10 },
  { word: '长江', start: 10, end: 12 },
  { word: '大桥', start: 12, end: 14 },
  { word: '长江大桥', start: 10, end: 14 }
]

Then, based on this information, I use the complex algorithm I showed to try to find reasonable split points.

Maybe this idea is bad, and I should just use the simple coarse-grained results above?

Tendaliu commented 1 year ago

In Chinese, punctuation marks such as { , 。 ! ? ” …… }, as well as whitespace and "\n", can all serve as very good bases for splitting subtitles. Chinese sentences are usually quite short, and the number of characters in one cue after splitting (at these punctuation marks) generally won't exceed 20. Moreover, Chinese subtitles usually have only one line per cue.

rotemdan commented 1 year ago

The subtitle generation algorithm is based on reflowing individual words to fit into the maximum characters allowed per line. In many cases it isn't enough just to know the phrase boundaries. There are very frequent situations where a phrase has to be broken in some way; this means that a run of characters like "中华人民共和国武汉市长江大桥" may need to be broken apart.

This is done in every language. The only difference here is that in Chinese and Japanese, individual "words" aren't necessarily separated by spaces, so there may be long runs of characters that can't fit into a single line. Also, one of the goals of providing alignment and timing is to provide it for individual units, not just phrases.

In order to get this, I have to break these runs of characters in some way.

Do you feel that the coarse-grained result [ '中华人民共和国', '武汉市', '长江大桥' ] that Jieba provides via cut is good enough? All it takes for me to switch to this mode by default is a very small code change:

export async function splitChineseTextToWords_Jieba(text: string, fineGrained = true, useHMM = true)

to

export async function splitChineseTextToWords_Jieba(text: string, fineGrained = false, useHMM = true)

Tendaliu commented 1 year ago

Yes, the coarse-grained result is good enough. I think that only if the subtitles, when split at punctuation marks, do not meet the criteria (such as the character limit), should we then consider using coarse-grained or fine-grained segmentation.
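
A minimal sketch of the kind of punctuation-first splitting I mean (just an illustration, not code from the project):

// Illustration: split a Chinese transcript into candidate cues at clause punctuation,
// keeping the punctuation character at the end of each piece.
function splitAtChinesePunctuation(text: string): string[] {
    return text
        .split(/(?<=[,。!?”…\n])/u)
        .map(part => part.trim())
        .filter(part => part.length > 0)
}

// splitAtChinesePunctuation('他要求打开的就是我的插件里面的,一个重要程序。')
// => [ '他要求打开的就是我的插件里面的,', '一个重要程序。' ]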

rotemdan commented 1 year ago

I published version 0.10.5, in which the Chinese segmentation now uses the coarse mode, so you can test how it works.

Note: now, when playing the result in the terminal, large groups of characters will be shown at once as individual words (it may seem worse than before).

Ideally it would be possible to have fine-grained segmentation for previewing and coarse-grained segmentation for the timeline and subtitles, but that is not easy to do at all! (I could try to break down the coarse words further, but it would require applying some algorithm, which would be costly, and some sort of individual alignment to get the timing right. Let me know if you have ideas for that; a naive sketch of the timing part is below.)
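
The most naive idea I can think of for the timing part would be to split a coarse word's time span evenly over its characters, something like this sketch (the entry shape here is simplified, and this would of course not give real per-character timing):

// Naive sketch: split a coarse word's time span evenly over its characters.
// The entry shape is simplified; real timing would need actual alignment.
interface TimedWord {
    text: string
    startTime: number
    endTime: number
}

function splitWordEvenlyByCharacter(word: TimedWord): TimedWord[] {
    const chars = Array.from(word.text)
    const durationPerChar = (word.endTime - word.startTime) / chars.length

    return chars.map((char, index) => ({
        text: char,
        startTime: word.startTime + index * durationPerChar,
        endTime: word.startTime + (index + 1) * durationPerChar,
    }))
}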

A better general solution is simply to find a better library! Please let me know if you can find anything better than Jieba!

rotemdan commented 1 year ago

Here is a part of the original documentation for the Python version of Jieba:

#encoding=utf-8
import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=True)
print("Full Mode: " + "/ ".join(seg_list))  # 全模式

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "/ ".join(seg_list))  # 默认模式

seg_list = jieba.cut("他来到了网易杭研大厦")
print(", ".join(seg_list))

seg_list = jieba.cut_for_search("小明硕士毕业于中国科学院计算所,后在日本京都大学深造")  # 搜索引擎模式
print(", ".join(seg_list))
Output:

[Full Mode]: 我/ 来到/ 北京/ 清华/ 清华大学/ 华大/ 大学

[Accurate Mode]: 我/ 来到/ 北京/ 清华大学

[Unknown Words Recognize] 他, 来到, 了, 网易, 杭研, 大厦    (In this case, "杭研" is not in the dictionary, but is identified by the Viterbi algorithm)

[Search Engine Mode]: 小明, 硕士, 毕业, 于, 中国, 科学, 学院, 科学院, 中国科学院, 计算, 计算所, 后, 在, 日本, 京都, 大学, 日本京都大学, 深造

There is a mode when cut_all=True that is supposed to produce a "full" result, unlike the "default" (coarse or "accurate") result you're getting right now.

I'm using a WASM port of it. It has a method called cut_all, but this method doesn't actually split into words; instead, for "中华人民共和国武汉市长江大桥" it produces a list of overlapping possible words, which I'm not really sure what to do with:

[
  "中",
  "中华",
  "中华人民",
  "中华人民共和国",
  "华",
  "华人",
  "人",
  "人民",
  "人民共和国",
  "民",
  "共",
  "共和",
  "共和国",
  "和",
  "国",
  "武",
  "武汉",
  "武汉市",
  "汉",
  "市",
  "市长",
  "长",
  "长江",
  "长江大桥",
  "江",
  "大",
  "大桥",
  "桥",
]

Tendaliu commented 1 year ago

Jieba can't explain why “的,” was put on the second line, am I right? I mean, “他要求打开的就是我的插件里面的,” complies with the maximum character limit per line requirement.

11
00:00:25,556 --> 00:00:28,193
他要求打开的就是我的插件里面

12
00:00:28,193 --> 00:00:30,474
的,一个重要程序。

rotemdan commented 1 year ago

Currently the subtitle algorithm doesn't split within the individual words given to it. The reason 的 was broken into a new line is that the complex algorithm I made to decide on word boundaries from a list of search terms identified 的 as an individual "unit" (not necessarily a word).

The fine-grained approach I made splits the text into possible search terms:

const results = jieba.tokenize(text, "search", useHMM)

and then applies an algorithm to try to guess word boundaries from them.

In the newer version (0.10.5), using the coarse approach, it segments to:

"他", "要求", "打开", "的", "就是", "我", "的", "插件", "里面", "的"

In this case, it looks like the segmentation is similar to the previous approach. I see that the last word is considered to be "的" again, so it may not seem like an improvement. It could be a weakness of Jieba itself, which is why I'm suggesting trying to find a better library.

Since I don't understand the Chinese language, I don't know exactly why it makes this decision. The algorithm is based on a Hidden Markov Model and uses statistical information to try to choose the best splits.

However, it is better for me to be more conservative and make fewer mistakes, so for now it might be better not to rely on some arbitrary algorithm I made, and to use the direct output of the library instead (if I did understand Chinese, I might be able to come up with a better approach, maybe using the output of cut_all).

Edit: you can see the exact segmentation produced in the .json output file (the timeline). Whenever you have doubts about why the subtitles were made a certain way, you can look at this JSON file to see exactly how words and sentences were segmented.
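
For example, a quick sketch like this could print the word segmentation of each sentence from the timeline file (the entry property names here are assumptions, so double-check them against the actual output):

// Quick sketch: print how each sentence was segmented into words in the .json timeline.
// The property names ('text', 'timeline') are assumptions; verify against the actual file.
import { readFile } from 'fs/promises'

const timeline = JSON.parse(await readFile('output.json', 'utf8'))

for (const sentence of timeline) {
    const wordTexts = (sentence.timeline ?? []).map((word: any) => word.text)
    console.log(wordTexts.join(' | '))
}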

Tendaliu commented 1 year ago

I utilized the --plainText.paragraphBreaks=single setting and replaced "," and "。" with "\n". These modifications gave me the text segmentation I was looking for.
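
Roughly, the preprocessing looks like this (a sketch; the file names and the exact command are just illustrative):

// Sketch of the preprocessing described above:
// turn clause punctuation into line breaks before running the alignment.
import { readFileSync, writeFileSync } from 'fs'

const transcript = readFileSync('transcript.txt', 'utf8')
const preprocessed = transcript.replaceAll(',', '\n').replaceAll('。', '\n')

writeFileSync('transcript-preprocessed.txt', preprocessed)

// Then, for example:
// echogarden align audio.mp3 transcript-preprocessed.txt output.srt --plainText.paragraphBreaks=single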

rotemdan commented 1 year ago

I was thinking of another option to deal with this:

In most languages, when a word is broken and continues on the next line, say, encyclopedia is broken into ency and clopedia, a dash (-) is often added to the end of the line, like in:

I read the ency-
clopedia yesterday

The reason the dash is added is to signal to the reader that the next line continues the last word, and that there is no space or separator between ency and clopedia.

We could possibly do the same when Chinese character groups are broken, like:

他要求打开的就是我的插件里面-
的 一个重要程序。

The question is, I have no idea whether this is commonly done in Chinese subtitles, even in cases where "的" is actually a standalone word.

Someone else will need to fill me in on this. It may be that adding a rule for this, especially for Chinese, would introduce cases where it is not desirable.

Do you have any knowledge about if this is common or a good idea?

Clarification: what I mean is that whenever a break is added between two parts of a character run at a position that was identified by Jieba as a word separator, a dash would be added. I don't necessarily mean I would split these runs at arbitrary positions (unless there is no other choice).

Tendaliu commented 1 year ago

This does not conform to Chinese language habits. Under no circumstances would Chinese use a hyphen (-) to connect two Chinese characters.

rotemdan commented 1 year ago

Alright, thanks for letting me know. So, if I have a run of characters like 他要求打开的就是我的插件里面的 and I must break it somewhere in the middle to ensure that the maximum line width isn't exceeded, do I just break it without adding any signifier symbol?

Currently the subtitle algorithm doesn't deal with long words. For example, in German (which I don't speak), there is a word like "Massenkommunikationsdienstleistungsunternehmen". This word, by itself, may be wider than the line width, and there's no obvious way to segment it. Right now, it is just left to exceed the line (it is made the only word in a line).

In the future I will modify the algorithm to break inside of it, and most likely I'll add a dash to signify the break:

Massenkommunikationsdienstleistung-
sunternehmen

This is common in languages with Latin characters.
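
A simple sketch of what that future rule could look like (hard-breaking an over-long word into line-sized chunks and appending a dash to every chunk except the last):

// Sketch of a possible future rule: hard-break a word that is wider than a line,
// appending a dash to every chunk except the last one.
function hardBreakLongWord(word: string, maxLineWidth: number): string[] {
    if (word.length <= maxLineWidth) {
        return [word]
    }

    const chunks: string[] = []
    let remaining = word

    while (remaining.length > maxLineWidth) {
        chunks.push(remaining.slice(0, maxLineWidth - 1) + '-')
        remaining = remaining.slice(maxLineWidth - 1)
    }

    chunks.push(remaining)
    return chunks
}

// hardBreakLongWord('Massenkommunikationsdienstleistungsunternehmen', 35)
// => [ 'Massenkommunikationsdienstleistung-', 'sunternehmen' ]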

Based on what you are saying, I shouldn't add a dash in Chinese. Okay. Maybe I'll also ask ChatGPT about this when I get to it.

Tendaliu commented 1 year ago

Yes, you can just directly split it. In Chinese, each character is equivalent to an English word. Also, semantically speaking, if you need to split "他要求打开的就是我的插件里面的", it should ideally look like this:

他要求打开的 就是我的插件里面的 or 他要求打开的就是 我的插件里面的