add korean sentence validation & cleanup

sftblw commented 2 years ago

This adds korean validation and cleanup.

Korean script should only allow korean letters and some special characters [.,?!] (in my opinion).
- There are so many things to be not allowd, so I added the rule as whitelist rule.
- For responsive message I added some rules before it, which may be common mistakes.
For sentence length calculation, I replaced it with character length.
- In my opinion, Tokenization is somewhat heavyweight for this task, especially in CJK (Chinese, Japanese, Korean).
- max length 50 is somewhat random, but the famous pengram "키스의 고유조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다" is 35 length so... It would be sufficient.
About normalization
- Korean input system has (mainly) two kind of characters (in unicode space), which are interchangeable but have different unicode code point.
- In my opinion both should be allowed to be submitted, but normalized into one kind of codepoint when It comes to actual usage or sentence storing in DB.
- so I added normalization to both of validation and cleanup.

sftblw commented 2 years ago

Thank you for feedback.

About normalization

Normalization was intended for contributors who use certain kind of keyboard layout, called "Sebeolsik".

Korean letters (Hangul) in unicode has two kind of codepoints.

Composed form (Unicode "Hangul Syllables" : U+AC00–U+D7A3)
- One Unicode codepoint contains three or two letters in rectangular shape.
- This codepoints are normally used.
- ex) col is 콜 which is U+CF5C, taking only one codepoint.
Other forms
- Other Unicode codepoints deal korean letters as separated vowels and consonants.
- This takes doubled space in bytes.
- ex) col is 콜 and it is actually three codepoints, ᄏ (1st k) ᅩ (2nd oh) ᆯ (3rd r).
- This only appears when a contributor is using keyboard layout called "Sebeolsik", which is akin to Dvorak, or It appears when it comes to filename of macOS, since macOS normalize filenames with decomposition normalization (NFD).
Two kinds of codepoint can be converted by each with normalization.
- "콜".normalize('NFD').split('') is Array(3) [ "ᄏ", "ᅩ", "ᆯ" ].
- and "콜".split('') is Array [ "콜" ], proving its one and only codepoint.
- [ "ᄏ", "ᅩ", "ᆯ" ].join('').normalize('NFC').split('') is Array [ "콜" ]

It comes clear if the sentence is normalized with composition normalization (NFC) before the sentence is submitted to the database, but I couldn't find existing way to normalize(morph) the sentence before submitted. Cleanup script is added to complement this in other way while supporting minor cases (And I think that is not clear way to do, too.)

So, It might be handled with one of two ways:

Simply prevent contributors to input decomposed code points (only allowing "Hangul Syllables"), Since this make things clear and those kind of codepoints are quite rarely used (in my opinion).
Add normalization before submitting, changing original sentence (Since NFC is non-destructive normalization, it is perfectly fine to do).

(and /[^가-힣.,?! ]/u only allows composed form, It's my mistake.)

Character length

I checked one of "Public for Korean" dataset from aihub.or.kr, - validation set, random 10 sentences (without actually listening them, believing its metadata to be correct)

validation - broadcast set

data no.	len	second
14	121	15.87
29	38	3.84
43	37	4.48
61	37	3.46
458	97	11.01
466	61	7.3
475	73	10.37
738	44	5.5
826	45	5.12
993	52	5.63

( 605 characters / 72.58 second) = 8.33563 character / second

MichaelKohler commented 2 years ago

Thanks for the explanation. I'm now checking if we could instead add NFC somewhere else (even before validation) to make this simpler. I will report back.

HarikalarKutusu commented 2 years ago

max length 50 is somewhat random, but the famous pengram "키스의 고유조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다" is 35 length so... It would be sufficient.

+

8.33563 character / second

Wouldn't it better to increase the max sentence length? Something like 70? Common Voice recordings are limited to 10 sec recordings, mainly optimized for batch processing in 8GB GPU's. If you limit your max-length further, you may be forced to exclude many sentences, which are hard to come by. I'm asking without knowing the language of course...

MichaelKohler commented 2 years ago

I have checked with others and per-language NFC normalization is something we want to support. Therefore I implemented that capability here: https://github.com/common-voice/sentence-collector/commit/5a86a81a6da7533e9571d2411777ef058a4419bb . I've already enabled it for Korean, however it's not on the live website yet. I will create a new release once we have the validation here as well.

This means:

You do not have to normalize before counting the characters, the sentence passed into the function will already be normalized
The regex rules will be applied on the normalized sentence
We do not need the cleanup changes

Would be great if you could update your branch to be based on the latest main branch, and test out my addition and whether this works correctly for your validation as well.

Happy to provide any more context if needed and thanks for the suggestion!

Wouldn't it better to increase the max sentence length? Something like 70? Common Voice recordings are limited to 10 sec recordings, mainly optimized for batch processing in 8GB GPU's. If you limit your max-length further, you may be forced to exclude many sentences, which are hard to come by. I'm asking without knowing the language of course...

Not by too much though. 70 would mean roughly 8.5 seconds on average. If somebody reads/speaks 20% slower than average, then it would already hit the 10 seconds hard limit.

sftblw commented 2 years ago

Sorry, This PR close was mistake; I'm not familiar with forked Github repository and pressed the "sync" with discard button. I'll reopen after some modification.

sftblw commented 2 years ago

I made a new commit (because of the mistake)
Changed error message to Korean language (If I understood the recommendation correctly)
Removed some validations but left/modified some. Last validation will cover above two validation, but I left them to guide contributors.
- ㅋㅋㅋ means lol (laught sound) and those characters are composed form but only with 1st letter of three (or two).
- Non-composed codepoints will not be covered with [ㄱ-ㅎㅏ-ㅣ], since those codepoints are combined form. so Second one is yet needed (to guide contributors)
- If we don't need guidance, that two validation rule is not required.

MichaelKohler commented 2 years ago

:tada: This PR is included in version 2.18.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket:

common-voice / sentence-collector

add korean sentence validation & cleanup #630

About normalization

Character length