common-voice / sentence-collector

Tool to collect and review sentences for Common Voice
https://commonvoice.mozilla.org/sentence-collector/
Mozilla Public License 2.0

add korean sentence validation & cleanup #630

Closed sftblw closed 2 years ago

sftblw commented 2 years ago

This adds Korean validation and cleanup.

sftblw commented 2 years ago

Thank you for the feedback.

About normalization

Normalization was intended for contributors who use a certain kind of keyboard layout called "Sebeolsik".

Hangul (the Korean script) has two kinds of code points in Unicode: composed syllables (the "Hangul Syllables" block) and decomposed letters (Jamo).

Things become clear if the sentence is normalized with composition normalization (NFC) before it is submitted to the database, but I couldn't find an existing way to normalize (morph) the sentence before submission. The cleanup script was added to compensate for this in another way while supporting minor cases (and I don't think that is a clean way to do it, either).

So it might be handled in one of two ways:

  1. Simply prevent contributors from entering decomposed code points (allowing only "Hangul Syllables"), since this keeps things clear and those code points are quite rarely used (in my opinion).
  2. Add normalization before submitting, which changes the original sentence (since NFC is a non-destructive normalization, this is perfectly fine to do).

(And /[^가-힣.,?! ]/u only allows the composed form; that was my mistake.)
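The two forms and both options can be sketched in JavaScript. This is a minimal illustration, not the collector's actual code: `isComposedOnly` and `normalizeThenValidate` are hypothetical helper names, the regular expression is the one quoted in this comment, and `String.prototype.normalize` is the standard built-in. Decomposed input is simulated here by applying NFD to a composed string.

```javascript
// Pattern quoted in the comment: matches any character outside composed
// Hangul syllables, basic punctuation, and the space character.
const INVALID_CHARS = /[^가-힣.,?! ]/u;

// Option 1 (hypothetical helper): accept only input that is already composed.
function isComposedOnly(sentence) {
  return !INVALID_CHARS.test(sentence);
}

// Option 2 (hypothetical helper): compose with NFC first, then validate.
function normalizeThenValidate(sentence) {
  const normalized = sentence.normalize("NFC");
  return { normalized, valid: !INVALID_CHARS.test(normalized) };
}

// Simulate decomposed (Jamo) input by decomposing a composed string.
const composed = "한글";
const decomposed = composed.normalize("NFD");

console.log(composed.length, decomposed.length);      // 2 6
console.log(isComposedOnly(composed));                // true
console.log(isComposedOnly(decomposed));              // false
console.log(normalizeThenValidate(decomposed).valid); // true
```

Both strings render identically, which is why the rejection under option 1 could be confusing for a contributor unless the error message explains it.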

Character length

I checked one of the "Public for Korean" datasets from aihub.or.kr: 10 random sentences from the validation set (without actually listening to them, trusting the metadata to be correct).

validation - broadcast set

data no.   length (chars)   duration (s)
14         121              15.87
29         38               3.84
43         37               4.48
61         37               3.46
458        97               11.01
466        61               7.3
475        73               10.37
738        44               5.5
826        45               5.12
993        52               5.63

(605 characters / 72.58 seconds) = 8.33563 characters / second
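The figure above can be reproduced with a few lines of JavaScript. The data is copied from the table in this comment; the variable names are only illustrative.

```javascript
// [length in characters, duration in seconds] for the 10 sampled sentences.
const samples = [
  [121, 15.87],
  [38, 3.84],
  [37, 4.48],
  [37, 3.46],
  [97, 11.01],
  [61, 7.3],
  [73, 10.37],
  [44, 5.5],
  [45, 5.12],
  [52, 5.63],
];

const totalChars = samples.reduce((sum, [len]) => sum + len, 0);
const totalSeconds = samples.reduce((sum, [, sec]) => sum + sec, 0);

// Pooled rate: total characters divided by total seconds.
const charsPerSecond = totalChars / totalSeconds;

console.log(totalChars, totalSeconds.toFixed(2), charsPerSecond.toFixed(5));
// 605 72.58 8.33563
```

Note that this is a pooled rate over all ten sentences, not the mean of the per-sentence rates, and ten samples give only a rough estimate.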

MichaelKohler commented 2 years ago

Thanks for the explanation. I'm now checking if we could instead add NFC somewhere else (even before validation) to make this simpler. I will report back.

HarikalarKutusu commented 2 years ago

A max length of 50 is somewhat arbitrary, but the famous pangram "키스의 고유조건은 입술끼리 만나야 하고 특별한 기술은 필요치 않다" is 35 characters long, so it should be sufficient.

+

8.33563 characters / second

Wouldn't it be better to increase the max sentence length? Something like 70? Common Voice recordings are limited to 10 seconds, mainly optimized for batch processing on 8 GB GPUs. If you limit your max length further, you may be forced to exclude many sentences, which are hard to come by. I'm asking without knowing the language, of course...

MichaelKohler commented 2 years ago

I have checked with others, and per-language NFC normalization is something we want to support. Therefore I implemented that capability here: https://github.com/common-voice/sentence-collector/commit/5a86a81a6da7533e9571d2411777ef058a4419bb . I've already enabled it for Korean; however, it's not on the live website yet. I will create a new release once we have the validation here as well.
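As a rough sketch of what "per-language NFC normalization" could look like (this is not the code from the linked commit; `LANGUAGE_SETTINGS` and `prepareSentence` are hypothetical names used only for illustration), normalization can be switched on per language and applied before any validation rule runs:

```javascript
// Hypothetical per-language settings; the real project's structure may differ.
const LANGUAGE_SETTINGS = {
  ko: { normalize: "NFC" },
};

// Apply the configured Unicode normalization form, if any, before validation.
// Languages without a setting are returned unchanged.
function prepareSentence(sentence, languageCode) {
  const form = LANGUAGE_SETTINGS[languageCode]?.normalize;
  return form ? sentence.normalize(form) : sentence;
}

// Decomposed input is composed for Korean and left untouched for other languages.
const decomposedInput = "한글".normalize("NFD");
console.log(prepareSentence(decomposedInput, "ko") === "한글");          // true
console.log(prepareSentence(decomposedInput, "en") === decomposedInput); // true
```

With normalization done up front, a validation pattern that accepts only composed syllables no longer needs a separate cleanup step for decomposed input.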

This means:

It would be great if you could update your branch so that it is based on the latest main branch, and test whether my addition works correctly with your validation as well.

Happy to provide any more context if needed and thanks for the suggestion!

Wouldn't it be better to increase the max sentence length? Something like 70? Common Voice recordings are limited to 10 seconds, mainly optimized for batch processing on 8 GB GPUs. If you limit your max length further, you may be forced to exclude many sentences, which are hard to come by. I'm asking without knowing the language, of course...

Not by too much though. 70 would mean roughly 8.5 seconds on average. If somebody reads/speaks 20% slower than average, then it would already hit the 10 seconds hard limit.
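The estimate can be checked with a short calculation. This sketch assumes the 8.33563 characters/second rate measured earlier in the thread and interprets "20% slower" as reading at 80% of that rate; the function names are illustrative only.

```javascript
const CHARS_PER_SECOND = 8.33563; // rate measured earlier in the thread
const HARD_LIMIT_SECONDS = 10;    // recording limit mentioned in the thread

// Estimated reading time; speedFactor 0.8 means reading at 80% of average speed.
function estimateSeconds(charCount, speedFactor = 1) {
  return charCount / (CHARS_PER_SECOND * speedFactor);
}

// Longest sentence (in characters) that still fits the limit at a given speed.
function maxCharsWithinLimit(speedFactor = 1) {
  return Math.floor(HARD_LIMIT_SECONDS * CHARS_PER_SECOND * speedFactor);
}

console.log(estimateSeconds(70).toFixed(1));      // 8.4
console.log(estimateSeconds(70, 0.8).toFixed(1)); // 10.5
console.log(maxCharsWithinLimit(0.8));            // 66
```

Under these assumptions, a 70-character sentence takes about 8.4 seconds at average speed, close to the rough figure above, and exceeds the 10-second limit for a reader at 80% speed; a limit of about 66 characters would leave that reader just inside it.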

sftblw commented 2 years ago

Sorry, closing this PR was a mistake; I'm not familiar with forked GitHub repositories and pressed the "sync" button with the discard option. I'll reopen it after some modifications.

sftblw commented 2 years ago
MichaelKohler commented 2 years ago

:tada: This PR is included in version 2.18.0 :tada:

The release is available on GitHub release

Your semantic-release bot :package::rocket: