common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Add Turkish support (2023-08 finalized) #185

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 1 year ago

This PR adds Turkish support to the Sentence Extractor and can now be merged.

TL;DR

It took me nearly six weeks to learn, analyze, and get some "acceptable" results. I had to run the extraction iteratively many times (>100) to improve the results, over three months' worth of Wiki dumps. I'll leave rather detailed steps below as a record.

The given error rate is the average of 14 samples (400 sentences each), 5,600 sentences in total. This is about 1.62% of the total, which corresponds to a margin of error of ~1.29% at a 95% confidence level.
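For reference, that margin of error can be reproduced with the standard formula for a proportion plus a finite population correction (assuming the worst-case p = 0.5 and z = 1.96 for 95% confidence); the snippet below is only an illustration:

```python
import math

def margin_of_error(n: int, population: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a proportion, with finite population correction."""
    moe = z * math.sqrt(p * (1 - p) / n)                    # simple random sampling
    fpc = math.sqrt((population - n) / (population - 1))    # finite population correction
    return moe * fpc

# 14 samples x 400 sentences, drawn from the ~347k sentences of the final run
n, population = 14 * 400, 347_441
print(f"sampled fraction: {n / population:.2%}")                   # ~1.6%
print(f"margin of error:  {margin_of_error(n, population):.2%}")   # ~1.3%
```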


Problem and Goal

Our main problem lies in the fact that Turkish is an agglutinative language, where a single stem can yield more than a million word forms. Therefore, blacklisting low-frequency words does not work: those are mostly valid words with many suffixes appended, and dropping them could leave out important words, and thus phonemes. Without getting their stems and checking those against dictionaries, we could not white-list the valid tokens so that we could concentrate on the invalid ones.

Our second problem: the stemmers. Due to the nature of the language, none of them works flawlessly, especially on non-dictionary words (place names, nation names, etc.), and they often misfire (we found an error rate of about 20%).
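As a rough illustration of the stem-and-check idea (not the actual scripts used for this work), the sketch below uses the Snowball Turkish stemmer via the snowballstemmer package and a plain one-word-per-line word list; the file name is a placeholder:

```python
import snowballstemmer  # Snowball's Turkish stemmer; like the others, it misfires at times

stemmer = snowballstemmer.stemmer("turkish")

def load_wordlist(path: str) -> set[str]:
    """Load a plain one-word-per-line dictionary (placeholder file)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

dictionary = load_wordlist("tr_wordlist.txt")  # placeholder path

def classify(token: str) -> str:
    """White-list a token if the token itself or its stem is a known dictionary word."""
    lower = token.lower()
    if lower in dictionary:
        return "whitelist"                      # suffixed form is listed directly
    if stemmer.stemWord(lower) in dictionary:
        return "whitelist"                      # stem matches; beware the ~20% stemmer error rate
    return "review"                             # candidate for manual black-listing

print(classify("evlerimizden"))  # a heavily suffixed form of "ev" (house)
```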

Our third problem was the use of 3 random sentences per article, which would not cover all possibilities. We had to work on all possible sentences, changing the code locally so that we could extract every sentence and use them to build a blacklist.

Our fourth problem was the content quality/nature of the Turkish Wikipedia. Most articles are much shorter than their English counterparts, many with little information or merely presenting lists. We had to try to get the most out of them with less aggressive black-listing, e.g. by relaxing the rules on proper names.

Given the script's random selection, we worked on the whole set, mostly white-listing tokens, to arrive at the best possible black-list.

This would have meant a full scan of all sentences (>5.5M) and tokens (>7M). To make this humanly possible, we had to work on the 3-word-minimum version and make use of dictionary checks (fortunately, one open-source dictionary also lists suffixed, non-stem forms, although it is not complete). This lowered the number of tokens we had to check, together with the automated process described below. In spite of this, forming a complete black-list took more than a month.

Other goals we had during the process:

Rules

We intended to extract longer sentences because as of Common Voice v14.0, the average sentence length (and thus recording duration) was low.

| Measure | v14.0 Data | Wiki-Extract From Our Last Run | Real Exported Result |
|---|---|---|---|
| Average Recording Duration (s) | 3.595 | 5.839 | TBA |
| Median Recording Duration (s) | 3.264 | 5.600 | TBA |
| Average Characters/Sentence | 29.923 | 67.521 | TBA |
| Median Characters/Sentence | 22 | 65 | TBA |
| Average Words/Sentence | 4.36 | 8.78 | TBA |
| Median Words/Sentence | 3 | 8 | TBA |
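For the record, the character/word measures in this table can be reproduced from an extractor output file (one sentence per line) with something like the sketch below; the file name is a placeholder:

```python
from statistics import mean, median

# One extracted sentence per line (placeholder path)
with open("wiki.tr.sample.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

char_counts = [len(s) for s in sentences]
word_counts = [len(s.split()) for s in sentences]

print(f"avg/med chars per sentence: {mean(char_counts):.3f} / {median(char_counts)}")
print(f"avg/med words per sentence: {mean(word_counts):.2f} / {median(word_counts)}")
```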

Deciding on minimum words

Please note: values from different runs are not directly comparable, as the rule set and/or blacklists evolved over time.

To find the ideal point, we had to analyze the set multiple times with different limits and compare the results. We ran 2 initial round-ups for this purpose near the middle of our black/white-listing process, also making use of dictionaries.

We found that a minimum of 3 or 4 words would be needed. We used 3 words while forming the blacklist and finally played with min_characters to find a more ideal point, maximizing the resulting recording duration.

Decision: 3 words minimum, but the sentence should be at least 20 chars long.

We relaxed max_word_count to 20 (later increased to 25) and instead limited the length with the newly added max_characters. We initially set it to 115, but after recognizing that it counts alphabetical characters rather than the string length, we dropped it to 105; above that, the 10 s recording limit can be missed if the sentence is spoken more slowly.
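Taken together, the decided limits amount to a length filter like the one sketched below (an illustration in Python, not the extractor's own Rust implementation); in this sketch both character limits are applied to the alphabetical count, mirroring the max_characters behaviour described above:

```python
# Sketch of the decided Turkish length limits: 3-25 words and 20-105
# *alphabetical* characters (~105 alpha chars is where a slow reading
# may exceed the 10 s recording limit).
MIN_WORDS, MAX_WORDS = 3, 25
MIN_ALPHA, MAX_ALPHA = 20, 105

def passes_length_rules(sentence: str) -> bool:
    word_count = len(sentence.split())
    alpha_count = sum(ch.isalpha() for ch in sentence)  # letters only, no spaces/punctuation
    return MIN_WORDS <= word_count <= MAX_WORDS and MIN_ALPHA <= alpha_count <= MAX_ALPHA

print(passes_length_rules("Bu cümle üç kelimeden uzun ve yeterince karakter içeriyor."))  # True
```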

Blacklist formation

In the beginning, we started with the whole set and quickly recognized that checking 1.5M different words would be impossible. So we started working on the min_words = 3 version, also incorporating dictionaries and stemmers to eliminate known words. After this point, we had two phases:

(As this lengthy process stretched past a month, we had to repeat it for the August 2023 dump to check an additional 3,800 words.)

Blacklist Builder Details

A set of simple Python scripts we created runs on the whole token list using multiprocessing and chunking, and contains the following data items:

We did not want words containing non-Turkish-alphabet characters to go into the blacklists, because we can already eliminate them with allowed_symbols_regex; this also dropped the size of the token list and the black-list considerably.
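The scripts themselves are not part of this PR, but the chunked, multiprocessing scan works roughly as in the sketch below; the file names, chunk size, and helpers are illustrative only:

```python
import re
from multiprocessing import Pool

# Tokens with characters outside the Turkish alphabet never reach the blacklist;
# allowed_symbols_regex already removes those sentences on the extractor side.
TURKISH_WORD = re.compile(r"^[abcçdefgğhıijklmnoöprsştuüvyz]+$", re.IGNORECASE)

def load_set(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def init_worker(dictionary: frozenset) -> None:
    global DICTIONARY            # each worker gets its own read-only copy
    DICTIONARY = dictionary

def review_candidates(tokens: list[str]) -> list[str]:
    """Keep tokens that are Turkish-alphabet-only but not known dictionary words."""
    return [t for t in tokens if TURKISH_WORD.match(t) and t not in DICTIONARY]

if __name__ == "__main__":
    dictionary = frozenset(load_set("tr_wordlist.txt"))  # placeholder word list (incl. suffixed forms)
    tokens = sorted(load_set("all_tokens.txt"))          # placeholder dump of all extracted tokens
    chunks = [tokens[i:i + 50_000] for i in range(0, len(tokens), 50_000)]
    with Pool(initializer=init_worker, initargs=(dictionary,)) as pool:
        candidates = [t for chunk in pool.map(review_candidates, chunks) for t in chunk]
    print(f"{len(candidates)} tokens left for manual black-list review")
```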

The Algorithm

Iterative process

We decided on a more thorough analysis of the data because the list remained huge and contained many words with characters not in the alphabet. We should have used the rules from the start, and we should not have used the --no_check parameter. So we ran the two round-ups mentioned above:

Final Phase: To finally get the full blacklist, check the results, and enhance some rules in the process.

Results After 3rd Iteration with 2023-08 Wikipedia Dump

| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len | Tokens | Non-dict | Description |
|---|---|---|---|---|---|---|---|---|---|
| max | :x: | :x: | 1 | | 5,618,313 | 110.13 | 1,953,986 | 1,541,266 | U1, no_check, no replace, no apostrophes split |
| maxs | :x: | :x: | 1 | | 5,607,441 | 111.25 | 1,586,532 | 1,172,797 | U2, no_check, +replace, +ap. split |
| maxsp | :x: | :x: | 1 | | 5,509,426 | 109.51 | 1,471,893 | 1,060,495 | U2, no_check, +replace, +bracket removal, +ap. split |
| maxr | :white_check_mark: | :x: | 1 | | 1,211,889 | 81.25 | 530,259 | 273,839 | U3, +rules (non-limiting) |
| maxrb | :white_check_mark: | :white_check_mark: | 1 | | 848,212 | 78.79 | 328,092 | 97,034 | U4, +rules (non-limiting), +BL (125k) |
| Z4r | :white_check_mark: | :x: | 4 | | 1,001,331 | 70.55 | 446,920 | 183,945 | No BL |
| Z4rb | :white_check_mark: | :white_check_mark: | 4 | | 706,169 | 69.15 | 269,451 | 31,089 | BL (126k) |
| Z4rb3s | :white_check_mark: | :white_check_mark: | 4 | 3 | 342,263 | 68.18 | 182,634 | 15,582 | BL (126k) |
| Z3r | :white_check_mark: | :x: | 3 | | 1,026,916 | 69.44 | 451,385 | 186,868 | No BL |
| Z3rb | :white_check_mark: | :white_check_mark: | 3 | | 725,908 | 67.96 | 272,042 | 31,993 | BL (126k) |
| Z3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | 348,316 | 67.02 | 182,895 | 15,582 | BL (126k) |

After final adjustments to min sentence length:

| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len | Tokens | Non-dict | Description |
|---|---|---|---|---|---|---|---|---|---|
| F3r | :white_check_mark: | :x: | 3 | | 1,017,403 | 69.94 | 450,327 | 165,966 | +rules - No BL (all possible) |
| F3rb | :white_check_mark: | :white_check_mark: | 3 | | 720,819 | 68.54 | 267,912 | 7,204 | +BL (131k) (remaining possible) |
| F3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | 347,441 | 67.52 | 181,675 | 4,625 | +3 sentence/article |

Simple statistics of the final run

{
    "lc": "tr",
    "infile": "/home/bozden/GITREPO/data/wiki.tr.F3rb3s.txt",
    "char_dur": 0.1,
    "s_cnt": 347441,
    "sentence_len": {
        "tot": 23459807,
        "min": 23,
        "max": 129,
        "avg": 67.52170008720906,
        "med": 65.0,
        "std": 24.050121831833685
    },
    "normalized_len": {
        "tot": 23085498,
        "min": 22,
        "max": 128,
        "avg": 66.44436897199812,
        "med": 64.0,
        "std": 24.019759366951778
    },
    "alpha_len": {
        "tot": 20289265,
        "min": 20,
        "max": 105,
        "avg": 58.39628886631112,
        "med": 56.0,
        "std": 20.94889008475279
    },
    "word_count": {
        "tot": 3051270,
        "min": 3,
        "max": 25,
        "avg": 8.782124159209765,
        "med": 8.0,
        "std": 3.2639047106477945
    },
    "duration": {
        "tot": 563.5906944444444,
        "min": 2.0,
        "max": 10.5,
        "avg": 5.839628886631111,
        "med": 5.6000000000000005,
        "std": 2.09488900847528
    }
}

We expect 560-600 hours of single recordings from this set.
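The duration figures throughout this PR come from a simple character-based estimate ("char_dur": 0.1, i.e. 0.1 s per alphabetical character); the totals above can be checked as follows:

```python
CHAR_DUR = 0.1  # seconds per alphabetical character ("char_dur" in the statistics above)

def estimate_seconds(sentence: str) -> float:
    """Per-sentence duration estimate: 0.1 s per letter."""
    return CHAR_DUR * sum(ch.isalpha() for ch in sentence)

# Checking the totals of the final run without re-reading the corpus:
total_alpha_chars = 20_289_265                       # "alpha_len.tot" above
total_hours = CHAR_DUR * total_alpha_chars / 3600
print(f"estimated single-recording total: {total_hours:.1f} h")   # ~563.6 h
```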

Test sets

We first alpha-tested some samples and made some corrections.

| No | Persona | Sample No: Error Rate |
|---|---|---|
| 1 | Me (knows stuff) (TR/EN/DE) | 001: 1.00% - 002: 3.00% |

Initial findings:

For a population size of ~350,000, a 95% confidence level, and a 2% margin of error, we needed a sample size of 2,385. Rounding this up to 2,400, we created 6 non-intersecting sets of 400 sentences each.
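The 2,385 figure follows from the standard sample-size formula with a finite population correction (p = 0.5, z = 1.96 for 95% confidence, 2% margin of error):

```python
import math

def required_sample_size(population: int, margin: float, z: float = 1.96, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population sample size (~2401)
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return math.ceil(n)

print(required_sample_size(350_000, 0.02))  # -> 2385
```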

For this, we used the 4-word-minimum, 3-sentences-per-article set as the population and offered the samples to volunteers via translated/enhanced Excel sheets.

| No | Persona | Sample No: Errors / Rate |
|---|---|---|
| 1 | Me (knows stuff) (TR/EN/DE) | 01: 7 / 1.75% & 02: 5 / 1.25% + |
| 2 | Ret. radio speaker (TR) | 03: 4 / 1.00% & 09: 4 / 1.00% r |
| 3 | Ret. pharmacist (TR/little EN) | 04: 15 / 3.75% & 08: 11 / 2.75% r |
| 4 | High school student (TR/EN) | 05: 14 / 3.50% |
| 5 | AI expert (TR/EN) | 06: 6 / 1.50% |
| 6 | Art historian (TR/little EN) | 07: 16 / 4.00% |
| 7 | Computer Engineer (TR/EN) | 10: 4 / 1.00% |

Total errors / Total Sentences - Error rate: 86 / 4,000 - 2.15%

After more iterations and black/white-listing, using samples from the min_words = 3 set:

| No | Persona | Sample No: Errors / Rate |
|---|---|---|
| 1 | Me (knows stuff) (TR/EN/DE) | 11: 11 / 2.75% |
| 2 | Ret. radio speaker (TR) | 12: 7 / 1.75% |
| 3 | Ret. pharmacist (TR/little EN) | 13: 15 / 3.75% |
| 7 | Computer Engineer (TR/EN) | 15: 14 / 3.50% |

Total errors / Total Sentences - Error rate: 47 / 1,600 - 2.94%

In total, the error rate becomes 133 / 5,600 ≈ 2.38%.

Code additions/fixes

Over the course of this work, we needed some code changes/additions:

Suggested additions:

MichaelKohler commented 1 year ago

This is very extensive documentation, great! I had a quick look at the code and that all looks good to me. I didn't question the values too much as I think you explained most of it in the description here anyway.

Could you rebase on top of the latest main branch so that the CI checks run again? Apart from that I'd say "ping me for final review" when completely ready :)