common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Add Turkish support (2023-08 finalized) #185

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 1 year ago

This PR adds Turkish support to the Sentence Extractor and can now be merged.

TL;DR

It took me nearly six weeks to learn, analyze, and get some "acceptable" results. I had to run the extraction iteratively many times (>100) to improve the results, over three months' worth of Wiki dumps. I'll leave rather detailed steps below as a record.

The given error rate is the average of 14 samples (400 sentences each), 5,600 sentences in total. This is about 1.62% of the total, which corresponds to a margin of error of ~1.29% at a 95% confidence level.
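For reference, that margin of error can be reproduced with the standard formula for a proportion plus a finite population correction (assuming the worst-case p = 0.5 and z = 1.96 for 95% confidence); the snippet below is only an illustration:

```python
import math

def margin_of_error(n: int, population: int, z: float = 1.96, p: float = 0.5) -> float:
    """Margin of error for a proportion, with finite population correction."""
    moe = z * math.sqrt(p * (1 - p) / n)                    # simple random sampling
    fpc = math.sqrt((population - n) / (population - 1))    # finite population correction
    return moe * fpc

# 14 samples x 400 sentences, drawn from the ~347k sentences of the final run
n, population = 14 * 400, 347_441
print(f"sampled fraction: {n / population:.2%}")                   # ~1.6%
print(f"margin of error:  {margin_of_error(n, population):.2%}")   # ~1.3%
```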


Problem and Goal

Our main problem lies in the fact that Turkish is an agglutinative language, where a single stem can yield more than a million word forms. Therefore, blacklisting low-frequency words does not work: those are mostly valid words with many suffixes appended, and dropping them could leave out important words, and thus phonemes. Without getting their stems and checking those against dictionaries, we could not white-list the valid tokens so that we could concentrate on the invalid ones.

Our second problem: the stemmers. Due to the nature of the language, none of them works flawlessly, especially on non-dictionary words (place names, nation names, etc.), and they often misfire (we found an error rate of about 20%).
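As a rough illustration of the stem-and-check idea (not the actual scripts used for this work), the sketch below uses the Snowball Turkish stemmer via the snowballstemmer package and a plain one-word-per-line word list; the file name is a placeholder:

```python
import snowballstemmer  # Snowball's Turkish stemmer; like the others, it misfires at times

stemmer = snowballstemmer.stemmer("turkish")

def load_wordlist(path: str) -> set[str]:
    """Load a plain one-word-per-line dictionary (placeholder file)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

dictionary = load_wordlist("tr_wordlist.txt")  # placeholder path

def classify(token: str) -> str:
    """White-list a token if the token itself or its stem is a known dictionary word."""
    lower = token.lower()
    if lower in dictionary:
        return "whitelist"                      # suffixed form is listed directly
    if stemmer.stemWord(lower) in dictionary:
        return "whitelist"                      # stem matches; beware the ~20% stemmer error rate
    return "review"                             # candidate for manual black-listing

print(classify("evlerimizden"))  # a heavily suffixed form of "ev" (house)
```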

Our third problem was the use of 3 random sentences per article, which would not cover all possibilities. We had to work on all possible sentences, changing the code locally so that we could extract every sentence and use them to build a blacklist.

Our fourth problem was the content quality/nature of the Turkish Wikipedia. Most articles are much shorter than their English counterparts, many with little information or merely presenting lists. We had to try to get the most out of them with less aggressive black-listing, e.g. by relaxing the rules on proper names.

Given the script's random selection, we worked on the whole set, mostly white-listing tokens, to arrive at the best possible black-list.

This would have meant a full scan of all sentences (>5.5M) and tokens (>7M). To make this humanly possible, we had to work on the 3-word-minimum version and make use of dictionary checks (fortunately, one open-source dictionary also lists suffixed, non-stem forms, although it is not complete). This lowered the number of tokens we had to check, together with the automated process described below. In spite of this, forming a complete black-list took more than a month.

Other goals we had during the process:

Rules

We intended to extract longer sentences because as of Common Voice v14.0, the average sentence length (and thus recording duration) was low.

| Measure | v14.0 Data | Wiki-Extract From Our Last Run | Real Exported Result |
|---|---|---|---|
| Average Recording Duration (s) | 3.595 | 5.839 | TBA |
| Median Recording Duration (s) | 3.264 | 5.600 | TBA |
| Average Characters/Sentence | 29.923 | 67.521 | TBA |
| Median Characters/Sentence | 22 | 65 | TBA |
| Average Words/Sentence | 4.36 | 8.78 | TBA |
| Median Words/Sentence | 3 | 8 | TBA |
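For the record, the character/word measures in this table can be reproduced from an extractor output file (one sentence per line) with something like the sketch below; the file name is a placeholder:

```python
from statistics import mean, median

# One extracted sentence per line (placeholder path)
with open("wiki.tr.sample.txt", encoding="utf-8") as f:
    sentences = [line.strip() for line in f if line.strip()]

char_counts = [len(s) for s in sentences]
word_counts = [len(s.split()) for s in sentences]

print(f"avg/med chars per sentence: {mean(char_counts):.3f} / {median(char_counts)}")
print(f"avg/med words per sentence: {mean(word_counts):.2f} / {median(word_counts)}")
```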

Deciding on minimum words

Please note: values from different runs are not directly comparable, as the rule set and/or blacklists evolved over time.

To find the ideal point, we had to analyze the set multiple times with different limits and compare the results. We ran 2 initial round-ups for this purpose near the middle of our black/white-listing process, also making use of dictionaries.

We found that a minimum of 3 or 4 words would be needed. We used 3 words while forming the blacklist and finally played with min_characters to find a more ideal point, maximizing the resulting recording duration.

Decision: 3 words minimum, but the sentence should be at least 20 chars long.

We relaxed max_word_count to 20 (later increased to 25) and instead limited the length with the newly added max_characters. We initially set it to 115, but after recognizing that it counts alphabetical characters rather than the string length, we dropped it to 105; above that, the 10 s recording limit can be missed if the sentence is spoken more slowly.
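Taken together, the decided limits amount to a length filter like the one sketched below (an illustration in Python, not the extractor's own Rust implementation); in this sketch both character limits are applied to the alphabetical count, mirroring the max_characters behaviour described above:

```python
# Sketch of the decided Turkish length limits: 3-25 words and 20-105
# *alphabetical* characters (~105 alpha chars is where a slow reading
# may exceed the 10 s recording limit).
MIN_WORDS, MAX_WORDS = 3, 25
MIN_ALPHA, MAX_ALPHA = 20, 105

def passes_length_rules(sentence: str) -> bool:
    word_count = len(sentence.split())
    alpha_count = sum(ch.isalpha() for ch in sentence)  # letters only, no spaces/punctuation
    return MIN_WORDS <= word_count <= MAX_WORDS and MIN_ALPHA <= alpha_count <= MAX_ALPHA

print(passes_length_rules("Bu cümle üç kelimeden uzun ve yeterince karakter içeriyor."))  # True
```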

Blacklist formation

In the beginning, we started with the whole set and quickly recognized that checking 1.5M different words would be impossible. So we started working on the min_words = 3 version, also incorporating dictionaries and stemmers to eliminate known words. After this point, we had two phases:

(As this lengthy process stretched past a month, we had to repeat it for the August 2023 dump to check an additional 3,800 words.)

Blacklist Builder Details

A set of simple Python scripts we created runs on the whole token list using multiprocessing and chunking, and contains the following data items:

We did not want words containing non-Turkish-alphabet characters to go into the blacklists, because we can already eliminate them with allowed_symbols_regex; this also dropped the size of the token list and the black-list considerably.
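The scripts themselves are not part of this PR, but the chunked, multiprocessing scan works roughly as in the sketch below; the file names, chunk size, and helpers are illustrative only:

```python
import re
from multiprocessing import Pool

# Tokens with characters outside the Turkish alphabet never reach the blacklist;
# allowed_symbols_regex already removes those sentences on the extractor side.
TURKISH_WORD = re.compile(r"^[abcçdefgğhıijklmnoöprsştuüvyz]+$", re.IGNORECASE)

def load_set(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def init_worker(dictionary: frozenset) -> None:
    global DICTIONARY            # each worker gets its own read-only copy
    DICTIONARY = dictionary

def review_candidates(tokens: list[str]) -> list[str]:
    """Keep tokens that are Turkish-alphabet-only but not known dictionary words."""
    return [t for t in tokens if TURKISH_WORD.match(t) and t not in DICTIONARY]

if __name__ == "__main__":
    dictionary = frozenset(load_set("tr_wordlist.txt"))  # placeholder word list (incl. suffixed forms)
    tokens = sorted(load_set("all_tokens.txt"))          # placeholder dump of all extracted tokens
    chunks = [tokens[i:i + 50_000] for i in range(0, len(tokens), 50_000)]
    with Pool(initializer=init_worker, initargs=(dictionary,)) as pool:
        candidates = [t for chunk in pool.map(review_candidates, chunks) for t in chunk]
    print(f"{len(candidates)} tokens left for manual black-list review")
```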

The Algorithm

Iterative process

We decided on a more thorough analysis of the data because the list remained huge and contained many words with characters not in the alphabet. We should have used the rules from the start, and we should not have used the --no_check parameter. So we ran the two round-ups mentioned above:

Final Phase: To finally get the full blacklist, check the results, and enhance some rules in the process.

Results After 3rd Iteration with 2023-08 Wikipedia Dump

| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len | Tokens | Non-dict | Description |
|---|---|---|---|---|---|---|---|---|---|
| max | :x: | :x: | 1 | | 5,618,313 | 110.13 | 1,953,986 | 1,541,266 | U1, no_check, no replace, no apostrophes split |
| maxs | :x: | :x: | 1 | | 5,607,441 | 111.25 | 1,586,532 | 1,172,797 | U2, no_check, +replace, +ap. split |
| maxsp | :x: | :x: | 1 | | 5,509,426 | 109.51 | 1,471,893 | 1,060,495 | U2, no_check, +replace, +bracket removal, +ap. split |
| maxr | :white_check_mark: | :x: | 1 | | 1,211,889 | 81.25 | 530,259 | 273,839 | U3, +rules (non-limiting) |
| maxrb | :white_check_mark: | :white_check_mark: | 1 | | 848,212 | 78.79 | 328,092 | 97,034 | U4, +rules (non-limiting), +BL (125k) |
| Z4r | :white_check_mark: | :x: | 4 | | 1,001,331 | 70.55 | 446,920 | 183,945 | No BL |
| Z4rb | :white_check_mark: | :white_check_mark: | 4 | | 706,169 | 69.15 | 269,451 | 31,089 | BL (126k) |
| Z4rb3s | :white_check_mark: | :white_check_mark: | 4 | 3 | 342,263 | 68.18 | 182,634 | 15,582 | BL (126k) |
| Z3r | :white_check_mark: | :x: | 3 | | 1,026,916 | 69.44 | 451,385 | 186,868 | No BL |
| Z3rb | :white_check_mark: | :white_check_mark: | 3 | | 725,908 | 67.96 | 272,042 | 31,993 | BL (126k) |
| Z3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | 348,316 | 67.02 | 182,895 | 15,582 | BL (126k) |

After final adjustments to min sentence length:

| Exp | Rules | BL | MinW | Sent/Art | Sentences | Avg. Len | Tokens | Non-dict | Description |
|---|---|---|---|---|---|---|---|---|---|
| F3r | :white_check_mark: | :x: | 3 | | 1,017,403 | 69.94 | 450,327 | 165,966 | +rules - No BL (all possible) |
| F3rb | :white_check_mark: | :white_check_mark: | 3 | | 720,819 | 68.54 | 267,912 | 7,204 | +BL (131k) (remaining possible) |
| F3rb3s | :white_check_mark: | :white_check_mark: | 3 | 3 | 347,441 | 67.52 | 181,675 | 4,625 | +3 sentence/article |

Simple statistics of the final run

{
    "lc": "tr",
    "infile": "/home/bozden/GITREPO/data/wiki.tr.F3rb3s.txt",
    "char_dur": 0.1,
    "s_cnt": 347441,
    "sentence_len": {
        "tot": 23459807,
        "min": 23,
        "max": 129,
        "avg": 67.52170008720906,
        "med": 65.0,
        "std": 24.050121831833685
    },
    "normalized_len": {
        "tot": 23085498,
        "min": 22,
        "max": 128,
        "avg": 66.44436897199812,
        "med": 64.0,
        "std": 24.019759366951778
    },
    "alpha_len": {
        "tot": 20289265,
        "min": 20,
        "max": 105,
        "avg": 58.39628886631112,
        "med": 56.0,
        "std": 20.94889008475279
    },
    "word_count": {
        "tot": 3051270,
        "min": 3,
        "max": 25,
        "avg": 8.782124159209765,
        "med": 8.0,
        "std": 3.2639047106477945
    },
    "duration": {
        "tot": 563.5906944444444,
        "min": 2.0,
        "max": 10.5,
        "avg": 5.839628886631111,
        "med": 5.6000000000000005,
        "std": 2.09488900847528
    }
}

We expect 560-600 hours of single recordings from this set.
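The duration figures throughout this PR come from a simple character-based estimate ("char_dur": 0.1, i.e. 0.1 s per alphabetical character); the totals above can be checked as follows:

```python
CHAR_DUR = 0.1  # seconds per alphabetical character ("char_dur" in the statistics above)

def estimate_seconds(sentence: str) -> float:
    """Per-sentence duration estimate: 0.1 s per letter."""
    return CHAR_DUR * sum(ch.isalpha() for ch in sentence)

# Checking the totals of the final run without re-reading the corpus:
total_alpha_chars = 20_289_265                       # "alpha_len.tot" above
total_hours = CHAR_DUR * total_alpha_chars / 3600
print(f"estimated single-recording total: {total_hours:.1f} h")   # ~563.6 h
```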

Test sets

We first alpha-tested some samples and made some corrections.

| No | Persona | Sample No: Error Rate |
|---|---|---|
| 1 | Me (knows stuff) (TR/EN/DE) | 001: 1.00% - 002: 3.00% |

Initial findings:

For a population size of ~350,000, a 95% confidence level, and a 2% margin of error, we needed a sample size of 2,385. Rounding this up to 2,400, we created 6 non-intersecting sets of 400 sentences each.
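The 2,385 figure follows from the standard sample-size formula with a finite population correction (p = 0.5, z = 1.96 for 95% confidence, 2% margin of error):

```python
import math

def required_sample_size(population: int, margin: float, z: float = 1.96, p: float = 0.5) -> int:
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population sample size (~2401)
    n = n0 / (1 + (n0 - 1) / population)        # finite population correction
    return math.ceil(n)

print(required_sample_size(350_000, 0.02))  # -> 2385
```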

For this, we used the 4-word-minimum, 3-sentences-per-article set as the population and offered the samples to volunteers via translated/enhanced Excel sheets.

| No | Persona | Sample No: Errors / Rate |
|---|---|---|
| 1 | Me (knows stuff) (TR/EN/DE) | 01: 7 / 1.75% & 02: 5 / 1.25% + |
| 2 | Ret. radio speaker (TR) | 03: 4 / 1.00% & 09: 4 / 1.00% r |
| 3 | Ret. pharmacist (TR/little EN) | 04: 15 / 3.75% & 08: 11 / 2.75% r |
| 4 | High school student (TR/EN) | 05: 14 / 3.50% |
| 5 | AI expert (TR/EN) | 06: 6 / 1.50% |
| 6 | Art historian (TR/little EN) | 07: 16 / 4.00% |
| 7 | Computer Engineer (TR/EN) | 10: 4 / 1.00% |

Total errors / Total Sentences - Error rate: 86 / 4,000 - 2.15%

After more iterations and black/white-listing, using samples from the min_words = 3 set:

| No | Persona | Sample No: Errors / Rate |
|---|---|---|
| 1 | Me (knows stuff) (TR/EN/DE) | 11: 11 / 2.75% |
| 2 | Ret. radio speaker (TR) | 12: 7 / 1.75% |
| 3 | Ret. pharmacist (TR/little EN) | 13: 15 / 3.75% |
| 7 | Computer Engineer (TR/EN) | 15: 14 / 3.50% |

Total errors / Total Sentences - Error rate: 47 / 1,600 - 2.94%

In total, the error rate becomes 133 / 5,600 ≈ 2.38%.

Code additions/fixes

Over the course of this work, we needed some code changes/additions:

Suggested additions:

MichaelKohler commented 1 year ago

This is very extensive documentation, great! I had a quick look at the code and that all looks good to me. I didn't question the values too much as I think you explained most of it in the description here anyway.

Could you rebase on top of the latest main branch so that the CI checks run again? Apart from that I'd say "ping me for final review" when completely ready :)