This is very extensive documentation, great! I had a quick look at the code and that all looks good to me. I didn't question the values too much as I think you explained most of it in the description here anyway.
Could you rebase on top of the latest main branch so that the CI checks run again? Apart from that I'd say "ping me for final review" when completely ready :)
This PR will add Turkish support to the Sentence Extractor - which can now be merged.
TL;DR
It took me nearly six weeks to learn, analyze and get some "acceptable" results. I had to run it iteratively many (>100) times to get better results, over 3 months' worth of Wiki dumps. I'll leave rather detailed steps below as a record.
The given error rate is the average of 14 samples (400 sentences each), a total of 5,600 sentences. This is about 1.62% of the total, or a ~1.29% margin of error at a 95% confidence level.
Problem and Goal
Our main problem lies in the fact that Turkish is an agglutinative language, where a single stem can produce more than a million surface forms. Therefore, blacklisting low-frequency words does not work: those are mostly valid words with many suffixes appended, and removing them would drop important words, and thus phonemes. Without extracting their stems and checking them against dictionaries, we could not white-list the valid ones so that we could focus on the invalid ones.
Our second problem: the stemmers. Due to the nature of the language, none of them works flawlessly, especially for non-dictionary words (place names, nation names, etc.), and they often misfire (we found an error rate of about 20%).
Our third problem was the use of 3 random sentences per article, which would not cover all possibilities. We had to work on all possible sentences, changing the code locally so that we could extract every sentence and work on them to build a blacklist.
Our fourth problem was the content quality/nature of the Turkish Wikipedia. Most articles are much shorter than their English counterparts, many with little information or just presenting lists. We had to try to get the most out of these with less black-listing, e.g. by relaxing proper names.
Given the random selection done by the script, we therefore worked on the whole set, mostly white-listing tokens, to be able to derive the best black-list.
This would have meant a full scan of all sentences (>5.5M) and tokens (>7M). To make this humanly possible, we had to work on the 3-word-minimum version and make use of dictionary checks (fortunately one open-source dictionary had non-stem word forms, although not complete). This lowered the number of tokens we had to check, also with the help of the automated process described below. In spite of this, we had to spend more than a month to form a complete black-list.
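A minimal sketch of this kind of dictionary pre-filtering (the wordlist path and format below are placeholders, assuming one inflected form per line, not the actual dictionary used here):

```python
# Sketch only: partition tokens by membership in a wordlist of inflected
# (non-stem) forms, so only the unknown ones need manual review.
def split_tokens(tokens: list[str], wordlist_path: str) -> tuple[set[str], set[str]]:
    """Return (known, unknown) token sets based on the wordlist."""
    with open(wordlist_path, encoding="utf-8") as f:
        known_words = {line.strip().lower() for line in f if line.strip()}
    # Note: Turkish dotted/dotless-i casing is glossed over by plain lower().
    known = {t for t in tokens if t.lower() in known_words}
    unknown = {t for t in tokens if t.lower() not in known_words}
    return known, unknown
```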
Other goals we had during the process:
Rules
We intended to extract longer sentences because as of Common Voice v14.0, the average sentence length (and thus recording duration) was low.
Deciding on minimum words
To find the ideal point, we had to analyze the set multiple times for different limits and compare the results. We ran 2 initial round-ups for this purpose, near the middle of our black/white-listing process, also making use of dictionaries.
We found that a minimum of 3 or 4 words would be needed. We used 3 words while forming the blacklist, and finally played with `min_characters` to find a more ideal point, maximizing the resulting recording duration.
Decision: 3 words minimum, but the sentence should be at least 20 chars long.
We relaxed the `max_word_count` to 20 (later increased to 25), aiming to limit the length by the newly added `max_characters` (which we set to 115 initially, but after recognizing it is the number of alphabetical characters - not the string length - we dropped it to 105; above that, the 10s recording limit can be missed if spoken more slowly).
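As an aside, a tiny illustration (in Python, not the extractor's Rust code) of the difference that matters here - counting letters only vs. the full string length:

```python
def letter_count(sentence: str) -> int:
    # Count alphabetical characters only, ignoring spaces and punctuation.
    return sum(ch.isalpha() for ch in sentence)

s = "Bu cümle, boşluklar ve noktalama yüzünden harf sayısından daha uzundur."
print(len(s), letter_count(s))  # the raw length is noticeably larger than the letter count
```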
Blacklist formation
In the beginning, we started with the whole set and quickly recognized that checking 1.5M different words would be impossible. So we started working on the `min_words` = 3 version, also incorporating dictionaries and stemmers to eliminate known words (a stemming sketch follows the list below). After this point, we had two phases:
- Run the Blacklist Builder (below) iteratively to get an intermediate blacklist (working the most frequent words from top to bottom).
- (As we passed the one-month mark in this lengthy process, we had to repeat it for the August 2023 dump to check an additional 3,800 words.)
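The stemming step mentioned above could look roughly like this sketch, assuming the Snowball Turkish stemmer (`pip install snowballstemmer`) and a toy stem list; the actual dictionaries and stemmers used differ, and as noted earlier stemmers misfire often, so this only pre-filters candidates:

```python
import snowballstemmer

stemmer = snowballstemmer.stemmer("turkish")
KNOWN_STEMS = {"ev", "okul", "kitap"}  # toy example; the real lists are far larger

def probably_valid(token: str) -> bool:
    """True if the token's stem is found among known dictionary stems."""
    return stemmer.stemWord(token.lower()) in KNOWN_STEMS

# e.g. "evlerimizden" ("from our houses") is expected to reduce to the stem "ev"
print(probably_valid("evlerimizden"))
```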
Blacklist Builder Details
A set of simple Python scripts we created runs on the whole token list using multiprocessing and chunks, and contains the following data items:
We did not want non-Turkish-alphabet words to go into the blacklists, because we can already eliminate them with `allowed_symbols_regex`, which also dropped the size of the tokens and black-list considerably.
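A minimal sketch of such a chunked, multiprocessing pass over the token list (chunk size and the known-word check are placeholders, not the actual scripts):

```python
from multiprocessing import Pool

CHUNK_SIZE = 50_000
KNOWN_WORDS: set[str] = set()  # filled from dictionaries/white-lists in the real scripts

def check_chunk(chunk: list[str]) -> list[str]:
    """Return the tokens in this chunk that are not yet known/white-listed."""
    return [t for t in chunk if t.lower() not in KNOWN_WORDS]

def build_review_list(tokens: list[str]) -> list[str]:
    """Split the token list into chunks and check them in parallel."""
    chunks = [tokens[i:i + CHUNK_SIZE] for i in range(0, len(tokens), CHUNK_SIZE)]
    with Pool() as pool:  # one worker process per CPU core
        results = pool.map(check_chunk, chunks)
    return [t for part in results for t in part]

if __name__ == "__main__":
    # With an empty KNOWN_WORDS everything comes back for manual review.
    print(build_review_list(["ev", "evlerimizden", "qwerty123"]))
```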
The Algorithm
Iterative process
We used two imaginary breaking points for our iterative process:
During Part 1 & 2:
Part-3:
We decided on a more thorough analysis of the data because the list kept being huge and had many words with characters not in the alphabet. We should have used the rules from the start and not used the `--no_check` parameter! So we ran the two round-ups mentioned above.
Final Phase: To finally get the full blacklist, check the results, and enhance some rules in the process.
Results After 3rd Iteration with 2023-08 Wikipedia Dump
After final adjustments to min sentence length:
Simple statistics of the final run
We expect 560-600 hours of single recordings from this set.
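As a rough sanity check of that expectation (the ~350,000 sentence count is taken from the sampling section below; the ~6-second average recording length is an assumption here, not a measured value):

```python
sentences = 350_000   # approximate size of the extracted set (see the test-set section)
avg_seconds = 6.0     # assumed average recording length per sentence
print(sentences * avg_seconds / 3600)  # ≈ 583 hours, inside the 560-600 hour range
```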
Test sets
We alpha-tested some sampling and made some corrections first.
Initial findings:
- These resulted in `stem_separator_regex` changes, so we had to repeat the X4rb3s and test generation.
- Added `-` to `stem_separator_regex`.
For a population size of ~350,000, with a 95% confidence level and a 2% margin of error, we needed a sample size of 2,385. Rounding this up to 2,400, we created 6 non-intersecting sets of 400 sentences each.
For this, we used the 4-word-minimum, 3-sentences-per-article extraction as the population and offered the sets to volunteers via translated/enhanced Excel sheets.
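The sample size above follows from the standard proportion formula with a finite-population correction; a quick check (z-score and worst-case p = 0.5 are the usual assumptions):

```python
import math

z = 1.96      # z-score for a 95% confidence level
p = 0.5       # worst-case proportion (maximizes the required sample size)
e = 0.02      # 2% margin of error
N = 350_000   # population size (extracted sentences)

n0 = (z ** 2) * p * (1 - p) / e ** 2  # infinite-population sample size -> 2401
n = n0 / (1 + (n0 - 1) / N)           # finite-population correction
print(math.ceil(n))                   # -> 2385, rounded up to 2400 in practice
```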
Total errors / Total Sentences - Error rate: 86 / 4,000 - 2.15%
After more iterations & black/whitelisting, using samples from min_words = 3:
Total errors / Total Sentences - Error rate: 47 / 1,600 - 2.94%
In general, the error rate becomes 134 / 5,600 = 2.39%
Code additions/fixes
During the course of this work, we needed some code changes/additions:
- `max_characters` rule for cv-sentence-extractor
- `stem_separator_regex` rule for enabling stem-word extraction from apostrophe-suffixed words in cv-sentence-extractor
- `bracket_removal_list` rule: to remove parentheses/brackets and the content inside them from a sentence (see the sketch below)
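For illustration, a Python equivalent of the bracket-removal behaviour (the rule itself lives in the extractor, and nested brackets would need repeated passes):

```python
import re

BRACKETS = re.compile(r"\([^()]*\)|\[[^\[\]]*\]")

def strip_brackets(sentence: str) -> str:
    """Drop bracketed spans and collapse the double spaces they leave behind."""
    return re.sub(r"\s{2,}", " ", BRACKETS.sub("", sentence)).strip()

print(strip_brackets("İstanbul (eski adıyla Konstantinopolis) bir liman kentidir."))
# -> "İstanbul bir liman kentidir."
```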
Suggested additions:
- `replace_unicode` to regex-replace same-looking characters from other Unicode pages - e.g. written with Cyrillic keyboards. We had a lot of them and had to use the replace list. Many Turkic countries use these keyboards, and some of the other Turkic-language Wiki articles got translated via campaigns. We found ~75k such words affecting ~25k sentences.
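A sketch of what such a replacement could look like (the `replace_unicode` rule does not exist yet; the mapping below covers only a few common Cyrillic look-alikes):

```python
CYRILLIC_LOOKALIKES = str.maketrans({
    "\u0430": "a",  # а -> a
    "\u0435": "e",  # е -> e
    "\u043e": "o",  # о -> o
    "\u0440": "p",  # р -> p
    "\u0441": "c",  # с -> c
    "\u0443": "y",  # у -> y
    "\u0445": "x",  # х -> x
    "\u0410": "A",  # А -> A
    "\u0415": "E",  # Е -> E
    "\u041e": "O",  # О -> O
})

def replace_lookalikes(sentence: str) -> str:
    """Replace same-looking Cyrillic characters with their Latin equivalents."""
    return sentence.translate(CYRILLIC_LOOKALIKES)

print(replace_lookalikes("\u0410nk\u0430ra"))  # Cyrillic А/а inside "Ankara" -> "Ankara"
```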