Parentheses are problematic for Common Voice use, for two reasons.
First, they disrupt the natural flow of speech and are not used in everyday speech. When read from a text, they may be "read" aloud as "open parenthesis" and "close parenthesis", which invalidates the recordings in CV.
Second, Wiki resources often use parentheses to give the original wording in the original language, how it is pronounced, etc. These are mostly in another alphabet, and their presence removes the sentence from the pool of possible sentences.
Here are some examples of the second point, from the article Istanbul:
The city was founded as Byzantium (Greek: Βυζάντιον, Byzantion) in the 7th century BCE by Greek settlers from Megara.
In English the stress is on the first or last syllable, but in Turkish it is on the second syllable (-tan-).
On the European side, near the point of the peninsula (Sarayburnu), there was a Thracian settlement during the early 1st millennium BCE.
If you allow parentheses in allowed_symbols_regex, the first problem arises.
Parentheses should not appear in the CV text corpora, and removing them would leave more candidates for the random selection (rule-related effects, shortened sentences, lesser-known words blacklisted, etc.), which is especially important for lower-resource languages. I expect a 10-20% increase in the raw resource (to be measured).
So I propose (and am starting to implement) a rule "remove_parentheses": boolean, default false.
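As a rough illustration of what such a rule could do, here is a minimal Python sketch. It is not the actual implementation: the function name, the `enabled` flag, and the choice to drop the parenthesised text together with the parentheses are all assumptions made for this example.

```python
import re

# Matches one parenthesised group without nested parentheses,
# together with any whitespace directly in front of it.
_PAREN_GROUP = re.compile(r"\s*\([^()]*\)")


def remove_parentheses(sentence: str, enabled: bool = False) -> str:
    """Hypothetical sketch of the proposed rule (default: disabled).

    Assumption: the parenthesised content is removed along with the
    parentheses themselves.
    """
    if not enabled:
        return sentence
    previous = None
    # Repeat until nothing changes so nested groups go from the inside out.
    while previous != sentence:
        previous = sentence
        sentence = _PAREN_GROUP.sub("", sentence)
    # Collapse any doubled whitespace left behind by the removal.
    return re.sub(r"\s{2,}", " ", sentence).strip()


example = (
    "On the European side, near the point of the peninsula (Sarayburnu), "
    "there was a Thracian settlement during the early 1st millennium BCE."
)
print(remove_parentheses(example, enabled=True))
```

In this sketch an unbalanced parenthesis is left in place, so such a sentence would still be rejected later if parentheses are not in allowed_symbols_regex.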