common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Feature Request - Removal of parentheses. #198

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 1 year ago

Parentheses are problematic for Common Voice use:

  1. They disrupt the natural flow of speech, because they are not used in everyday speech. When read from a text, they may be voiced as "open parenthesis" and "close parenthesis", which invalidates the recordings in CV.
  2. Wiki resources often use parentheses to give the original wording in the original language, how it is pronounced, etc. This text is mostly in another alphabet, and its presence removes the sentence from the pool of possible sentences.

Here are some examples of the second point, taken from the Istanbul article:

The city was founded as Byzantium (Greek: Βυζάντιον, Byzantion) in the 7th century BCE by Greek settlers from Megara.

In English the stress is on the first or last syllable, but in Turkish it is on the second syllable (-tan-).

On the European side, near the point of the peninsula (Sarayburnu), there was a Thracian settlement during the early 1st millennium BCE.

If you allow parentheses in the allowed_symbols_regex, the first problem arises.

Parentheses should not be in the CV text corpora, and removing them would leave more candidates for the random selection (rule-related, shortened sentences, lesser-known words blacklisted, etc.), which is especially important for lower-resource languages. I expect a 10-20% increase in the raw resource (to be measured).

So I propose (and have started to implement) a boolean rule "remove_parentheses", default false.
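
To illustrate the intended effect, here is a minimal Python sketch of what such a rule could do to a sentence. It is not the extractor's implementation; the function name only mirrors the proposed rule name, and the choice to also drop the whitespace before the opening parenthesis is an assumption.

```python
import re

# Hypothetical helper for illustration only; not the extractor's actual code.
# Matches one innermost "(...)" group plus any whitespace directly before it.
_PAREN_GROUP = re.compile(r"\s*\([^()]*\)")


def remove_parentheses(sentence: str) -> str:
    """Remove parenthesized segments, including nested ones."""
    previous = None
    # Repeat until nothing changes so that nested groups are peeled off
    # from the inside out.
    while previous != sentence:
        previous = sentence
        sentence = _PAREN_GROUP.sub("", sentence)
    # Collapse any leftover runs of whitespace and trim the ends.
    return re.sub(r"\s{2,}", " ", sentence).strip()


examples = [
    "The city was founded as Byzantium (Greek: Βυζάντιον, Byzantion) in the "
    "7th century BCE by Greek settlers from Megara.",
    "In English the stress is on the first or last syllable, but in Turkish "
    "it is on the second syllable (-tan-).",
    "On the European side, near the point of the peninsula (Sarayburnu), "
    "there was a Thracian settlement during the early 1st millennium BCE.",
]

for text in examples:
    print(remove_parentheses(text))
```

With the parenthesized parts gone, the three example sentences contain only Latin characters and no parentheses, so they no longer fail on those grounds.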

HarikalarKutusu commented 1 year ago

This one is handled by #199 with a more general approach, namely the use of a user-defined bracket list.
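
As a rough idea of what a bracket-list approach can look like, here is a minimal Python sketch. The function name, the list-of-pairs format, and the single-character restriction are all assumptions made for illustration; the actual configuration format and behavior are defined in #199 and may differ.

```python
import re
from typing import List, Tuple


def remove_bracketed(sentence: str, brackets: List[Tuple[str, str]]) -> str:
    """Remove text enclosed by each (open, close) pair in `brackets`.

    Hypothetical helper: assumes every opener and closer is a single character.
    """
    for open_ch, close_ch in brackets:
        o, c = re.escape(open_ch), re.escape(close_ch)
        # One innermost group for this pair, plus whitespace directly before it.
        pattern = re.compile(rf"\s*{o}[^{o}{c}]*{c}")
        previous = None
        # Repeat so nested groups of the same pair are removed as well.
        while previous != sentence:
            previous = sentence
            sentence = pattern.sub("", sentence)
    # Collapse leftover whitespace runs and trim the ends.
    return re.sub(r"\s{2,}", " ", sentence).strip()


pairs = [("(", ")"), ("[", "]")]
print(remove_bracketed("Ankara [1] is the capital (since 1923) of Turkey.", pairs))
```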