common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Add Initial rules and blocklist for Igbo (ig) #160

Closed chrisemezue closed 2 years ago

chrisemezue commented 2 years ago

How many sentences did you get at the end?

After a successful run, some tweaks based on the original outputs, and help from @MichaelKohler and @stefangrotz, this code generates 2165 sentences, which showed good accuracy upon review.

Here are all the steps I took.

How did you create the blocklist file?

I followed the guidelines in the README and filtered out words with a frequency of 20 or less. A frequency threshold of 80 reduced our sentence count to ~400. We checked and found that a threshold of 20 also gave us good sentences. I did not tweak the ig.toml file much.
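As a rough illustration of the frequency-based filtering described here, the sketch below counts how often each word occurs and puts every word at or below a threshold into the blocklist. It is not the script referenced in the README: `build_blocklist`, the whitespace tokenization, and the toy corpus are assumptions made for illustration.

```python
import string
import unicodedata
from collections import Counter


def build_blocklist(sentences, max_frequency=20):
    """Return the sorted words that occur max_frequency times or fewer."""
    counts = Counter()
    for sentence in sentences:
        # NFC keeps dotted vowels such as "ụ" as single code points.
        normalized = unicodedata.normalize("NFC", sentence)
        for token in normalized.split():
            # Assumption: words are whitespace-separated; strip ASCII punctuation.
            word = token.strip(string.punctuation).lower()
            if word:
                counts[word] += 1
    return sorted(w for w, c in counts.items() if c <= max_frequency)


# Toy corpus (hypothetical); a real run would read the extracted sentences.
corpus = ["a b b", "b c c c"]
print(build_blocklist(corpus, max_frequency=1))
```

Raising the threshold blocks more words and therefore leaves fewer sentences, which is consistent with the drop to ~400 sentences seen at a threshold of 80.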

Results from the reviews

A spreadsheet of 800 random sentences has been reviewed. The results of the review are here.

MichaelKohler commented 2 years ago

I had a quick look at the review document. Are the following words easily pronounceable by most people? Right now the error rate is 0%, which to be honest I find very astonishing. Because some words are hard to pronounce, among other reasons, we allow a certain percentage of errors. I haven't seen 0 yet though :)

Also, I've noticed some Chinese characters: Akpaala okwu ahụ bụ 沉魚落雁,閉月羞花. I think this is a good example of why allowed_symbols should generally be preferred. Listing every Chinese character as a disallowed symbol would not work out well.
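The difference can be sketched in a few lines: with an allow-list, any sentence containing a character outside the expected alphabet is rejected without having to name that character in advance. The character set below is an illustrative assumption, not the contents of ig.toml, and `uses_only_allowed_symbols` is a hypothetical helper rather than part of the extractor.

```python
import unicodedata

# Assumed allow-list: ASCII letters, common Igbo letters with dots or tone
# marks, and basic punctuation. A real rules file may differ.
ALLOWED = set(
    unicodedata.normalize(
        "NFC",
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "ịỊọỌụỤṅṄàáèéìíòóùúÀÁÈÉÌÍÒÓÙÚ"
        " .,?!'-",
    )
) | {"\u0300", "\u0301"}  # combining grave and acute tone marks


def uses_only_allowed_symbols(sentence):
    """True if every character of the NFC-normalized sentence is allowed."""
    normalized = unicodedata.normalize("NFC", sentence)
    return all(ch in ALLOWED for ch in normalized)


print(uses_only_allowed_symbols("Dọkịta nke Nkà Ihe Ọmụma"))
print(uses_only_allowed_symbols("Akpaala okwu ahụ bụ 沉魚落雁,閉月羞花."))
```

The second sentence is rejected even though no Chinese character was ever listed, which is the advantage over a disallow-list.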

chrisemezue commented 2 years ago

Finally, the reviews are in. See them here. @MichaelKohler @stefangrotz

chrisemezue commented 2 years ago

I'm so sorry, I mistakenly closed it. I added this comment to reopen it.

MichaelKohler commented 2 years ago

Thanks for the review. I'm a bit confused though, as all sentences say "OK" but there are errors reported by the reviewers on the right side of the sheet. How did that work?

For example, on row 16 there is a sentence with "(Ph.D)" which should not be included as valid sentence as it's an abbreviation. Did that count as an error?

chrisemezue commented 2 years ago

> Thanks for the review. I'm a bit confused though, as all sentences say "OK" but there are errors reported by the reviewers on the right side of the sheet. How did that work?
>
> For example, on row 16 there is a sentence with "(Ph.D)" which should not be included as valid sentence as it's an abbreviation. Did that count as an error?

  1. The explanation for sentences marked OK but with comments is this: Igbo has extensive code-switching and spelling variation, meaning that a word can have many acceptable spellings. For example, coma written with all English letters is acceptable, and so is koma. Another example is Nigeria, which also appears as Naigeria and Naijiria. So the comment is just the reviewer highlighting that other variants exist. That's probably why the sentence was left as OK: the original spelling is acceptable.
  2. For the example Dọkịta nke Nkà Ihe Ọmụma (Ph.D.) mmemme bụ isi site na nyocha. (which, as far as I checked, is the only example with an abbreviation), Dọkịta nke Nkà Ihe Ọmụma is actually the Igbo meaning of PhD. So it's like saying World Health Organization (WHO), where you give the full meaning and then the abbreviation. Also, I understood that abbreviations are not accepted because of the ambiguity in pronouncing them (for example ICE, which could be I.C.E or ice). But in Igbo, abbreviations like PhD are always pronounced letter by letter. Nevertheless, I shall correct the review to count the PhD sentence as an error.
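
For reference, sentences like the one in point 2 could be flagged automatically with a small pattern check. This is only a sketch: the regular expression and the `contains_abbreviation` helper are assumptions for illustration, not the extractor's actual rule, and the pattern only covers all-caps acronyms and dotted forms (an undotted mixed-case form such as PhD is not caught).

```python
import re
import unicodedata

ABBREVIATION = re.compile(
    r"\b[A-Z]{2,}\b"                            # acronyms such as WHO or ICE
    r"|\b(?:[A-Za-z]{1,3}\.){2,}"               # dotted forms such as Ph.D. or I.C.E.
    r"|\b(?:[A-Za-z]{1,3}\.)+[A-Za-z]{1,3}\b"   # dotted forms without a final dot: Ph.D
)


def contains_abbreviation(sentence):
    """True if the sentence contains something that looks like an abbreviation."""
    normalized = unicodedata.normalize("NFC", sentence)
    return ABBREVIATION.search(normalized) is not None


print(contains_abbreviation(
    "Dọkịta nke Nkà Ihe Ọmụma (Ph.D.) mmemme bụ isi site na nyocha."
))
```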

MichaelKohler commented 2 years ago

Understood. I think this is okay then. Thanks for all the effort!