common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

Add Bengali bn #169

Closed arijitx closed 2 years ago

arijitx commented 2 years ago

Adds basic rules for Bengali. I updated the segmenter to use NLTK's PunktSentenceTokenizer; since NLTK has no pretrained Punkt model for Bengali, I used '।' as the sentence-end character.
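To illustrate the idea of treating '।' as a sentence boundary, here is a minimal standard-library sketch. It is not the NLTK Punkt code from this PR: it swaps in a plain regex split, and the function name `split_sentences` and the inclusion of '?' and '!' as extra end characters are assumptions made for the example.

```python
import re
from typing import List

# Assumption: the PR names only '।' as the sentence end; '?' and '!' are
# added here purely for illustration.
SENTENCE_END_CHARS = "।?!"

# Split on whitespace that directly follows a sentence-ending character.
# The lookbehind keeps the ending character attached to its sentence.
_SPLIT_RE = re.compile(rf"(?<=[{re.escape(SENTENCE_END_CHARS)}])\s+")


def split_sentences(text: str) -> List[str]:
    """Hypothetical helper: split text into sentences at '।', '?' or '!'."""
    return [part.strip() for part in _SPLIT_RE.split(text) if part.strip()]


if __name__ == "__main__":
    sample = "আমি ভাত খাই। তুমি কোথায় যাও? সে আসবে।"
    for sentence in split_sentences(sample):
        print(sentence)
```

Unlike Punkt, this sketch has no notion of abbreviations or learned collocations, so it only shows where the boundary character fits in.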

How many sentences did you get at the end? ~ 129000

How did you create the blacklist file? I followed the instructions in the README and chose words with a frequency of 50 or less.
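The frequency cutoff can be sketched roughly as follows. This is only an illustration of the thresholding step, not the tooling the README describes: the function name `build_blacklist` and the whitespace tokenization are assumptions.

```python
from collections import Counter
from typing import Iterable, List


def build_blacklist(sentences: Iterable[str], max_frequency: int = 50) -> List[str]:
    """Hypothetical helper: return words seen at most `max_frequency` times.

    Assumption: words are separated by whitespace; the real tooling may
    tokenize differently.
    """
    counts = Counter(word for sentence in sentences for word in sentence.split())
    return sorted(word for word, n in counts.items() if n <= max_frequency)


if __name__ == "__main__":
    corpus = ["a b", "a c", "a b"]
    # With a cutoff of 1, only words that occur once are blacklisted.
    print(build_blacklist(corpus, max_frequency=1))
```

Rare words are more likely to be typos, names, or foreign terms, which is why a low-frequency cutoff serves as a simple filter.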

Evaluate 500 random sentences? 500 random sentences. I do not know any linguists, so any Bengali speaker, please feel free to comment on the error ratio. My own analysis shows very few errors in the extracted sentences.

MichaelKohler commented 2 years ago

Thanks for working on this. I will have a few comments, will have a closer look later tonight or tomorrow.

Regarding linguists, would be great, but it's fine if it's "just" native speakers. For the amount, it should be more than 100 sentences though. I would suggest 500. Once I have added my comments, there is some work to be done, so I would suggest to wait with the review until then.

arijitx commented 2 years ago

Sure @MichaelKohler, waiting for your comments.

arijitx commented 2 years ago

Thanks for your quick reviews @MichaelKohler. Do let me know if I am good to start the error rate review process.

arijitx commented 2 years ago

Thanks @MichaelKohler, I have downloaded the sample, shuffled it randomly, and taken 500 sentences for further review.
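A shuffle-and-take-500 step like this could look as follows. It is a sketch under assumptions: the function name `sample_for_review` and the optional `seed` parameter are hypothetical, and the actual sample file and its format are not shown in this thread.

```python
import random
from typing import List, Optional


def sample_for_review(
    sentences: List[str], k: int = 500, seed: Optional[int] = None
) -> List[str]:
    """Hypothetical helper: draw up to `k` distinct sentences at random.

    Passing a seed makes the draw reproducible, so every reviewer can be
    given the same sample.
    """
    rng = random.Random(seed)
    return rng.sample(sentences, min(k, len(sentences)))


if __name__ == "__main__":
    extracted = [f"sentence {i}" for i in range(1000)]
    review_set = sample_for_review(extracted, k=500, seed=42)
    print(len(review_set))
```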

arijitx commented 2 years ago

Hi @MichaelKohler,

Reviews from 3 native speakers are complete. Please share your feedback.

Review 1:

Correct rate: 95% https://docs.google.com/spreadsheets/d/1BJFLOq3L1gDYnHfMVSfUH_r43HS2HspDfrT0Gv3DREw/edit?usp=sharing

Review 2:

Correct rate: 96% https://docs.google.com/spreadsheets/d/1XPjdZzyzqrXSCwTJvAtD6sK3L-00i5CERz-QBTk0MfQ/edit?usp=sharing

Review 3:

Correct rate: 94% https://docs.google.com/spreadsheets/d/1yuvAY_vGgjnZ__lJwBjNMYx3yeGC8iFHIOjqddWBYRQ/edit?usp=sharing
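For a single summary figure, the three reported rates can be averaged. The snippet below is a small sketch using only the percentages listed above; the function name `summarize_reviews` is hypothetical.

```python
from statistics import mean
from typing import Dict, List


def summarize_reviews(correct_rates: List[float]) -> Dict[str, float]:
    """Hypothetical helper: average correct rate and the implied error rate."""
    average_correct = mean(correct_rates)
    return {
        "average_correct_rate": average_correct,
        "average_error_rate": 100 - average_correct,
    }


if __name__ == "__main__":
    # Correct rates (in percent) reported by the three reviewers above.
    print(summarize_reviews([95, 96, 94]))
```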

MichaelKohler commented 2 years ago

Now that you have the reviews, did you notice a pattern in the wrong sentences? Anything that could easily be filtered out? If so, it would be great to add that and do another review. If not, also ok :)

arijitx commented 2 years ago

Hi @MichaelKohler, from my analysis, most of the errors are incomplete or partial sentences or phrases. There aren't very many of them, and they were probably written that way because they made sense in the context of the article. I don't see a clear way to filter out such sentences.