Closed (arijitx closed 2 years ago)
Thanks for working on this. I will have a few comments and will take a closer look later tonight or tomorrow.
Regarding linguists: that would be great, but it's fine if it's "just" native speakers. The sample should be more than 100 sentences, though; I would suggest 500. Once I have added my comments there will be some work to do, so I would suggest waiting with the review until then.
Sure @MichaelKohler, waiting for your comments.
Thanks for your quick reviews @MichaelKohler. Do let me know if I am good to start the error rate review process.
Thanks @MichaelKohler, I have downloaded the sample, shuffled it randomly, and taken 500 sentences for further review.
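For anyone repeating the review, the shuffle-and-take-500 step can be sketched as below. This is only an illustration, not the exact commands used here: the function name `sample_for_review` and the fixed seed are assumptions, and the input is assumed to be a list with one extracted sentence per entry.

```python
import random


def sample_for_review(sentences, sample_size=500, seed=0):
    """Return up to `sample_size` sentences drawn at random without replacement.

    `seed` is a hypothetical parameter so that reviewers can reproduce
    the same sample; it is not part of the original workflow.
    """
    rng = random.Random(seed)
    if len(sentences) <= sample_size:
        # Not enough sentences to sample from: return them all, shuffled.
        shuffled = list(sentences)
        rng.shuffle(shuffled)
        return shuffled
    return rng.sample(sentences, sample_size)
```

Fixing the seed means all three reviewers could be handed the same 500 sentences, which keeps their correct rates comparable.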
Hi @MichaelKohler,
Reviews from 3 native speakers are completed, please share your feedback.
Correct rate: 95% https://docs.google.com/spreadsheets/d/1BJFLOq3L1gDYnHfMVSfUH_r43HS2HspDfrT0Gv3DREw/edit?usp=sharing
Correct rate: 96% https://docs.google.com/spreadsheets/d/1XPjdZzyzqrXSCwTJvAtD6sK3L-00i5CERz-QBTk0MfQ/edit?usp=sharing
Correct rate: 94% https://docs.google.com/spreadsheets/d/1yuvAY_vGgjnZ__lJwBjNMYx3yeGC8iFHIOjqddWBYRQ/edit?usp=sharing
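As a quick sanity check on the three reviews, the figures can be combined as in the sketch below. It assumes each reviewer looked at the same 500 sentences and that a plain average is an acceptable summary; the function name `summarize_reviews` is made up for illustration.

```python
def summarize_reviews(correct_rates_percent, sample_size):
    """Average the per-reviewer correct rates and derive the error rate.

    Returns (mean correct rate in %, error rate in %, approximate number
    of wrong sentences in a sample of `sample_size`).
    """
    mean_correct = sum(correct_rates_percent) / len(correct_rates_percent)
    error_rate = 100 - mean_correct
    wrong_sentences = round(sample_size * error_rate / 100)
    return mean_correct, error_rate, wrong_sentences


# The three correct rates reported in this thread, on 500 sentences.
mean_correct, error_rate, wrong = summarize_reviews([95, 96, 94], 500)
# mean_correct == 95.0, error_rate == 5.0, wrong == 25
```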
Now that you have the reviews, did you notice a pattern in the wrong sentences? Anything that could easily be filtered out? If so, it would be great to add that and do another review. If not, also ok :)
Hi @MichaelKohler, from my analysis most of the errors are incomplete or partial sentences and phrases. There isn't a very high number of them; the source text is written that way, and they probably made sense in the context of the article. I don't see a clear way to filter out such sentences.
Basic rules for Bengali. Updated the segmenter to use NLTK's PunktSentenceTokenizer; since NLTK has no pretrained Punkt model for Bengali, I used '।' as the sentence-end character.
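To show the idea of splitting on '।', here is a minimal standard-library sketch. It is deliberately not the NLTK-based segmenter from this PR: it swaps in a plain regular expression, and both the function name `split_bengali_sentences` and the choice to also break on '?' and '!' are assumptions made for illustration.

```python
import re

# Break after a danda ('।'), question mark, or exclamation mark that is
# followed by whitespace. The lookbehind keeps the punctuation attached
# to the sentence it ends.
_SENTENCE_BREAK = re.compile(r"(?<=[।?!])\s+")


def split_bengali_sentences(text):
    """Split `text` into sentences, dropping empty fragments."""
    parts = _SENTENCE_BREAK.split(text.strip())
    return [part for part in parts if part]


sample = "আমি ভাত খাই। তুমি কেমন আছো? আমি ভালো আছি।"
sentences = split_bengali_sentences(sample)
# sentences has three entries, each ending in its own punctuation mark.
```

A regex like this knows nothing about abbreviations or quoted speech, so it is only useful for checking what the danda rule does on a small sample.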
How many sentences did you get at the end? ~129,000
How did you create the blacklist file? Followed the instructions in the README and chose words with a frequency of 50 or less.
Evaluate 500 random sentences? Took 500 random sentences. I do not know any linguists, so any Bengali speaker, please feel free to comment on the error ratio. My analysis shows very few errors in the extracted sentences.
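For reference, the "frequency of 50 or less" rule for the blacklist can be sketched as below. This is not the script the README points to, only a small stand-in: the function name `rare_words` is hypothetical, and splitting on whitespace is an assumed, simplistic tokenization.

```python
from collections import Counter


def rare_words(sentences, max_frequency=50):
    """Return the sorted list of words seen at most `max_frequency` times."""
    counts = Counter(
        word
        for sentence in sentences
        for word in sentence.split()  # assumption: whitespace tokenization
    )
    return sorted(word for word, count in counts.items() if count <= max_frequency)
```

Writing the returned words one per line would give a file in the shape of a blacklist; a lower threshold keeps more sentences but lets more rare or misspelled words through.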