common-voice / cv-sentence-extractor

Scraping Wikipedia for fair use sentences

max_characters in rules #182

Closed HarikalarKutusu closed 1 year ago

HarikalarKutusu commented 1 year ago

I'm trying to make this work with Turkish. I see the following:

min_trimmed_length = 
min_word_count = 
max_word_count = 
min_characters = 

If I'm not mistaken, there is no max_characters setting. Like German, Turkish has words of highly variable length, due to the agglutinative nature of the language. As a result, even a 5-6 word sentence can be quite long to read.

I've also been planning to change the sentence-collector rules to use character length instead of word count, but I can see that the setting is missing here.

If this is true, can this be added?

MichaelKohler commented 1 year ago

That sounds reasonable to implement. I'd suggest implementing it the same way as min_characters, including a test for it and documentation in the README.

Thanks for bringing this up.
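
For readers following along, here is a minimal sketch of what such a check could look like. It is not the code from the repository or from #183: the `Rules` struct, the `char_count` and `within_character_limits` helpers, and the convention that a `max_characters` of 0 means "no upper limit" are all assumptions made for illustration. The real min_characters check may also count characters differently (for example, only alphabetic ones).

```rust
// Hypothetical subset of the rule settings discussed in this thread;
// the real struct in cv-sentence-extractor may look different.
struct Rules {
    min_characters: usize,
    max_characters: usize, // assumption: 0 means "no upper limit"
}

// Count Unicode scalar values rather than bytes, so that Turkish letters
// such as 'ç', 'ğ' or 'ş' count as one character each.
fn char_count(sentence: &str) -> usize {
    sentence.trim().chars().count()
}

// Returns true when the trimmed sentence satisfies both character bounds.
fn within_character_limits(rules: &Rules, sentence: &str) -> bool {
    let count = char_count(sentence);
    if count < rules.min_characters {
        return false;
    }
    // A max_characters of 0 disables the upper bound in this sketch.
    rules.max_characters == 0 || count <= rules.max_characters
}

fn main() {
    let rules = Rules { min_characters: 3, max_characters: 20 };
    // A single agglutinated word that is well over 20 characters long.
    let long_word = "Çekoslovakyalılaştıramadık";
    println!("{}", within_character_limits(&rules, "Merhaba dünya"));
    println!("{}", within_character_limits(&rules, long_word));
}
```

The second call shows the case raised in this issue: the input passes any reasonable max_word_count (it is one word) but is still rejected by the character limit.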

HarikalarKutusu commented 1 year ago

OK, I'll do it tomorrow on a clean clone (I've been messing with the current one)...

HarikalarKutusu commented 1 year ago

Implemented with #183