eklem / stopword-sami

Sami stopword lists for natural language processing. Example uses include search engines, machine learning, and chatbots.
MIT License

stopword-trainer counting a bit off? #26

Open eklem opened 2 years ago

eklem commented 2 years ago

The word kulturhistorisk is found three times in the document corpus: once in one document and twice in another.

But in the calculation file it's only listed as found once over the whole corpus: https://github.com/eklem/stopword-sami/blob/trunk/stopwords/stopword-sma-calculation.json#L11127-L11133

This is possibly an upstream error in stopword-trainer. A quick fix might be to delete the calculation file and regenerate it from scratch?
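
To double-check the count independently of stopword-trainer, something like the following could be run over the corpus. This is only a minimal sketch: the tokenization (lowercase, runs of letters) is an assumption and may differ from what stopword-trainer does, `count_term` is a hypothetical helper, and the three sample documents are made-up stand-ins for the real corpus.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase, then keep runs of Unicode letters.
    # stopword-trainer may split words differently.
    return re.findall(r"[^\W\d_]+", text.lower())


def count_term(docs, term):
    """Return (total occurrences, number of documents containing the term)."""
    per_doc = [Counter(tokenize(doc))[term] for doc in docs]
    total = sum(per_doc)
    doc_freq = sum(1 for count in per_doc if count > 0)
    return total, doc_freq


# Hypothetical stand-in documents; the real corpus is not reproduced here.
docs = [
    "en kulturhistorisk tekst",
    "kulturhistorisk museum og kulturhistorisk arkiv",
    "noe helt annet",
]

total, doc_freq = count_term(docs, "kulturhistorisk")
print(total, doc_freq)  # prints "3 2"
```

If the real corpus gives the same result, then neither the total count (3) nor the document count (2) matches the 1 in the calculation file, which would suggest this is more than a mix-up between the two kinds of count.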