Ezhil-Language-Foundation / open-tamil

Open Source Tamil NLP Tools - தமிழ் இயற்கை மொழி பகுப்பாய்வு நிரல்தொகுப்பு
http://tamilpesu.us
MIT License

Corpus word set for Solthiruthi #206

Open arcturusannamalai opened 4 years ago

arcturusannamalai commented 4 years ago

Build the word set from these open datasets: 1) https://www.kaggle.com/disisbig/tamil-wikipedia-articles 2) https://www.kaggle.com/disisbig/tamil-news-dataset

VpkPrasanna commented 1 year ago

Hi @arcturusannamalai, can you please elaborate on this issue? Do we need to add these datasets to our library?

arcturusannamalai commented 1 year ago

@VpkPrasanna - yes, you can use these datasets to form a valid word list for the spelling checker; the current word lists include https://github.com/Ezhil-Language-Foundation/open-tamil/blob/main/solthiruthi/data/tamilvu_dictionary_words.txt, among others.
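
A rough sketch of how such a word list could be formed from raw corpus text, using only the Python standard library. Everything here is an assumption for illustration: the function name `build_word_list`, the `min_count` threshold, and the idea that the corpus has already been read into an iterable of text lines (the Kaggle datasets' actual file layout is not described in this thread). It does not use any open-tamil API.

```python
import re
import unicodedata
from collections import Counter

# Runs of Tamil letters and combining signs (U+0B82-U+0BD7).
# This range deliberately leaves out Tamil digits and symbols (U+0BE6-U+0BFA).
TAMIL_WORD = re.compile(r"[\u0B82-\u0BD7]+")


def build_word_list(lines, min_count=2):
    """Return a sorted list of Tamil tokens seen at least `min_count` times.

    `lines` is any iterable of strings (hypothetical input: text already
    extracted from a corpus). `min_count` is an illustrative filter meant
    to drop one-off typos; the right threshold depends on the corpus.
    """
    counts = Counter()
    for line in lines:
        # NFC so that visually identical words compare equal.
        line = unicodedata.normalize("NFC", line)
        counts.update(TAMIL_WORD.findall(line))
    return sorted(word for word, n in counts.items() if n >= min_count)


if __name__ == "__main__":
    sample = ["வணக்கம் உலகம் hello 123", "வணக்கம், தமிழ்!", "தமிழ் ௧௨"]
    for word in build_word_list(sample, min_count=2):
        print(word)
```

The resulting words could then be written one per line to a UTF-8 text file, assuming that matches the layout of the existing dictionary files. Raw frequency filtering will still let common misspellings through, so a list built this way would likely need review before the spelling checker relies on it.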

VpkPrasanna commented 1 year ago

So I have to add the new datasets to the same file, right?