Open tshrinivasan opened 3 years ago
Download the entire website using the tool HTTrack. It will save all the HTML files locally.
Then parse the HTML files to remove all English letters and symbols, keeping only the Tamil content.
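A minimal sketch of the parsing step, assuming the mirrored pages are UTF-8 HTML and that "Tamil content" means runs of characters in the Tamil Unicode block. The names `TextExtractor` and `tamil_words` are hypothetical, and the range U+0B80 to U+0BD7 is my assumption: it covers Tamil letters and vowel signs while leaving out Tamil digits and symbols (U+0BE6 onward).

```python
import re
from html.parser import HTMLParser

# Assumption: Tamil letters and vowel signs only (U+0B80..U+0BD7).
# Tamil digits and symbols (U+0BE6..U+0BFA) are deliberately excluded.
TAMIL_WORD = re.compile(r"[\u0B80-\u0BD7]+")


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def tamil_words(html):
    """Return every run of Tamil characters found in the page text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return TAMIL_WORD.findall(" ".join(parser.chunks))


if __name__ == "__main__":
    sample = "<p>Hello வணக்கம் 123, உலகம்!</p>"
    print(tamil_words(sample))
```

To process a whole HTTrack mirror, one could walk the download directory, read each `.html` file, and collect the results of `tamil_words` into a set to remove duplicates.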
@tshrinivasan Is this task still open? I would like to contribute to it.
I parsed the websites above and extracted the Tamil words. Here is the link: https://github.com/Thenmozhi295/tamil_words
To create a good spellchecker, we need a collection of good words. We can use them as they are for quick lookup using a Bloom filter.
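A rough sketch of what such a lookup could look like, written in plain Python rather than with any particular Bloom filter library. The class name `BloomFilter`, its parameters, and the double-hashing scheme over SHA-256 are illustrative assumptions, not something specified in this issue. A Bloom filter never misses a word that was added, but it can occasionally report a word as present when it is not.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter for word lookup (hypothetical design)."""

    def __init__(self, expected_items, false_positive_rate=0.01):
        # Standard sizing formulas for bit count (m) and hash count (k).
        ln2 = math.log(2)
        self.size = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (ln2 ** 2)))
        self.num_hashes = max(1, round(self.size / expected_items * ln2))
        self.bits = bytearray((self.size + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.size

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(word))


if __name__ == "__main__":
    words = BloomFilter(expected_items=1000)
    for w in ["வணக்கம்", "உலகம்", "தமிழ்"]:
        words.add(w)
    print("தமிழ்" in words)
```

A word that fails the membership test is certainly not in the collection, so it can be flagged as a possible misspelling; a word that passes is only probably in it.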
Scrape the sites below, clean them, and remove English text, numbers, and symbols.
https://writerpara.com/ https://padhaakai.com https://solvanam.com/ https://jeyamohan.in/ https://www.sramakrishnan.com/ https://amuttu.net/ http://charuonline.com/blog/ https://komalimedai.blogspot.com/ http://www.akaramuthala.in/
Get the items below for each site