KaniyamFoundation / ProjectIdeas

A place to write down project ideas and plan them

Scrape the websites with good Tamil words #152

Open tshrinivasan opened 3 years ago

tshrinivasan commented 3 years ago

To create a good spellchecker, we need a collection of good words. We can use them as-is for quick lookup with a Bloom filter.
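A minimal sketch of the Bloom filter lookup idea, using only the Python standard library. The class name `BloomFilter`, the bit-array size, the number of hashes, and the sample words are all assumptions made for illustration; a real word list would need the size tuned to its word count.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: never a false negative, occasionally a false positive."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # Both sizes are illustrative assumptions, not tuned values.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, word):
        # Double hashing: derive every bit position from two 64-bit
        # halves of a single SHA-256 digest of the UTF-8 encoded word.
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, word):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(word)
        )


# Hypothetical sample words standing in for the scraped word list.
known_words = BloomFilter()
for w in ["தமிழ்", "சொல்", "நூல்"]:
    known_words.add(w)

print("தமிழ்" in known_words)
```

A word that was added is always reported as present. A word that was never added is usually reported as absent, but can occasionally be a false positive, so a spellchecker built this way would accept a small fraction of misspellings.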

Scrape the sites below, clean the text, and remove English, numbers, and symbols.

https://writerpara.com/
https://padhaakai.com
https://solvanam.com/
https://jeyamohan.in/
https://www.sramakrishnan.com/
https://amuttu.net/
http://charuonline.com/blog/
https://komalimedai.blogspot.com/
http://www.akaramuthala.in/

Get the items below for each site:

  1. individual words
  2. individual words with frequency
  3. bigrams
  4. trigrams
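
One possible way to produce the four items above, sketched with `collections.Counter`. The helper `ngrams` and the two sample sentences are hypothetical; the sketch assumes the text has already been cleaned into whitespace-separated Tamil words, one sentence per string.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-token tuples found in a token sequence."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Hypothetical cleaned input: Tamil words separated by spaces.
sentences = [
    "நான் தமிழ் நூல் படித்தேன்",
    "நான் தமிழ் பாடல் கேட்டேன்",
]

word_freq = Counter()
bigram_freq = Counter()
trigram_freq = Counter()

for sentence in sentences:
    tokens = sentence.split()
    word_freq.update(tokens)
    # Build n-grams per sentence so they never span a sentence boundary.
    bigram_freq.update(ngrams(tokens, 2))
    trigram_freq.update(ngrams(tokens, 3))

unique_words = sorted(word_freq)  # item 1: individual words

print(word_freq.most_common(2))    # item 2: words with frequency
print(bigram_freq.most_common(1))  # item 3: bigrams
print(len(trigram_freq))           # item 4: number of distinct trigrams
```

Counting n-grams within each sentence is a design choice in this sketch; counting across the whole page would also pair the last word of one sentence with the first word of the next.
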
tshrinivasan commented 3 years ago

Download the entire website using the tool HTTrack. It will save all the HTML files locally.

Then parse the HTML files to remove all English letters and symbols, keeping only the Tamil content.
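
A rough sketch of the parsing step, using the standard library's `html.parser` and a regular expression over the Tamil Unicode block. The names `TextExtractor` and `tamil_words` and the sample HTML are assumptions for illustration. The character range is also an assumption: it keeps Tamil letters, vowel signs, and the virama, and leaves out Tamil digits and symbols (U+0BE6 onward).

```python
import re
from html.parser import HTMLParser

# Tamil letters and signs: U+0B82..U+0BCD plus the AU length mark U+0BD7.
# Tamil digits and symbols (U+0BE6 onward) are deliberately left out.
TAMIL_WORD = re.compile(r"[\u0B82-\u0BCD\u0BD7]+")


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def tamil_words(html):
    """Return the Tamil words found in the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return TAMIL_WORD.findall(" ".join(parser.parts))


# Hypothetical sample page mixing English, digits, symbols, and Tamil.
sample = (
    "<html><style>p{color:red}</style><body>"
    "<p>Hello தமிழ் 123 சொல்!</p></body></html>"
)
print(tamil_words(sample))  # ['தமிழ்', 'சொல்']
```

To process a downloaded site, each saved HTML file could be read as text and passed to `tamil_words`; the encoding of each file would need to be checked first.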

velram commented 3 years ago

@tshrinivasan Is this task still open? I would like to contribute to it.

Thenmozhi295 commented 3 years ago

I parsed the websites above and extracted the Tamil words. Here is the link: https://github.com/Thenmozhi295/tamil_words