KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Create Tamil blogs content data set #178

Open Natkeeran opened 2 years ago

Natkeeran commented 2 years ago

There are thousands of Tamil blogs, and most are no longer active. Crawling them is a rich way to build a dataset and also to preserve the blog content.

Loop through each publicly available blog and get a JSON representation of each post (including its metadata). Then post-process these into one large CSV.
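A rough sketch of that loop in Python, for Blogspot-hosted blogs only. It assumes each blog exposes the Blogger JSON feed at `/feeds/posts/default?alt=json` and that entries use the GData-style `{"$t": ...}` text nodes; both should be checked against a real blog before relying on them. The function names, the CSV column set, and the `|` separator for labels are made up for illustration.

```python
import csv
import json
import urllib.request

# Hypothetical column set for the combined CSV.
FIELDS = ["blog", "url", "title", "author", "published", "updated", "labels", "content"]


def text(node):
    """Return the "$t" text of a GData-style JSON node, or "" if absent."""
    return node.get("$t", "") if isinstance(node, dict) else ""


def entry_to_row(blog, entry):
    """Flatten one feed entry (a dict) into a flat dict suitable for CSV."""
    links = entry.get("link", [])
    # The human-readable post URL is assumed to be the rel="alternate" link.
    url = next((l.get("href", "") for l in links if l.get("rel") == "alternate"), "")
    authors = entry.get("author", [])
    return {
        "blog": blog,
        "url": url,
        "title": text(entry.get("title")),
        "author": text(authors[0].get("name")) if authors else "",
        "published": text(entry.get("published")),
        "updated": text(entry.get("updated")),
        "labels": "|".join(c.get("term", "") for c in entry.get("category", [])),
        "content": text(entry.get("content")) or text(entry.get("summary")),
    }


def fetch_entries(blog, page_size=50):
    """Yield every entry of one blog's feed, page by page (needs network access)."""
    start = 1
    while True:
        feed_url = (
            f"{blog}/feeds/posts/default?alt=json"
            f"&start-index={start}&max-results={page_size}"
        )
        with urllib.request.urlopen(feed_url, timeout=30) as resp:
            entries = json.load(resp).get("feed", {}).get("entry", [])
        if not entries:
            return
        yield from entries
        start += len(entries)


def write_csv(rows, out):
    """Write flattened rows to an open text file object as CSV with a header."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```

To build the dataset, one would open the output file with `newline=""` and `encoding="utf-8"`, then pass `write_csv` the rows from `entry_to_row(blog, entry)` for every `entry` in `fetch_entries(blog)` across the list of blogs. Blogs hosted elsewhere (WordPress and so on) would need their own fetcher, but could reuse the same row format.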

It would be good if the associated media files within each blog's domain could also be downloaded, but this is optional.

http://tamilpoint.blogspot.com/p/tamil-blogs.html

khaleeljageer commented 2 years ago

What about the license of each individual blog?

Natkeeran commented 2 years ago

@khaleeljageer

The same arguments that Common Crawl or the Internet Archive would make apply here. A few ways to address this: