KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Collect and publish Terabytes of Tamil text data for LLM #198

Open tshrinivasan opened 4 months ago

tshrinivasan commented 4 months ago

To build LLMs, we need a huge volume of Tamil text data, on the order of terabytes (TBs).

So far, we don't have TBs of Tamil data available under an open license.

Many researchers spend a huge amount of time scraping websites to build LLMs. Due to copyright issues, they cannot share the data in public.

To solve this, we have to do the following:

Plan:

  1. Define the folder structure for the scraped data.
  2. Find a place to publish TBs of text data for free or cheaply (S3, archive.org).
  3. Find websites with Tamil text under open licenses such as Public Domain, Creative Commons, etc.

Execution:

  1. Find the openly licensed websites.
  2. Scrape the data manually or automatically, on a regular basis.
  3. Publish the data every month.

We have to do the above regularly for a few years.
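
One way to pin down the folder structure from Plan step 1 is sketched below in Python. The layout (`tamil_datasets/<license>/<source>/<YYYY-MM>`), the function name `dataset_dir`, and the choice to group by license first are all assumptions for illustration, not something agreed in this thread. The monthly folder simply mirrors the "publish the data every month" step.

```python
from pathlib import PurePosixPath


def _slug(name: str) -> str:
    """Lower-case a label and join its words with hyphens so folder names stay shell-friendly."""
    return "-".join(name.lower().split())


def dataset_dir(license_id: str, source: str, year: int, month: int) -> PurePosixPath:
    """Return a hypothetical release folder: tamil_datasets/<license>/<source>/<YYYY-MM>."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    return (
        PurePosixPath("tamil_datasets")
        / _slug(license_id)
        / _slug(source)
        / f"{year:04d}-{month:02d}"
    )


# Example: where a June 2024 Wikipedia snapshot would live under this layout.
print(dataset_dir("CC-BY-SA", "Wikipedia", 2024, 6))
```

Keeping the license as the top-level folder would let someone who can only use, say, public domain text download just that subtree.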

Many individuals have collected MBs to GBs of openly licensed data and published it on GitHub, Kaggle, etc. We can curate these and add them to the collection too.

If you are interested in contributing to this project, in any of the above activities, please comment here with your email address and the task you can work on.

khaleeljageer commented 4 months ago

Great idea.

ThangaAyyanar commented 4 months ago

https://github.com/neechalkaran/Tamil-corpus

Neechalkaran has a few Tamil corpus files.

masatheesh commented 4 months ago

Would like to contribute my time

kishore57 commented 4 months ago

I am Kishore. Let me try to contribute from my side as well.

kishorekumar5795@gmail.com

tshrinivasan commented 4 months ago

Collecting here the list of openly licensed Tamil content available so far:

  1. Wikipedia (we can download the dump and clean it)
  2. Wikinews
  3. Wikisource completed books
  4. Project Madurai https://www.projectmadurai.org/pmworks.html (75% of them are in the public domain)
  5. FreeTamilEbooks (870+ books under Creative Commons licenses; download all the epub files and convert them to text)
  6. Kaniyam.com - CC-BY-SA
  7. Vinavu.com - Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND)
  8. theekkathir.in/copyright - CC-BY-SA
  9. https://www.tamilvu.org/ta/library-libcontnt-273141 has many old books in HTML format
  10. Tamil Mann Pathippagam (தமிழ் மண் பதிப்பகம்): 1000+ books

tshrinivasan commented 4 months ago

Blogs that are under a CC license:

  1. neel48.blogspot.com
  2. gnutamil.blogspot.com
  3. http://www.badriseshadri.in/ - CC-BY
  4. https://web.archive.org/web/20210808093459/https://maattru.com/
  5. http://vinavu.com/
  6. http://anmikam4dumbme.blogspot.in/2013/06/creative-commons.html
  7. http://blog.ravidreams.net/cc-by-sa-3-0/
  8. http://poniyinselvan.blogspot.in/
  9. http://mmauran.net/blog
  10. http://tech.neechalkaran.com/p/copyleft.html
  11. http://tamilcpu.blogspot.in/
  12. http://jayabarathan.wordpress.com/
  13. http://chalkpiece.wordpress.com/
  14. http://adaleru.wordpress.com/
  15. http://blog.tamilsasi.com/
  16. http://www.tamilpoetry.com/
  17. http://www.bladepedia.com/
  18. http://www.teachersofindia.org/ta
  19. http://www.palkalaikazhakam.com/
  20. http://avanishiva.blogspot.in/
  21. http://ramanans.wordpress.com/
  22. http://www.bloggernanban.com/
  23. http://mstamil.com/
  24. http://www.saravanakumaran.com/
  25. http://www.tamilnaduthyagigal.blogspot.com/
  26. http://www.ilakkiyapayilagam.blogspot.com/
  27. http://www.kambaramayanam-thanjavooraan.blogspot.com/
  28. http://www.bharathipayilagam.blogspot.com/
  29. https://akazhonline.com/?page_id=2488
  30. kaniyam.com
  31. The complete Mahabharatham (முழு மஹாபாரதம்) http://mahabharatham.arasan.info/
  32. The Holy Bible (திருவிவிலியம்) https://ta.wikisource.org/s/m - Old Testament, New Testament
  33. https://tamil.wiki/ - only the text is under a CC license.

jabezlamech commented 4 months ago

I’m interested! My email id is jabez.lamech@gmail.com

NaveenJoshuvaDev commented 4 months ago

Can you explain how to collect it? Here you mentioned FreeTamilEbooks (870+ books are there in CreativeCommons license. download all the epub files and convert to text). What does that mean?

IngersolNorway commented 4 months ago

Add me in

tshrinivasan commented 4 months ago

Hi @NaveenJoshuvaDev, there are 850+ ebooks on FreeTamilEbooks.com. We have to download all the epub files from there and convert them to text files using pandoc.

Natkeeran commented 4 months ago

Please see the Thamizh Mann Collections text data here (~750 MB):
https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/thamizh_mann

The following catalogue also attempts to list known datasets related to Tamil: https://tamil.digital.utsc.utoronto.ca/tamil-open-data-catalogue

Also note that Sri Lankan government publications are considered exempt from copyright. That is the basis of the American Institute for Sri Lankan Studies' justification for digitizing government publications; see page 9 of the Intellectual Property Act, No. 36 of 2003. Thus, Sri Lankan government content can be collected and made accessible. If people are collecting it, consider collecting it as multilingual datasets.

NaveenJoshuvaDev commented 4 months ago

Got it, sir. I will try to do it.

tshrinivasan commented 4 months ago

https://autonlp.ai/datasets/cc100-tamil Here is 1.3 GB of Tamil text data from the Common Crawl project.

This can be used under the fair use policy. Ref: https://en.wikipedia.org/wiki/Common_Crawl

masatheesh commented 3 months ago

Downloaded the epub from https://freetamilebooks.com/ebooks/rti_2005_guide/ and ran the command below:

pandoc rti_2005_guide.epub -t plain -o rti.txt

Got the attached file, rti.txt. How do we take it forward?

5hanth commented 3 months ago

I'm interested in trying to make a SurrealML package for the LLM. Once it is packaged, it will be useful for running inference and building backendless micro-SaaS with SurrealDB.

Natkeeran commented 3 months ago

@tshrinivasan please see @masatheesh's message above. Perhaps that dataset can be added to https://github.com/KaniyamFoundation/Ebooks.

Another list of datasets that contain Tamil: https://opus.nlpl.eu/results/en&ta/corpus-result-table

tshrinivasan commented 3 months ago

Hi @masatheesh, this is fine. We have to do the same for all the books.

Here is a JSON list of all the books: https://fte.mohan43u.space/books.json

The download counter stats are here: https://fte.mohan43u.space/

Use https://fte.mohan43u.space/books.json to get the URL of each epub, download all the epub files, and convert them to text in an automated way.

Share the text files and scripts.
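
A minimal Python sketch of that automation could look like the following. It reuses the pandoc call from earlier in the thread. The structure of `books.json` is not described here, so the code assumes it is a list of records that carry the epub URL under a field named `epub`; that field name, the helper names, and the `fte/` working folder are hypothetical and would need adjusting to the real file.

```python
import json
import subprocess
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

BOOKS_JSON = "https://fte.mohan43u.space/books.json"


def epub_urls(books, key="epub"):
    """Collect epub URLs from book records; `key` is an assumed field name."""
    return [
        record[key]
        for record in books
        if isinstance(record, dict) and str(record.get(key, "")).endswith(".epub")
    ]


def pandoc_command(epub_path: Path, out_dir: Path):
    """Build the pandoc call used earlier in the thread, writing <name>.txt into out_dir."""
    return ["pandoc", str(epub_path), "-t", "plain", "-o", str(out_dir / (epub_path.stem + ".txt"))]


def convert_all(work_dir: str = "fte") -> None:
    """Download every epub listed in books.json and convert it to plain text."""
    epub_dir = Path(work_dir) / "epub"
    text_dir = Path(work_dir) / "text"
    epub_dir.mkdir(parents=True, exist_ok=True)
    text_dir.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(BOOKS_JSON) as resp:
        books = json.load(resp)  # assumed to be a list of book records
    for url in epub_urls(books):
        target = epub_dir / Path(urlparse(url).path).name
        if not target.exists():  # skip files fetched on an earlier run
            urllib.request.urlretrieve(url, target)
        subprocess.run(pandoc_command(target, text_dir), check=True)
```

Calling `convert_all()` needs network access and pandoc on the PATH. Files already downloaded are skipped, so the script can be re-run each month to pick up only the new books.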

velkadamban commented 3 weeks ago

Count me in..!

tshrinivasan commented 2 weeks ago

Collected the Tamil Wikipedia articles as one big text file and published it here: https://kaniyam.cloudns.nz/tamil_datasets/

Details are in the post "Collecting content for LLM dataset – Part 1 – Tamil wikipedia content".

tshrinivasan commented 2 weeks ago

Collected all the data from FreeTamilEbooks.com.

Details are in the post "Collecting content for LLM dataset – Part 2 – FreeTamilEbooks": https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/

The data is here: https://kaniyam.cloudns.nz/tamil_datasets/fte-books/

LLM #openData #CreativeCommons #Tamil

RaMathuZen commented 2 weeks ago

What if we translate English datasets into Tamil?

Obviously, only 60-80% would be translated into proper Tamil. But if we had a thesaurus, we could translate jargon unknown to the translator, and with tools to parse the text for grammar and the like, we could filter those datasets and get a refined dataset of >90% proper Tamil text in large amounts.

(It is just an idea, and those estimates are my guesses.)

velkadamban commented 2 weeks ago

Collected all the works below and added them to this repo (a few datasets are very large, so only links to them are added): https://github.com/velkadamban/Tamil-Corpus

  1. Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
  2. Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
  3. Oscar 23.01 Tamil Meta Data (CC BY 4.0)
  4. Project Madurai (Open to use and Distribute)
  5. Tamil Wikisource books (CC BY-SA 4.0)
  6. Tamil Mann Nationalized Books (CC BY-SA 4.0)
  7. Leipzig Corpus
  8. CC-100 Corpus
  9. Ai4Bharat (CC-0)
  10. Alpca-ora Translated for Tamil (GPL-3.0)

balajijagadesh commented 2 weeks ago

We can take all the proofread books from ta.wikisource.org, then the Tamil content from TVA and Project Madurai.

velkadamban commented 2 weeks ago

https://github.com/velkadamban/Tamil-Corpus/blob/main/Links%20to%202218%20Books%20in%20TVA

This file contains 2218 links to the nationalized Tamil book PDFs uploaded to TVA.

balajijagadesh commented 2 weeks ago

For machine learning, what is the use of image PDFs? We need Tamil text, right?

velkadamban commented 2 weeks ago

Yes, we need to OCR those PDFs. Everything is available only as PDF, so we have to run OCR and extract the text.
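
As a rough sketch of that OCR step, the Python below only builds the two command lines involved: `pdftoppm` (from poppler-utils) to render PDF pages as PNG images, and `tesseract` with its Tamil language pack (`tam`) to recognise them. The choice of these two tools, the 300 DPI setting, and the file names are assumptions for illustration; the thread does not say which OCR engine would be used.

```python
from pathlib import Path


def pdf_to_png_command(pdf: Path, out_prefix: Path, dpi: int = 300):
    """Command that renders each PDF page to <out_prefix>-<page>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_prefix)]


def ocr_command(image: Path, lang: str = "tam"):
    """Command that OCRs one page image; tesseract appends .txt to the output base itself."""
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


# Example: the commands for one book and one of its rendered pages.
print(pdf_to_png_command(Path("book.pdf"), Path("page")))
print(ocr_command(Path("page-1.png")))
```

Each list can be passed to `subprocess.run(..., check=True)`. The OCR output would still need proofreading before it goes into the collection.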

balajijagadesh commented 2 weeks ago

There will be a lot of mistakes in OCR output. It is better to take text from the proofread material on Wikisource.

velkadamban commented 2 weeks ago

I have a doubt! We are collecting text data for Tamil. But Grantha letters and words from other languages are used heavily in the data available on the internet and in many books! All Indian languages are mixed with Sanskrit, and their grammars permit it. Tamil alone, unlike them, is able to function without mixing in other languages!

To whatever extent we intend to bring Tamil into computing, we should bring it in as pure Tamil (தனித்தமிழ்)!
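
To get a feel for how much of a text uses Grantha letters, something like the Python sketch below could be run over each collected file. It counts the five Grantha consonants encoded in the Unicode Tamil block (ஜ, ஶ, ஷ, ஸ, ஹ) as a share of all Tamil letters. The function names and the idea of using this ratio as a filter are assumptions for illustration; the sketch does not detect loanwords written with ordinary Tamil letters.

```python
import unicodedata

# Grantha consonants in the Tamil Unicode block: JA, SHA, SSA, SA, HA.
GRANTHA_LETTERS = frozenset("\u0b9c\u0bb6\u0bb7\u0bb8\u0bb9")


def is_tamil_letter(ch: str) -> bool:
    """True for base letters in the Tamil block (U+0B80..U+0BFF), not vowel signs or digits."""
    return "\u0b80" <= ch <= "\u0bff" and unicodedata.category(ch) == "Lo"


def grantha_ratio(text: str) -> float:
    """Share of Tamil letters in `text` that are Grantha consonants (0.0 if there are none)."""
    letters = [ch for ch in text if is_tamil_letter(ch)]
    if not letters:
        return 0.0
    return sum(ch in GRANTHA_LETTERS for ch in letters) / len(letters)
```

Conjuncts such as க்ஷ and ஸ்ரீ are covered because they contain ஷ and ஸ respectively. A per-file ratio could be stored alongside the data so that anyone wanting text with few Grantha letters can select it without changing the published collection.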

gurulenin commented 1 week ago

Text extracted from a few Tamil theses on Shodhganga is available here. It contains 4.5 lakh (450,000) words and is licensed under CC BY-NC-SA. Shodhganga, with over 5,000 Tamil theses, may contain more than 12 lakh (1.2 million) pages and 20 crore (200 million) words. This data is useful for Tamil LLMs.

https://github.com/gurulenin/Shodhganga_tamil_thesis

tshrinivasan commented 1 week ago

https://github.com/vanangamudi/cholloadai-2021 https://archive.org/details/cholloadai-2021.txt

The first edition of the Cholloadai (சொல்லோடை) dataset. See the paper "சொல்லோடை: கற்கும்-கருவிகளுக்கு ஒரு சொற்றொடர் படையல்" ("Cholloadai: an offering of sentences for learning machines"), presented at the Tamil Internet Conference in 2021.

tshrinivasan commented 1 week ago

A few dictionaries are here:

https://github.com/vanangamudi/tharavukkanam/tree/master/tamil-etymological-dict https://github.com/vanangamudi/tharavukkanam/tree/master