KaniyamFoundation / ProjectIdeas

A Place to write down the project ideas and to plan them

Collect and publish terabytes of Tamil text data for LLM #198

Open tshrinivasan opened 9 months ago

tshrinivasan commented 9 months ago

To build LLMs, we need a huge volume of Tamil text data, in TBs.

So far we don't have TBs of Tamil data available under an open license.

Many researchers spend a huge amount of time scraping websites to build LLMs. Due to copyright issues, they cannot share the data in public.

To solve this, we have to do the following.

Plan:

  1. Define the folder structure for the scraped data.
  2. Find a place to publish TBs of text data for free or cheap (S3, archive.org).
  3. Find websites with Tamil text under open licenses such as Public Domain, Creative Commons, etc.
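
As a starting point for item 1, here is a minimal sketch of one possible folder layout. The `source/license/YYYY-MM/filename` scheme and the `dataset_path` helper are assumptions made for illustration; nothing in this issue has fixed a layout yet.

```python
from datetime import date
from pathlib import PurePosixPath


def dataset_path(source: str, license_id: str, snapshot: date, filename: str) -> PurePosixPath:
    """Build a relative path of the form source/license/YYYY-MM/filename.

    The layout is hypothetical: it groups files by where they came from,
    which license they carry, and which monthly snapshot they belong to.
    """
    return PurePosixPath(source) / license_id / f"{snapshot:%Y-%m}" / filename


# Example: a Wikipedia dump collected for the June 2024 snapshot.
example = dataset_path("wikipedia", "cc-by-sa-4.0", date(2024, 6, 1), "articles.txt")
```

Keeping the license in the path would make it easy to publish or exclude a whole license class (for example, non-commercial content) in one step.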

Execution:

  1. Find the openly licensed websites.
  2. Scrape the data manually or automatically, regularly.
  3. Publish the data every month.

We have to do the above regularly for a few years.

Many individuals have collected MBs to GBs of openly licensed data and published it on GitHub, Kaggle, etc. We can curate those and add them to the collection too.

If you are interested in contributing to this project, for any of the above activities, please comment here with your email address and the task you can work on.

khaleeljageer commented 9 months ago

Great Idea..

ThangaAyyanar commented 9 months ago

https://github.com/neechalkaran/Tamil-corpus

Neechalkaran has a few Tamil corpus files.

masatheesh commented 9 months ago

Would like to contribute my time

kishore57 commented 9 months ago

I am Kishore, Let me try to contribute from my side as well.

kishorekumar5795@gmail.com

tshrinivasan commented 9 months ago

Collecting here the list of open-license content in Tamil available so far:

  1. Wikipedia (we can download the dump and clean it)
  2. Wikinews
  3. Wikisource completed books
  4. Project Madurai https://www.projectmadurai.org/pmworks.html (75% of them are in public domain)
  5. FreeTamilEbooks (870+ books are there in CreativeCommons license. download all the epub files and convert to text)
  6. Kaniyam.com - CC-BY-SA
  7. Vinavu.com - Creative Commons Attribution-Noncommercial-No Derivative
  8. theekkathir.in/copyright- CC-BY-SA
  9. https://www.tamilvu.org/ta/library-libcontnt-273141 has many old books in HTML format
  10. Thamizh Mann Pathippagam (தமிழ் மண் பதிப்பகம்): 1000+ books

tshrinivasan commented 9 months ago

Blogs that are under a CC license

  1. neel48.blogspot.com
  2. gnutamil.blogspot.com
  3. http://www.badriseshadri.in/ - CC-BY
  4. https://web.archive.org/web/20210808093459/https://maattru.com/
  5. http://vinavu.com/
  6. http://anmikam4dumbme.blogspot.in/2013/06/creative-commons.html
  7. http://blog.ravidreams.net/cc-by-sa-3-0/
  8. http://poniyinselvan.blogspot.in/
  9. http://mmauran.net/blog
  10. http://tech.neechalkaran.com/p/copyleft.html
  11. http://tamilcpu.blogspot.in/
  12. http://jayabarathan.wordpress.com/
  13. http://chalkpiece.wordpress.com/
  14. http://adaleru.wordpress.com/
  15. http://blog.tamilsasi.com/
  16. http://www.tamilpoetry.com/
  17. http://www.bladepedia.com/
  18. http://www.teachersofindia.org/ta
  19. http://www.palkalaikazhakam.com/
  20. http://avanishiva.blogspot.in/
  21. http://ramanans.wordpress.com/
  22. http://www.bloggernanban.com/
  23. http://mstamil.com/
  24. http://www.saravanakumaran.com/
  25. http://www.tamilnaduthyagigal.blogspot.com/
  26. http://www.ilakkiyapayilagam.blogspot.com/
  27. http://www.kambaramayanam-thanjavooraan.blogspot.com/
  28. http://www.bharathipayilagam.blogspot.com/
  29. https://akazhonline.com/?page_id=2488
  30. kaniyam.com
  31. Complete Mahabharatham (முழு மஹாபாரதம்) http://mahabharatham.arasan.info/
  32. Holy Bible (திருவிவிலியம்) https://ta.wikisource.org/s/m - Old Testament, New Testament
  33. https://tamil.wiki/ - only the text is under a CC license.

jabezlamech commented 9 months ago

I’m interested! My email id is jabez.lamech@gmail.com

NaveenJoshuvaDev commented 9 months ago

How do we collect it? Can you explain? Here you said "FreeTamilEbooks (870+ books are there in CreativeCommons license. download all the epub files and convert to text)". What does that mean?

IngersolNorway commented 9 months ago

Add me in

tshrinivasan commented 8 months ago

Hi @NaveenJoshuvaDev, there are 850+ ebooks on FreeTamilEbooks.com. We have to download all the epub files from there and convert them to text files using pandoc.
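
The conversion step could be scripted roughly as follows. This is a minimal sketch, assuming pandoc is installed and on the PATH and that the epub files are already downloaded into one directory; the `-t plain` flag is the one used in a later comment in this thread, while the function names and directory arguments are hypothetical.

```python
import subprocess
from pathlib import Path


def pandoc_command(epub: Path, out_dir: Path) -> list[str]:
    """Return the pandoc invocation that converts one epub to plain text."""
    target = out_dir / (epub.stem + ".txt")
    return ["pandoc", str(epub), "-t", "plain", "-o", str(target)]


def convert_all(epub_dir: Path, out_dir: Path) -> int:
    """Convert every .epub in epub_dir; return how many conversions succeeded."""
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = 0
    for epub in sorted(epub_dir.glob("*.epub")):
        # A non-zero return code means pandoc failed on this book; skip it.
        result = subprocess.run(pandoc_command(epub, out_dir))
        if result.returncode == 0:
            converted += 1
    return converted


# Usage (directory names are placeholders):
# convert_all(Path("epubs"), Path("text"))
```

Counting successes instead of stopping at the first failure lets one bad epub be retried later without blocking the rest of the batch.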

Natkeeran commented 8 months ago

Pls see the Thamizh Mann Collections Text Data from here (~750 MB).
https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/thamizh_mann

The following catalogue also attempts to list known data sets related to Tamil. https://tamil.digital.utsc.utoronto.ca/tamil-open-data-catalogue

Also note that Sri Lankan government publications are considered exempt from copyright. That is the basis of the American Institute for Sri Lankan Studies' justification for digitizing government publications. See page 9 of the Intellectual Property Act, No. 36 of 2003. Thus, Sri Lankan government content can be collected and made accessible. If people are collecting it, consider collecting it as multilingual data sets.

NaveenJoshuvaDev commented 8 months ago

I got it, sir. I will try to do it.

tshrinivasan commented 8 months ago

https://autonlp.ai/datasets/cc100-tamil Here is 1.3 GB of Tamil text data from the Common Crawl project.

This can be used under a fair-use policy. Ref - https://en.wikipedia.org/wiki/Common_Crawl

masatheesh commented 8 months ago

Downloaded the epub from "https://freetamilebooks.com/ebooks/rti_2005_guide/" and executed the command below:

pandoc rti_2005_guide.epub -t plain -o rti.txt

Got the attached file. How do we take it forward? rti.txt

5hanth commented 8 months ago

I’m interested to try making a surrealML package for the LLM model. It will be useful to perform inference and build backendless micro SAAS with surrealDB once this is packaged.

Natkeeran commented 8 months ago

@tshrinivasan pls see @masatheesh msg above. Perhaps that dataset can be added to https://github.com/KaniyamFoundation/Ebooks.

Another list of datasets that contain Tamil: https://opus.nlpl.eu/results/en&ta/corpus-result-table

tshrinivasan commented 8 months ago

Hi @masatheesh, this is fine. We have to do this for all the books.

https://fte.mohan43u.space/books.json Here is a JSON list of all the books.

Counter stats are here - https://fte.mohan43u.space/

Use https://fte.mohan43u.space/books.json to get the URLs of the epubs, download all the epub files, and convert them to text in an automated way.

Share the text files and scripts.
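
For the "get the URL of epub" step, here is a minimal sketch. The schema of books.json is not described in this thread, so instead of assuming field names the code walks the whole JSON document and keeps every string that looks like an http(s) URL ending in `.epub`; the function names are hypothetical.

```python
import json
from typing import Any, Iterator


def iter_epub_urls(node: Any) -> Iterator[str]:
    """Yield every http(s) URL ending in .epub found anywhere in parsed JSON."""
    if isinstance(node, str):
        if node.startswith(("http://", "https://")) and node.lower().endswith(".epub"):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_epub_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_epub_urls(item)


def epub_urls(json_text: str) -> list[str]:
    """Return the unique epub URLs in json_text, in first-seen order."""
    return list(dict.fromkeys(iter_epub_urls(json.loads(json_text))))
```

Each returned URL can then be fetched with a standard downloader (for example `urllib.request.urlretrieve` from the Python standard library) before running the pandoc conversion on the saved files.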

velkadamban commented 5 months ago

Count me in..!

tshrinivasan commented 5 months ago

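Collected Wikipedia articles as one big text file and published it here: https://kaniyam.cloudns.nz/tamil_datasets/

Details are here: Collecting content for LLM dataset – Part 1 – Tamil Wikipedia content https://goinggnu.wordpress.com/2024/06/11/collecting-content-for-llm-dataset-part-1-tamil-wikipedia-content/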

tshrinivasan commented 5 months ago

Collected all data from FreeTamilEbooks.com

details are here

Collecting content for LLM dataset – Part 2 – FreeTamilEbooks

https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/

data is here https://kaniyam.cloudns.nz/tamil_datasets/fte-books/

RaMathuZen commented 5 months ago

What if we translate English datasets to Tamil?

Obviously, only 60-80% would be translated into proper Tamil. But if we had a thesaurus, we could translate jargon unknown to the translator, and parse the text for grammar and other issues. That is, with the right tools we could filter those datasets and get refined, >90% proper Tamil text data in large amounts.

(It is just an idea and those estimates are my guesses)

velkadamban commented 5 months ago

Collected and added all the works below to this repo (a few datasets are very large, so I added links to them instead): https://github.com/velkadamban/Tamil-Corpus

  1. Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
  2. Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
  3. Oscar 23.01 Tamil Meta Data (CC BY 4.0)
  4. Project Madurai (Open to use and Distribute)
  5. Tamil Wikisource books (CC BY-SA 4.0)
  6. Tamil Mann Nationalized Books (CC BY-SA 4.0)
  7. Leipzig Corpus
  8. CC-100 Corpus
  9. Ai4Bharat (CC-0)
  10. Alpca-ora Translated for Tamil (GPL-3.0)

balajijagadesh commented 5 months ago

We can take all the proofread books from ta.wikisource.org, and then the Tamil content from TVA and Project Madurai.

velkadamban commented 5 months ago

https://github.com/velkadamban/Tamil-Corpus/blob/main/Links%20to%202218%20Books%20in%20TVA

This file contains 2218 links to Nationalized Tamil Book PDFs uploaded to TVA.

balajijagadesh commented 5 months ago

For machine learning, what is the use of image PDFs? We need Tamil text, right?

velkadamban commented 5 months ago

Yeah, we need to OCR those PDFs. Everything is available only as PDF. We have to OCR them and extract the text...

balajijagadesh commented 5 months ago

There will be a lot of mistakes in OCR. Better to take text from proofread material in Wikisource.

velkadamban commented 5 months ago

I have a doubt! We are collecting text data for Tamil... But the use of Grantha letters and of words from other languages is abundant in the data available on the internet and in many books! All Indian languages have Sanskrit mixed in, and their grammar permits it. But only Tamil has the ability to function without mixing in other languages, doesn't it?

To whatever extent we intend to bring Tamil into computing, to that extent we should bring it in as pure Tamil (Thanittamizh)!
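
One way to put a number on this concern is to measure how much of a text uses Grantha letters. A minimal sketch, assuming the five Grantha consonants commonly used in Tamil script (ஜ, ஶ, ஷ, ஸ, ஹ) are the ones of interest; the function names are hypothetical and no particular threshold is proposed in this thread.

```python
# Grantha consonants encoded in the Unicode Tamil block:
# U+0B9C (ஜ), U+0BB6 (ஶ), U+0BB7 (ஷ), U+0BB8 (ஸ), U+0BB9 (ஹ).
GRANTHA_LETTERS = frozenset("\u0b9c\u0bb6\u0bb7\u0bb8\u0bb9")


def is_tamil_char(ch: str) -> bool:
    """True if ch lies in the Unicode Tamil block (U+0B80 to U+0BFF)."""
    return "\u0b80" <= ch <= "\u0bff"


def grantha_ratio(text: str) -> float:
    """Fraction of Tamil-block characters in text that are Grantha letters.

    Returns 0.0 when the text contains no Tamil-block characters at all.
    """
    tamil_total = sum(1 for ch in text if is_tamil_char(ch))
    if tamil_total == 0:
        return 0.0
    grantha_total = sum(1 for ch in text if ch in GRANTHA_LETTERS)
    return grantha_total / tamil_total
```

A collection script could record this ratio per document so that datasets can later be filtered or labeled, without discarding any text at collection time. Note that it only detects Grantha letters, not loanwords written with native Tamil letters.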

gurulenin commented 5 months ago

Text extracted from a few Tamil theses on Shodhganga is available at the link below. It contains 4.5 lakh words and is licensed under CC-BY-SA-NC. Shodhganga, with over 5,000 Tamil theses, may contain more than 12 lakh pages and 20 crore words. This data is useful for Tamil LLMs.

https://github.com/gurulenin/Shodhganga_tamil_thesis

tshrinivasan commented 5 months ago

https://github.com/vanangamudi/cholloadai-2021 https://archive.org/details/cholloadai-2021.txt

First edition of the Cholloadai (சொல்லோடை) dataset. See the paper "சொல்லோடை: கற்கும்-கருவிகளுக்கு ஒரு சொற்றொடர் படையல்" ("Cholloadai: an offering of sentences for learning machines"), released at the Tamil Internet Conference in 2021.

tshrinivasan commented 5 months ago

A few dictionaries are here:

https://github.com/vanangamudi/tharavukkanam/tree/master/tamil-etymological-dict
https://github.com/vanangamudi/tharavukkanam/tree/master

IngersolNorway commented 4 months ago

Hi Gurulenin,

Thank you for providing the thesis data. We've attempted this numerous times without success, but you've finally made it happen. However, the data only includes 10 theses. Could you help us obtain the rest of the Tamil papers or guide us on how to do it ourselves?

We've only found chapters in PDF format, while you have Unicode text. Could you tell me how you accomplished this? We would appreciate your assistance if getting the data in PDF is also an option.

Thank you.

Regards,

Ingersol Selvaraj http://ingersol.no/personal.html

gurulenin commented 4 months ago

Hi,

These theses were obtained directly from research scholars, who provided them in Word format. Some theses were in the Bamini font. I converted them to Unicode and uploaded them here.

vishnumur777 commented 3 months ago

I would like to contribute.

varunushamurali@gmail.com

vishnumur777 commented 3 months ago

Greetings, Srinivasan sir,

Just as we are building a large language model (LLM) from Tamil text data, we should also build an audio-based large language model (LLM), sir. To collect data for that, please create a new issue on GitHub, sir. There, people can read the text data aloud, record it, and upload it in audio file formats such as mp3, flac, or ogg, sir. The Telangana government has already done this for the Telugu language in Telangana state, with the help of the public.

kamalaak commented 2 months ago

I'm excited to contribute to this project. I am available to assist with the automatic data scraping task.

Email Address: kamaleshpeast@gmail.com

tshrinivasan commented 2 days ago

Here are my blog posts on collecting openly licensed data for Tamil LLM work.

part 1 – https://goinggnu.wordpress.com/2024/06/11/collecting-content-for-llm-dataset-part-1-tamil-wikipedia-content/

part 2 – https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/

part 3 - https://goinggnu.wordpress.com/2024/11/23/collecting-content-for-llm-dataset-part-3-thamizh_mann-books-project-madurai-wikisource/

Get all the data from https://kaniyam.cloudns.nz/tamil_datasets/