Open tshrinivasan opened 4 months ago
Great Idea..
https://github.com/neechalkaran/Tamil-corpus
Neechalkaran has a few Tamil corpus files.
Would like to contribute my time
I am Kishore, Let me try to contribute from my side as well.
Collecting here the list of open license content in Tamil, so far available
Blogs that are in CC license
I’m interested! My email id is jabez.lamech@gmail.com
How do we collect it? Can you explain? You said here: FreeTamilEbooks (870+ books are there under a Creative Commons license; download all the epub files and convert them to text). What does that mean?
Add me in
Hi @NaveenJoshuvaDev, there are 850+ ebooks on FreeTamilEbooks.com. We have to download all the epub files from there and convert them to plain-text format using the pandoc software.
Please see the Thamizh Mann Collections text data here (~750 MB).
https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/thamizh_mann
The following catalogue also attempts to list known datasets related to Tamil: https://tamil.digital.utsc.utoronto.ca/tamil-open-data-catalogue
Also note that Sri Lankan government publications are considered exempt from copyright. That is the basis of the American Institute for Sri Lankan Studies' justification for digitizing government publications. Note the details in the Intellectual Property Act, No. 36 of 2003, here (page 9). Thus, Sri Lankan government content can be collected and made accessible. If people are collecting it, consider collecting it as multilingual datasets.
I got it sir will try to do it
https://autonlp.ai/datasets/cc100-tamil Here is 1.3 GB of Tamil text data from the Common Crawl project.
It can be used under fair use policy. Ref: https://en.wikipedia.org/wiki/Common_Crawl
Downloaded the epub "https://freetamilebooks.com/ebooks/rti_2005_guide/" and executed the command below:
pandoc rti_2005_guide.epub -t plain -o rti.txt
Got the attached file (rti.txt). How to take it forward?
I'm interested in trying to make a SurrealML package for the LLM model. It will be useful for performing inference and building backendless micro-SaaS apps with SurrealDB once this is packaged.
@tshrinivasan pls see @masatheesh msg above. Perhaps that dataset can be added to https://github.com/KaniyamFoundation/Ebooks.
Another list of datasets that contain Tamil: https://opus.nlpl.eu/results/en&ta/corpus-result-table
Hi @masatheesh, this is fine. We have to do this for all the books.
Here is a JSON list of all the books: https://fte.mohan43u.space/books.json
Counter stats are here: https://fte.mohan43u.space/
Use https://fte.mohan43u.space/books.json to get the epub URLs, download all the epub files, and convert them to text in an automated way.
Share the text files and scripts.
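A minimal Python sketch of that automated pipeline, assuming pandoc is installed and that each record in books.json exposes its epub URL under an `"epub"` key; the real field name should be checked against the actual JSON and adjusted:

```python
import json
import subprocess
import urllib.request
from pathlib import Path

BOOKS_JSON = "https://fte.mohan43u.space/books.json"

def epub_links(books):
    """Collect epub URLs from the parsed books.json list.
    NOTE: the 'epub' field name is an assumption; inspect the
    real JSON and adjust the key accordingly."""
    return [b["epub"] for b in books if b.get("epub")]

def convert_all(out_dir="fte-texts"):
    """Download every epub and convert it to plain text with pandoc."""
    Path(out_dir).mkdir(exist_ok=True)
    with urllib.request.urlopen(BOOKS_JSON) as resp:
        books = json.load(resp)
    for url in epub_links(books):
        epub = Path(out_dir) / url.rsplit("/", 1)[-1]
        urllib.request.urlretrieve(url, epub)
        # same conversion as the manual step: pandoc book.epub -t plain -o book.txt
        subprocess.run(["pandoc", str(epub), "-t", "plain",
                        "-o", str(epub.with_suffix(".txt"))], check=True)
```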
Count me in..!
Collected the Tamil Wikipedia articles as one big text file and published it here: https://kaniyam.cloudns.nz/tamil_datasets/
Details are here: Collecting content for LLM dataset – Part 1 – Tamil wikipedia content
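For anyone repeating this, a rough sketch of fetching the Tamil Wikipedia dump: the URL follows Wikimedia's standard pages-articles dump pattern, and the extraction step assumes the third-party `wikiextractor` package is installed:

```python
import subprocess
import urllib.request

def dump_url(wiki="tawiki", date="latest"):
    """Standard Wikimedia URL for a pages-articles dump."""
    return (f"https://dumps.wikimedia.org/{wiki}/{date}/"
            f"{wiki}-{date}-pages-articles.xml.bz2")

def fetch_and_extract(out_dir="tawiki-text"):
    dump = "tawiki-latest-pages-articles.xml.bz2"
    urllib.request.urlretrieve(dump_url(), dump)
    # wikiextractor (pip install wikiextractor) turns the XML dump
    # into plain-text article files under out_dir.
    subprocess.run(["python", "-m", "wikiextractor.WikiExtractor",
                    dump, "-o", out_dir], check=True)
```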
Collected all the data from FreeTamilEbooks.com.
Details are here:
Collecting content for LLM dataset – Part 2 – FreeTamilEbooks
https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/
https://github.com/KaniyamFoundation/ProjectIdeas/issues/198
data is here https://kaniyam.cloudns.nz/tamil_datasets/fte-books/
What if we translate English datasets to Tamil?
Obviously, only 60-80% would be translated into proper Tamil. But if we had a thesaurus, we could translate jargon unknown to the translator, parse the text for grammar, and so on. That is, if we have the tools, we can filter those datasets and get refined, >90% Tamil text data in large amounts.
(It is just an idea, and those estimates are my guesses.)
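The filtering step of that idea can be sketched with a simple heuristic: keep only lines whose letters fall mostly inside the Unicode Tamil block (U+0B80 to U+0BFF). The 0.9 threshold mirrors the ">90%" estimate above and is only a guess:

```python
def tamil_ratio(text):
    """Fraction of alphabetic characters that are in the
    Unicode Tamil block (U+0B80-U+0BFF)."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    tamil = [c for c in letters if "\u0b80" <= c <= "\u0bff"]
    return len(tamil) / len(letters)

def filter_lines(lines, threshold=0.9):
    """Drop lines that are not mostly Tamil (e.g. untranslated
    or badly translated output)."""
    return [line for line in lines if tamil_ratio(line) >= threshold]
```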
Collected and added all the works below to this repo (a few datasets are very large, so I added links to them instead): https://github.com/velkadamban/Tamil-Corpus

- Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
- Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
- OSCAR 23.01 Tamil metadata (CC BY 4.0)
- Project Madurai (open to use and distribute)
- Tamil Wikisource books (CC BY-SA 4.0)
- Tamil Mann Nationalized Books (CC BY-SA 4.0)
- Leipzig Corpus
- CC-100 Corpus
- AI4Bharat (CC0)
- Alpaca-ora translated for Tamil (GPL-3.0)
We can take all the proofread books from ta.wikisource.org, then the Tamil content from TVA and Project Madurai.
https://github.com/velkadamban/Tamil-Corpus/blob/main/Links%20to%202218%20Books%20in%20TVA
This file contains 2,218 links to nationalized Tamil book PDFs uploaded to TVA.
For machine learning, what is the use of image PDFs? We need Tamil text, right?
Yeah, we need to OCR those PDFs. Everything is only in PDF format; we have to OCR them and extract the text.
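A sketch of that OCR step, assuming the Tesseract engine with its Tamil (`tam`) traineddata and the poppler utilities are installed, plus the `pdf2image` and `pytesseract` packages from pip (imported lazily so the pure helper works on its own):

```python
def join_pages(pages):
    """Merge per-page OCR output into one text blob,
    dropping blank pages (scanner blanks are common)."""
    return "\n\n".join(p.strip() for p in pages if p.strip())

def ocr_pdf(path, dpi=300):
    """Render each PDF page to an image and OCR it as Tamil."""
    # Third-party deps: pip install pdf2image pytesseract
    # (plus poppler and tesseract with 'tam' language data).
    from pdf2image import convert_from_path
    import pytesseract
    images = convert_from_path(path, dpi=dpi)
    return join_pages(pytesseract.image_to_string(img, lang="tam")
                      for img in images)
```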
There will be a lot of mistakes in OCR output. It is better to take text from the proofread material on Wikisource.
I have one doubt! We are collecting Tamil text data, but the use of Grantha letters and of words from other languages is heavy in the data available on the internet and in many books. All Indian languages have a Sanskrit mix, and their grammars permit it; but Tamil alone has the capability of functioning without mixing in other languages!
To whatever extent we intend to bring Tamil into computing, to that extent we should bring it in pure Tamil (thanith thamizh)!
Text extracted from a few Tamil theses on Shodhganga is available here. It contains 4.5 lakh (450,000) words and is licensed under CC BY-NC-SA. Shodhganga, with over 5,000 Tamil theses, may contain more than 12 lakh (1.2 million) pages and 20 crore (200 million) words. This data is useful for Tamil LLMs.
https://github.com/vanangamudi/cholloadai-2021 https://archive.org/details/cholloadai-2021.txt
This is the first edition of the Cholloadai dataset. See the paper "சொல்லோடை: கற்கும்-கருவிகளுக்கு ஒரு சொற்றொடர் படையல்" (Cholloadai: a phrase offering for learning tools), presented at the 2021 Tamil Internet Conference.
To build LLMs, we need a huge volume of Tamil text data, in terabytes (TBs).
So far we don't have TBs of data available in Tamil with an open license.
Many researchers spend a huge amount of time scraping websites to try to build LLMs. Due to copyright issues, they cannot share that data publicly.
To solve this, we have to do the things below.
Plan:
Execution:
We have to keep doing the above regularly for a few years.
There are also many individuals who have collected MBs to GBs of open-licensed data and published it on GitHub, Kaggle, etc. We can curate those and add them to the collection too.
If you are interested in contributing to this project, for any of the above activities, please comment here with your email address and the task you can work on.