tshrinivasan opened this issue 9 months ago
Great idea!
https://github.com/neechalkaran/Tamil-corpus
Neechalkaran has a few Tamil corpus files.
Would like to contribute my time.
I am Kishore. Let me try to contribute from my side as well.
Collecting here the list of openly licensed Tamil content available so far:
Blogs that are under a CC license
I’m interested! My email id is jabez.lamech@gmail.com
How do we collect it? Can you explain? Here you said "FreeTamilEbooks (870+ books are there in Creative Commons license. Download all the epub files and convert to text)". What does that mean?
Add me in
Hi @NaveenJoshuvaDev, there are 850+ ebooks on FreeTamilEbooks.com. We have to download all the epub files from there and convert them to text format using the pandoc software.
Please see the Thamizh Mann Collections text data here (~750 MB).
https://github.com/Digital-Tamil-Studies/open_tamil_texts/tree/master/collections/thamizh_mann
The following catalogue also attempts to list known datasets related to Tamil: https://tamil.digital.utsc.utoronto.ca/tamil-open-data-catalogue
Also note that Sri Lankan government publications are considered exempt from copyright. That is the basis of the American Institute for Sri Lankan Studies' justification for digitizing government publications. See the details in the Intellectual Property Act, No. 36 of 2003, page 9. Thus, Sri Lankan government content can be collected and made accessible. If people are collecting, consider collecting them as multilingual datasets.
I got it, sir. I will try to do it.
https://autonlp.ai/datasets/cc100-tamil Here is 1.3 GB of Tamil text data from the Common Crawl project.
It can be used under fair-use policy. Ref: https://en.wikipedia.org/wiki/Common_Crawl
Downloaded the epub from "https://freetamilebooks.com/ebooks/rti_2005_guide/" and executed the command below:
pandoc rti_2005_guide.epub -t plain -o rti.txt
Got the attached file (rti.txt). How do we take it forward?
I'm interested in trying to build a SurrealML package for the LLM model. Once packaged, it will be useful for performing inference and building backendless micro-SaaS apps with SurrealDB.
@tshrinivasan please see @masatheesh's message above. Perhaps that dataset can be added to https://github.com/KaniyamFoundation/Ebooks.
Another list of datasets that contain Tamil: https://opus.nlpl.eu/results/en&ta/corpus-result-table
Hi @masatheesh, this is fine. We have to do this for all the books.
https://fte.mohan43u.space/books.json Here is a JSON list of all the books.
Counter stats are here: https://fte.mohan43u.space/
Use https://fte.mohan43u.space/books.json to get the epub URLs, download all the epub files, and convert them to text in an automated way (see the sketch below).
Share the text files and scripts.
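A minimal sketch of such a script, assuming books.json is a JSON array whose entries carry the epub download URL under a key such as "epub" (verify the real field name against the file before running):

```python
# Hedged sketch: download every epub listed in books.json and convert
# each one to plain text with pandoc (same invocation as used above).
import json
import pathlib
import subprocess
import urllib.request

BOOKS_JSON = "https://fte.mohan43u.space/books.json"
out_dir = pathlib.Path("fte_texts")
out_dir.mkdir(exist_ok=True)

with urllib.request.urlopen(BOOKS_JSON) as resp:
    books = json.load(resp)

for book in books:
    epub_url = book.get("epub")  # hypothetical key name
    if not epub_url:
        continue
    epub_path = out_dir / epub_url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(epub_url, epub_path)
    subprocess.run(["pandoc", str(epub_path), "-t", "plain",
                    "-o", str(epub_path.with_suffix(".txt"))], check=True)
```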
Count me in..!
Collected Tamil Wikipedia articles as one big text file and published it here: https://kaniyam.cloudns.nz/tamil_datasets/
Details are here: Collecting content for LLM dataset – Part 1 – Tamil Wikipedia content
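For reference, one assumed way to reproduce such a pipeline (the linked Part 1 post describes the actual method used) is to pull the public tawiki dump and run the wikiextractor tool over it:

```python
# Hedged sketch: download the Tamil Wikipedia dump and extract plain text.
# Assumes wget and the wikiextractor package (pip install wikiextractor).
import subprocess

DUMP_URL = ("https://dumps.wikimedia.org/tawiki/latest/"
            "tawiki-latest-pages-articles.xml.bz2")

# -c resumes a partial download if re-run
subprocess.run(["wget", "-c", DUMP_URL], check=True)

# Writes extracted article text under ./tawiki_text/
subprocess.run(["python", "-m", "wikiextractor.WikiExtractor",
                "tawiki-latest-pages-articles.xml.bz2",
                "-o", "tawiki_text"], check=True)
```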
Collected all the data from FreeTamilEbooks.com.
Details are here:
Collecting content for LLM dataset – Part 2 – FreeTamilEbooks
https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/
https://github.com/KaniyamFoundation/ProjectIdeas/issues/198
The data is here: https://kaniyam.cloudns.nz/tamil_datasets/fte-books/
What if we translate English datasets to Tamil?
Obviously, only 60-80% would be translated into proper Tamil. But if we had a thesaurus, we could translate jargon unknown to the translator, parse the text for grammar, and so on. That is, given the right tools, we could filter those datasets and get refined (>90% proper) Tamil text data in large amounts; a rough translation sketch follows below.
(It is just an idea, and those estimates are my guesses.)
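As an illustration of the idea, here is a minimal machine-translation sketch using the publicly available NLLB model via Hugging Face transformers; the model choice is an assumption, not part of the original proposal, and any thesaurus/grammar filtering step would come after this:

```python
# Hedged sketch: translate English sentences to Tamil with NLLB.
# Assumes the transformers library and the
# facebook/nllb-200-distilled-600M checkpoint.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",   # NLLB language code for English
    tgt_lang="tam_Taml",   # NLLB language code for Tamil
)

result = translator("Open data helps everyone build better language models.")
print(result[0]["translation_text"])
```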
Collected and added all these works in this repo (a few datasets are very large, so only links to them are added): https://github.com/velkadamban/Tamil-Corpus
Tamil Wikipedia articles up to 01.06.2024 (CC BY-SA 4.0)
Charles University English-Tamil Parallel Corpus (CC BY-NC-SA 3.0)
OSCAR 23.01 Tamil metadata (CC BY 4.0)
Project Madurai (open to use and distribute)
Tamil Wikisource books (CC BY-SA 4.0)
Tamil Mann nationalized books (CC BY-SA 4.0)
Leipzig Corpus
CC-100 Corpus
AI4Bharat (CC0)
Alpaca-ora translated for Tamil (GPL-3.0)
We can take all the proofread books from ta.wikisource.org, then Tamil content from TVA and Project Madurai.
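A minimal sketch for pulling plain text from ta.wikisource.org via the MediaWiki API, assuming the TextExtracts extension ("prop=extracts") is enabled on that wiki; the page title below is only a placeholder:

```python
# Hedged sketch: fetch the plain text of one ta.wikisource.org page.
import requests

API = "https://ta.wikisource.org/w/api.php"
params = {
    "action": "query",
    "prop": "extracts",
    "explaintext": 1,
    "format": "json",
    "titles": "திருக்குறள்",  # placeholder page title
}
data = requests.get(API, params=params, timeout=30).json()
for page in data["query"]["pages"].values():
    print(page.get("extract", ""))
```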
https://github.com/velkadamban/Tamil-Corpus/blob/main/Links%20to%202218%20Books%20in%20TVA
This file contains 2,218 links to nationalized Tamil book PDFs uploaded to TVA.
For machine learning, what is the use of image PDFs? We need Tamil text, right?
Yeah, we need to OCR those PDFs. Everything is in PDF only. We have to OCR them and extract the text.
There will be a lot of mistakes in OCR. Better to take text from proofread material in Wikisource.
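For anyone attempting the OCR route despite the error rate, here is a minimal sketch using pdf2image and pytesseract with the Tamil language pack. It assumes poppler and tesseract with the "tam" traineddata are installed, and the output will still need proofreading:

```python
# Hedged sketch: OCR a scanned Tamil PDF page by page.
# Requires poppler (for pdf2image) and tesseract-ocr with "tam" data.
from pdf2image import convert_from_path
import pytesseract

pages = convert_from_path("book.pdf", dpi=300)
with open("book.txt", "w", encoding="utf-8") as out:
    for page in pages:
        text = pytesseract.image_to_string(page, lang="tam")
        out.write(text + "\n")
```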
I have a doubt! We are collecting text data for Tamil, but the use of Grantha letters and words from other languages is abundant in the data available on the internet and in many books! All Indian languages have Sanskrit admixture, and their grammar permits it! But Tamil alone has the ability to function without mixing in other languages!
To whatever extent we intend to bring Tamil into computing, to that same extent we should bring it in pure Tamil!
Text extracted from a few Tamil theses on Shodhganga is available here: https://github.com/gurulenin/Shodhganga_tamil_thesis. It contains 4.5 lakh words and is licensed under CC BY-NC-SA. Shodhganga, with over 5,000 Tamil theses, may contain more than 12 lakh pages and 20 crore words. This data is useful for Tamil LLMs.
https://github.com/vanangamudi/cholloadai-2021 https://archive.org/details/cholloadai-2021.txt
The first release of the Cholloadai (சொல்லோடை) dataset. See the paper "சொல்லோடை: கற்கும்-கருவிகளுக்கு ஒரு சொற்றொடர் படையல்" (Cholloadai: a sentence offering for learning tools), presented at the 2021 Tamil Internet Conference.
Hi Gurulenin,
Thank you for providing the thesis data. We've attempted this numerous times without success, but you've finally made it happen. However, the data only includes 10 theses. Could you help us obtain the rest of the Tamil papers or guide us on how to do it ourselves?
We've only found chapters in PDF format, while you have Unicode text. Could you tell me how you accomplished this? We would appreciate your assistance if getting the data in PDF is also an option.
Thank you.
Regards,
Ingersol Selvaraj http://ingersol.no/personal.html
Hi,
These theses were obtained directly from research scholars, who provided them in Word format. Some theses were in the Bamini font; I converted them to Unicode format and uploaded them here.
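For others facing the same legacy-font problem, here is a rough sketch of the usual table-driven approach. The mapping table below is a hypothetical empty stub, since a real converter needs the complete Bamini glyph table; libraries such as open-tamil ship ready-made converters that would be a better starting point:

```python
# Hedged sketch: legacy-font to Unicode conversion by longest-match
# substitution. BAMINI_MAP is a hypothetical stub; fill it with the
# full Bamini glyph -> Unicode Tamil table before use.
BAMINI_MAP: dict[str, str] = {
    # "glyph sequence in Bamini": "Unicode Tamil",  (stub entries only)
}

def bamini_to_unicode(text: str) -> str:
    # Replace longer glyph sequences first so multi-char ligatures win.
    for src in sorted(BAMINI_MAP, key=len, reverse=True):
        text = text.replace(src, BAMINI_MAP[src])
    return text
```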
I would like to contribute.
varunushamurali@gmail.com
Greetings, Srinivasan sir,
Just as we are building a large language model (LLM) from Tamil text data, we should also build a speech LLM. To collect data for it, please create a new issue on GitHub. There, contributors can read the text data aloud, record it, and upload the recordings in audio formats such as mp3, flac, or ogg. The Telangana state government is currently doing this for the Telugu language through public participation.
I'm excited to contribute to this project. I am available to assist with the automatic data scraping task.
Email Address: kamaleshpeast@gmail.com
Here are my blog posts on collecting openly licensed data for Tamil LLM work.
Part 2 – https://goinggnu.wordpress.com/2024/06/16/collecting-content-for-llm-dataset-part-2-freetamilebooks/
Get all the data from https://kaniyam.cloudns.nz/tamil_datasets/
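A hedged one-step sketch for mirroring that whole directory tree, assuming wget is installed and the server allows recursive listing:

```python
# Hedged sketch: mirror the tamil_datasets directory with wget.
import subprocess

subprocess.run([
    "wget", "--recursive", "--no-parent", "--no-host-directories",
    "--continue", "https://kaniyam.cloudns.nz/tamil_datasets/",
], check=True)
```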
To build LLMs, we need a huge volume of Tamil text data, in terabytes.
So far, we do not have TBs of openly licensed Tamil data available.
Many researchers spend huge amounts of time scraping websites to build LLMs. Due to copyright issues, they cannot share that data publicly.
To solve this, we have to do the following things.
Plan:
Execution:
We have to do the above regularly for a few years.
There are many individuals who have collected many MBs to GBs of openly licensed data and published it on GitHub, Kaggle, etc. We can curate those and add them to the collection too.
If you are interested in contributing to this project, for any of the above activities, please comment here with your email address and the task you can work on.