cltk / cltk

The Classical Language Toolkit
http://cltk.org
MIT License

Scraping data from sanskrit-documents. #211

Closed coderbhupendra closed 8 years ago

coderbhupendra commented 8 years ago

I'm scraping the Rig Veda for Sanskrit sentences. It contains 191 documents, and I have made parallel documents from them. The results can be seen here: dataset, and this is the code for the scraper.

Now I want to implement one thing. Open this page: Rig Veda Book 1 Hymn 21. On it there is first a poem in Sanskrit and then one in English. Can someone suggest a good method to count just the number of lines in Sanskrit?

mineshmathew commented 8 years ago

@coderbhupendra the number of Sanskrit lines can be counted easily with a regex that checks whether a line starts with a character in the Devanagari Unicode block (U+0900–U+097F).
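
A minimal sketch of that regex check in Python might look like the following. The function name `count_sanskrit_lines` is hypothetical, and skipping leading whitespace before the first character is an assumption, not something stated above.

```python
import re

# Devanagari Unicode block: U+0900 to U+097F.
# Assumption: leading whitespace is skipped before testing the first character.
DEVANAGARI_START = re.compile(r"^\s*[\u0900-\u097F]")


def count_sanskrit_lines(text):
    """Count lines whose first non-blank character is in the Devanagari block."""
    return sum(1 for line in text.splitlines() if DEVANAGARI_START.match(line))


if __name__ == "__main__":
    # Illustrative input only: one Devanagari line, one transliterated line.
    sample = "अग्निमीळे पुरोहितं\nagnim īḷe purohitaṃ\n"
    print(count_sanskrit_lines(sample))
```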

coderbhupendra commented 8 years ago

thanks @mineshmathew will implement this :)

mineshmathew commented 8 years ago

@coderbhupendra the text below the Sanskrit hymn is a transliteration, not an English translation, right? What is the point of collecting a parallel corpus of transliterations? If you intend to have a parallel corpus for machine translation, you ought to have English translations, not transliterations.

coderbhupendra commented 8 years ago

@mineshmathew yes, you are correct. I am not taking the transliteration as a translation. For every Sanskrit doc there is also a corresponding English doc; see the "English" link at the top of the page to get its translation.

coderbhupendra commented 8 years ago

@mineshmathew for instance, in some documents the number of lines is not equal in the English and Sanskrit versions, so I need to discard them: doc 1074, sanskrit version; doc 1142, sanskrit version; doc 1187, sanskrit version. And in some cases the sentences are not completely written: http://sacred-texts.com/hin/rvsan/rv01078.htm
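
As a rough sketch of that discard step, assuming each document has already been split into a list of Sanskrit lines and a list of English lines (the function name `filter_aligned` and the tuple layout are hypothetical, not taken from the scraper):

```python
def filter_aligned(docs):
    """Split documents into those whose line counts match and those to discard.

    docs: iterable of (doc_id, sanskrit_lines, english_lines) tuples.
    Returns (kept, dropped_ids).
    """
    kept, dropped_ids = [], []
    for doc_id, sanskrit_lines, english_lines in docs:
        # Keep a document only if both sides are non-empty and equal in length.
        if sanskrit_lines and len(sanskrit_lines) == len(english_lines):
            kept.append((doc_id, sanskrit_lines, english_lines))
        else:
            dropped_ids.append(doc_id)
    return kept, dropped_ids
```

Returning the dropped IDs alongside the kept documents makes it easy to review by hand which pages were discarded.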

mineshmathew commented 8 years ago

@coderbhupendra ok. I didn't see the English translations. But can we do an alignment easily, especially when it is not translated sentence by sentence? Also, earlier I mentioned that you can filter by checking whether the first character in a line falls in the Devanagari Unicode block. This might not work in all cases, say if a sentence begins with punctuation. So it is advisable to check whether a line has any Devanagari character other than 'danda' (U+0964) and 'double danda' (U+0965). Those lines can be deemed Sanskrit text.
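
One way to sketch this refined check in Python is a character class that covers the Devanagari block but leaves out the two danda code points. The name `is_sanskrit_line` is hypothetical. Note that, taken literally, the rule also accepts lines containing only Devanagari digits (U+0966 to U+096F), such as a bare verse number.

```python
import re

# Any Devanagari code point (U+0900 to U+097F) except
# danda (U+0964) and double danda (U+0965).
DEVANAGARI_NON_DANDA = re.compile(r"[\u0900-\u0963\u0966-\u097F]")


def is_sanskrit_line(line):
    """Return True if the line contains at least one Devanagari character
    other than danda or double danda, anywhere in the line."""
    return DEVANAGARI_NON_DANDA.search(line) is not None


def count_sanskrit_lines(text):
    """Count the lines of text that pass is_sanskrit_line."""
    return sum(1 for line in text.splitlines() if is_sanskrit_line(line))
```

Because the search looks anywhere in the line, a line that opens with a bracket or quotation mark is still counted, while a line holding only dandas is not.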

coderbhupendra commented 8 years ago

@mineshmathew These are the documents: Sanskrit Hymns, English Hymns

All of these are line-to-line translations; you may also check. Yes, it's a good suggestion to check for '|' and "||". I will do that.

coderbhupendra commented 8 years ago

@kylepjohnson after making all the changes, I have added my repo under CLTK: Sanskrit_Parallel_Corpus. I hope this is what you meant. Or did you mean for me to add it following To-add-a-corpus-to-the-CLTK?

coderbhupendra commented 8 years ago

@kylepjohnson I have scraped 10 Rig Veda books; you can now see all 10 folders. But when I try to push changes to my repo under CLTK, it gives this error:

remote: Permission to cltk/Sanskrit_Parallel_Corpus.git denied to coderbhupendra.
fatal: unable to access 'https://github.com/cltk/Sanskrit_Parallel_Corpus.git/': The requested URL returned error: 403

I think I don't have permission to push changes to this repo anymore.

kylepjohnson commented 8 years ago

Thank you!

kylepjohnson commented 8 years ago

BTW this looks really excellent

coderbhupendra commented 8 years ago

@kylepjohnson I am not able to push to the repo. It gives the same error as above. I think you need to add me as a collaborator on this repo.

Also, I could not understand what you mean by "writable by the sanskrit group". And I can't rename this repo, as I don't have a settings option for it.

kylepjohnson commented 8 years ago

Ah, I thought you were a member already: https://github.com/orgs/cltk/teams/sanskrit-members

You should be good now!

coderbhupendra commented 8 years ago

@kylepjohnson I don't have the option to change the name of my repo under CLTK; there is no "Settings" option for it. I think only an owner of CLTK can rename the repos under it. Also, I have added a License section to the README; please see if this is what you wanted.

kylepjohnson commented 8 years ago

Ok, I just renamed it. I'll check the rest later.

coderbhupendra commented 8 years ago

ok, thanks.

coderbhupendra commented 8 years ago

@kylepjohnson I am now working on removing lines that repeat within individual documents. For example, in this poem 4 lines are the same.
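
A possible sketch of that cleanup, under two assumptions that are not stated above: the first occurrence of a repeated line is kept, and the matching English line is dropped together with its Sanskrit line so the two sides stay aligned. The function name `drop_repeated_pairs` is hypothetical.

```python
def drop_repeated_pairs(sanskrit_lines, english_lines):
    """Remove line pairs whose Sanskrit side already appeared in the document.

    Assumes the two lists are aligned line by line. The first occurrence of
    each Sanskrit line is kept, and the original order is preserved.
    """
    seen = set()
    sanskrit_out, english_out = [], []
    for sanskrit, english in zip(sanskrit_lines, english_lines):
        key = sanskrit.strip()  # ignore surrounding whitespace when comparing
        if key in seen:
            continue
        seen.add(key)
        sanskrit_out.append(sanskrit)
        english_out.append(english)
    return sanskrit_out, english_out
```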

kylepjohnson commented 8 years ago

PR #216 merged. Thanks! I'll close this issue, but keep doing any cleanup necessary.

Let me know when you're ready for a new topic.