drdhaval2785 / SanskritSpellCheck

spell checking based on patterns

Count 100 vowels vs. 130 consonants #5

Closed gasyoun closed 9 years ago

gasyoun commented 9 years ago

Can we take some GRETIL texts (GRETIL_ALL_2013-10-09_UTF8_FOR_PERSONAL_USE_ONLY.zip) and count how many vowels (V) there are per 100 occurrences of consonants (C) in a Sanskrit text? Something like 100 vowels vs. 130 consonants, please. I managed only the Rigveda, and even that I'm unsure about: I used HK rather than SLP1, so the results are dirty. What is the real ratio? Sample SLP1 text: https://github.com/drdhaval2785/SanskritSpellCheck/blob/master/meghadhuta-CVC-SLP1.txt

drdhaval2785 commented 9 years ago

Done.
Input file: https://github.com/drdhaval2785/SanskritSpellCheck/blob/master/meghadhuta.txt
Run: https://github.com/drdhaval2785/SanskritSpellCheck/blob/master/countvowels.php

Output -

Occurrence of vowels - 8304
Occurrence of consonants - 11519
Ratio of consonants per 100 vowels -138.71395026793
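For readers who want to see how such a count can be made, here is a minimal sketch. It is not the repository's countvowels.php. The consonant class is the one quoted later in this thread (it treats M and H as consonants); the vowel class is an assumption based on the standard SLP1 letters, and `countCv` is a hypothetical helper name.

```php
<?php
// Consonant class as it appears in the preg_split call quoted in this thread
// (anusvara M and visarga H are counted as consonants).
const SLP1_CONSONANTS = 'kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshMH';
// Assumption: the standard SLP1 vowel letters; the thread does not list them.
const SLP1_VOWELS = 'aAiIuUfFxXeEoO';

/**
 * Hypothetical helper: count vowels and consonants in an SLP1 string and
 * return the number of consonants per 100 vowels.
 */
function countCv(string $text): array
{
    // preg_match_all returns the number of matches when $matches is omitted.
    $vowels = preg_match_all('/[' . SLP1_VOWELS . ']/', $text);
    $consonants = preg_match_all('/[' . SLP1_CONSONANTS . ']/', $text);
    $ratio = $vowels > 0 ? $consonants / $vowels * 100 : 0.0;

    return ['vowels' => $vowels, 'consonants' => $consonants, 'ratio' => $ratio];
}

$result = countCv('rAmaH vanam gacCati');
echo "Occurrence of vowels - {$result['vowels']}\n";
echo "Occurrence of consonants - {$result['consonants']}\n";
echo "Ratio of consonants per 100 vowels - {$result['ratio']}\n";
```

Spaces, digits and punctuation match neither class, so they are simply ignored.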

gasyoun commented 9 years ago

Fantastic job, well done, lovely. Looking for bigger text samples:

13,119.008a sa tatheti pratiśrutya kīṭo vartmany atiṣṭhata
13,119.008b_0599_01 śakaṭavrajaś ca sumahān āgataś ca yadṛcchayā
13,119.008b_0599_02 cakrākrameṇa bhinnaś ca kīṭaḥ prāṇān mumoca ha
13,119.008b*0599_03 saṃbhūtaḥ kṣatriyakule prasādād amitaujasaḥ
13,119.008c tam ṛṣiṃ draṣṭum agamat sarvāsv anyāsu yoniṣu

I'll have to try to find a way to weed out the numbers, otherwise abcd at end of numbers might hurt.

drdhaval2785 commented 9 years ago

Bad luck. Regex ahead. Beware
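One way the reference prefixes could be weeded out is sketched below. The pattern is an assumption derived only from the sample lines quoted above (book, chapter, verse, a pada letter, and an optional `*NNNN_NN` or `_NNNN_NN` suffix); other GRETIL files may need a different pattern. `stripReferences` is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: remove a leading reference such as "13,119.008a" or
 * "13,119.008b*0599_03" from every line, so that the pada letters a/b/c/d
 * are not counted as text.
 */
function stripReferences(string $text): string
{
    // ^            start of a line (m flag)
    // \d+,\d+\.\d+ book,chapter.verse
    // [a-z]        pada letter
    // (?:...)?     optional "*0599_03" or "_0599_01" style suffix
    // [ \t]+       the blanks separating the reference from the verse text
    $pattern = '/^\d+,\d+\.\d+[a-z](?:[*_]\d+_\d+)?[ \t]+/m';

    return preg_replace($pattern, '', $text);
}

$sample = "13,119.008a sa tatheti pratiśrutya kīṭo vartmany atiṣṭhata\n"
        . "13,119.008b*0599_03 saṃbhūtaḥ kṣatriyakule prasādād amitaujasaḥ\n";
echo stripReferences($sample);
```

The pattern is anchored to the start of each line, so digits or letters inside the verse text itself are left alone.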

gasyoun commented 9 years ago

Regex finished.

Atharvaveda
Occurrence of vowels - 207230
Occurrence of consonants - 275159
Ratio of consonants per 100 vowels -132.7794412502

Meghadhuta
Occurrence of vowels - 8304
Occurrence of consonants - 11519
Ratio of consonants per 100 vowels -138.71395026793

Ramayana
Occurrence of vowels - 620468
Occurrence of consonants - 853343
Ratio of consonants per 100 vowels -137.5321229039

Mahabharata
Occurrence of vowels - 3544615
Occurrence of consonants - 4897052
Ratio of consonants per 100 vowels -138.15

1) Can we take the file name (atharvaveda-CVC-SLP1) from $file=file_get_contents("atharvaveda-CVC-SLP1.txt"); and add it to the output, as in "Ratio of consonants per 100 vowels (per atharvaveda-CVC-SLP1) -132.7794412502"? Though 5) would make this unnecessary.

2) Can we have 132.78 instead of 132.7794412502, please?

3) "Occurrence of vowels" -> Occurrence of vowels (V)

4) "Occurrence of consonants" -> Occurrence of consonants (C)

5) Run $file=file_get_contents for all files in a folder, like http://stackoverflow.com/questions/15041608/searching-all-files-in-folder-for-strings

6) Counting mbh-CVC-SLP1 got stuck: it showed only Occurrence of vowels - 3544615 and, below that, Fatal error: Allowed memory size of 1048576000 bytes exhausted (tried to allocate 36 bytes) in C:\xampp\htdocs\countvowels.php on line 31

The initial setting was memory_limit=128M; when I changed it to memory_limit=-1, the Apache server did not launch. So $split1=preg_split('/([kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshMH])/',$file,0,PREG_SPLIT_DELIM_CAPTURE); is crashing it on the MBh file :) After that the counting went on for the older files, but for all the newer ones I get only Warning: file_get_contents(m.txt): failed to open stream: No such file or directory in C:\xampp\htdocs\countvowels.php on line 13

ini_set("memory_limit","10000M"); helped me to get Occurrence of vowels - 3544615 and Occurrence of consonants - 4897052, so I opened my calculator and got 1.38, tada.
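Points 1) to 6) could be handled together along the following lines. This is a sketch under stated assumptions, not the repository's countvowels.php: it reads each file in chunks and tallies bytes with `count_chars`, which presumably avoids the very large array that `preg_split` with `PREG_SPLIT_DELIM_CAPTURE` builds for the MBh file. The vowel class (standard SLP1 letters), the `*-CVC-SLP1.txt` file layout, and the helper names `countCvInFile` and `formatReport` are assumptions.

```php
<?php
// Consonant class from the preg_split call above; vowel class is assumed.
const SLP1_CONSONANTS = 'kKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzshMH';
const SLP1_VOWELS = 'aAiIuUfFxXeEoO';

/** Hypothetical helper: count V and C in one file, reading it in chunks. */
function countCvInFile(string $path, int $chunkSize = 1048576): array
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $path");
    }
    $totals = array_fill(0, 256, 0); // one counter per byte value
    while (!feof($handle)) {
        $chunk = fread($handle, $chunkSize);
        if ($chunk === false) {
            break;
        }
        // Mode 1 returns only the byte values that occur in the chunk.
        foreach (count_chars($chunk, 1) as $byte => $count) {
            $totals[$byte] += $count;
        }
    }
    fclose($handle);

    $sum = function (string $letters) use ($totals): int {
        $n = 0;
        foreach (str_split($letters) as $letter) {
            $n += $totals[ord($letter)];
        }
        return $n;
    };

    return ['vowels' => $sum(SLP1_VOWELS), 'consonants' => $sum(SLP1_CONSONANTS)];
}

/** Hypothetical helper: labels per points 1), 3), 4); two decimals per point 2). */
function formatReport(string $path, array $counts): string
{
    $name = basename($path, '.txt');
    $ratio = $counts['vowels'] > 0 ? $counts['consonants'] / $counts['vowels'] * 100 : 0;

    return "Occurrence of vowels (V) - {$counts['vowels']}\n"
        . "Occurrence of consonants (C) - {$counts['consonants']}\n"
        . "Ratio of consonants per 100 vowels (per $name) - "
        . number_format($ratio, 2, '.', '') . "\n";
}

// Point 5): assumed layout, all *-CVC-SLP1.txt files sit next to this script.
foreach (glob(__DIR__ . '/*-CVC-SLP1.txt') ?: [] as $path) {
    echo formatReport($path, countCvInFile($path)), "\n";
}
```

Because SLP1 uses only ASCII letters, counting bytes is enough here, and a chunk boundary cannot split a counted character. Memory use stays near the chunk size regardless of file size, so raising memory_limit should not be needed for this approach.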

drdhaval2785 commented 9 years ago

This seems to be more documentation than a request. Is there anything left for me to do? If you have already made some corrections for your needs, pushing them to GitHub may help. Otherwise I will treat this issue as closed.

gasyoun commented 9 years ago

@drdhaval2785 right, the documentation is there, but 5) and 2) are still missing. 6) is partly fixed with ini_set("memory_limit","100000M"); so I'll push it for the Mahabharata.

drdhaval2785 commented 9 years ago

@gasyoun

Point 2) is done in the latest commit. Sample output with the meghaduta-cvc-slp1 file:

Occurrence of vowels - 8304
Occurrence of consonants - 11519
Ratio of consonants per 100 vowels -138.71

Points 5) and 6) are not clear to me. I will let you close this issue if these two are not that important.

gasyoun commented 9 years ago

Let's close it.