Preparation of tokenized Pāli files

ayya-vimala commented 5 years ago

The SuttaCentral Bilara input data (https://github.com/suttacentral/bilara-data) are converted to tokenized data using /bin/tokanize.py

Currently all files are done except the Vinaya files, which are waiting to be segmented by the SC team.
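As a rough illustration of the conversion step, here is a minimal sketch that reads a Bilara-style JSON file (segment ID mapped to segment text) and turns each segment into a list of tokens. It is not the actual /bin/tokanize.py: the sample data, the function names, and the tokenization rules (lowercasing, stripping punctuation, splitting on whitespace) are all assumptions made for the example.

```python
import json
import re
import unicodedata

# Hypothetical sample in the Bilara layout: segment ID -> segment text.
SAMPLE_JSON = """{
  "dhp1:1": "Manopubbaṅgamā dhammā, ",
  "dhp1:2": "manoseṭṭhā manomayā; "
}"""

# Assumption: everything that is not a letter, digit or whitespace is punctuation.
PUNCTUATION = re.compile(r"[^\w\s]")


def tokenize_segment(text):
    """Lowercase one segment, strip punctuation and split it into tokens."""
    text = unicodedata.normalize("NFC", text).lower()
    text = PUNCTUATION.sub(" ", text)
    return text.split()


def tokenize_bilara(json_text):
    """Map every segment ID in a Bilara-style JSON string to its token list."""
    segments = json.loads(json_text)
    return {seg_id: tokenize_segment(text) for seg_id, text in segments.items()}


tokens = tokenize_bilara(SAMPLE_JSON)
for seg_id, words in tokens.items():
    print(seg_id, words)
```

Keeping the segment IDs as keys means that any match found later can be traced back to the SuttaCentral segment it came from.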

sebastian-nehrdich commented 5 years ago

Recently I had the opportunity to take a class on reading Pāli and Sanskrit in Kyōto again (after a break of some years from Pāli), and a few things come to mind at this point:

Even though many people claim that Tibetan translations of Sanskrit texts are 'so literal' or 'mechanical', we need to keep in mind that Sanskrit and Tibetan are linguistically as different as two languages can be (grammar and phonetics have basically nothing to do with each other; only the writing system is somewhat related).
Pāli and Sanskrit, however, are rather close relatives, so linguistically speaking, going from Sanskrit to Pāli is much easier than going from Sanskrit to Tibetan.
However, the disadvantage with Pāli is that we don't have many structurally close parallels. In the case of Sanskrit and Tibetan, we have hundreds of digitally available texts that are translations of each other (many of them, however, not complete), so we can leverage a lot of training data and then use heavy-duty networks such as transformer-based masked auto-encoders to overcome the linguistic differences between the languages through very long training runs.
For Pāli and Sanskrit, that's not really possible and might not even be necessary.
Maybe we can get away with a lightweight approach that combines a few factors:
- Most of the phonetic changes between Sanskrit and Pāli seem to be a kind of simplification of the Sanskrit phonetics (for examples, see: https://www.ancient-buddhist-texts.net/Textual-Studies/Grammar/Transforming-Sanskrit-into-Pali.htm)
- The same goes for Sandhi. I don't want to say that Pāli Sandhi is less complex, but it appears to me that there are some differences which can be broken down into mechanical rules (Achim Fahs' Pāli grammar is very helpful for these questions)
- We could therefore try to determine a fixed set of rules for reducing Sanskrit phonetic/Sandhi complexity, then use regular expressions to rewrite the Sanskrit e-texts accordingly. In this way, we can create 'fake Middle Indic' versions of the GRETIL Sanskrit texts, which we can then run through lightweight neural networks (for example fastText)
- Apart from the phonetic changes, the grammatical structure of Sanskrit and Pāli seems almost identical to me (regarding common word order and the use of cases). Some things have happened (e.g. the collapse of the dative in favor of the genitive and a general change/reduction of the case endings), but these shouldn't worry us too much
- More problematic is the collapse of the Sanskrit verbal system when it comes to Pāli. This is where things really get a bit messy, since I vaguely remember that Pāli lost most of the past tense forms with the exception of the aorist, which in turn is rarely seen in Buddhist Sanskrit texts (which seem to prefer the perfect at times). Anyway, I think that regarding verbs there really is not much we can do. Maybe it is possible to replace some of the most commonly used Sanskrit verbs with their Pāli equivalents (e.g. bhavati -> hoti, abhavat -> ahosi), but we need to determine these cases manually, and I assume that apart from a handful of clear examples, we cannot do very much here.
- To sum up: in order to be successful with the Sanskrit<>Pāli transformation, it would be very helpful to determine which changes from Sanskrit to Pāli can be applied mechanically. Additionally, a Sanskrit<>Pāli dictionary could be very useful. This might already be enough to train a network which can mine the corpus for common sequences
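The rule-based idea in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions: the four phonetic rules are a tiny hand-picked sample (sibilant merger, rm -> mm, word-final -āḥ -> -ā and -aḥ -> -o), not a vetted rule set; the verb lexicon contains only the two pairs mentioned above; and the names `PHONETIC_RULES`, `VERB_LEXICON` and `pseudo_pali` are hypothetical.

```python
import re

# Whole-word verb replacements; only the two pairs mentioned in this thread.
VERB_LEXICON = {
    "bhavati": "hoti",
    "abhavat": "ahosi",
}

# Illustrative phonetic rules (assumptions, far from complete); order matters.
PHONETIC_RULES = [
    (r"[śṣ]", "s"),   # the Sanskrit sibilants ś and ṣ merge into s
    (r"rm", "mm"),    # dharma -> dhamma, karma -> kamma
    (r"āḥ\b", "ā"),   # word-final -āḥ loses its visarga
    (r"aḥ\b", "o"),   # word-final -aḥ becomes -o
]


def pseudo_pali(text):
    """Turn romanized Sanskrit into a rough 'fake Middle Indic' form."""
    words = [VERB_LEXICON.get(word, word) for word in text.lower().split()]
    text = " ".join(words)
    for pattern, replacement in PHONETIC_RULES:
        text = re.sub(pattern, replacement, text)
    return text


for sample in ["dharma", "karma", "bhavati", "vahataḥ padam"]:
    print(sample, "->", pseudo_pali(sample))
```

The lexicon lookup runs before the regex rules so that irregular verbs are swapped as whole words and never half-rewritten by the phonetic rules. A real rule set would need many more entries and careful ordering, ideally checked against known parallel passages.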

ayya-vimala commented 5 years ago

> however, the disadvantage with Pāli is that we don't have many structurally close parallels

I would not say that. The Dhammapadas and Udānavargas are very closely related. Making a list of all of those is still on my to-do list (it is in the parallels tables on SC), and there are also other texts that have some close parallels.

For instance, Pāli:

Manopubbaṅgamā dhammā, manoseṭṭhā manomayā; Manasā ce paduṭṭhena, bhāsati vā karoti vā; Tato naṃ dukkhamanveti, cakkaṃva vahato padaṃ.

Sanskrit:

manaḥpūrvaṅgamā dharmā manaḥśreṣṭhā manojavāḥ । manasā hi praduṣṭena bhāṣate vā karoti vā । tatas taṁ duḥkham anveti cakraṁ vā vahataḥ padam ।।

Prakrit:

manopūrvvaṁgamā dhammā manośreṣṭhā manojavā । manasā ca praduṣṭena bhāṣate vā karoti vā । tato naṁ dukham anneti cakram vā vahato padaṁ ।।

Gāndhārī:

maṇopuvagama dhama maṇośeṭha maṇojava maṇasa hi praduṭheṇa bhaṣadi va karodi va tado ṇa duhu amedi cako va vahaṇe pathi ◦

So that's just the first verse. You don't have to be fluent in these languages to see that they are very closely related! There might not be hundreds of texts in total but there is a lot anyway.
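To put a rough number on how close these versions are, one could compare the first line of the verse across languages with a generic string-similarity measure. The sketch below uses Python's standard `difflib.SequenceMatcher`; the normalization (lowercasing, stripping punctuation) and the choice of measure are assumptions for illustration, not something the project is known to use.

```python
import re
from difflib import SequenceMatcher

# First line of the verse in three of the versions quoted above.
PALI = "Manopubbaṅgamā dhammā, manoseṭṭhā manomayā;"
SANSKRIT = "manaḥpūrvaṅgamā dharmā manaḥśreṣṭhā manojavāḥ ।"
PRAKRIT = "manopūrvvaṁgamā dhammā manośreṣṭhā manojavā ।"


def normalize(line):
    """Lowercase, replace punctuation with spaces and collapse whitespace."""
    line = re.sub(r"[^\w\s]", " ", line.lower())
    return " ".join(line.split())


def similarity(a, b):
    """Character-level similarity between 0.0 and 1.0 after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


for name, other in [("Sanskrit", SANSKRIT), ("Prakrit", PRAKRIT)]:
    print(f"Pāli vs {name}: {similarity(PALI, other):.2f}")
```

A character-level ratio like this is only a crude signal, but it shows that such parallels are detectable before any language-specific rules are applied.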

Next to that, we have a Pāli-Sanskrit dictionary that lists over 4000 words. But you can also see some of the things that are happening between the languages anyway, for instance dharma-dhamma, karma-kamma, etc. Sanskrit 'rm' always is 'mm' in Pāli.

However, right now I think we should concentrate on just getting the Pāli-Pāli matching done and deal with Pāli-Sanskrit later. I think this issue is just about Pāli-Pāli and about tokenizing the files, which is basically on my plate (i.e. tokenizing the segmented SC files for the neural network). You already have everything you need to do some runs with that and pick out mistakes. The only things that still need to be tokenized are the rest of the Vinaya files, and they have to come in segmented form from SC first.

ayya-vimala commented 5 years ago

This is a bit of an ongoing issue, but the latest files have just been produced.

sebastian-nehrdich commented 5 years ago

I just tried to run your newest set of files and my Python installation is segfaulting at the moment. I assume this is happening because I updated the NVIDIA driver in the last few days and now the fragile CUDA/NVIDIA/TensorFlow stack is disturbed. Is it OK for you to wait with the new Pāli until November? I guess I will have to do a clean rebuild of my software setup at that point anyway, and after that it should be fine to recalculate the Pāli...

On Sat, Oct 12, 2019 at 8:07 PM Ven. Vimala notifications@github.com wrote:

> This is a bit of an ongoing issue but latest files have just been produced.


ayya-vimala commented 5 years ago

I see. Sure, if it is problematic then just leave it for a while.

BuddhaNexus / buddhanexus

Preparation of tokenized Pāli files #58