BuddhaNexus / buddhanexus

Backend for the Buddhanexus project

Preparation of tokenized Pali files #58

Closed ayya-vimala closed 4 years ago

ayya-vimala commented 5 years ago

The SuttaCentral Bilara input data (https://github.com/suttacentral/bilara-data) are converted to tokenized data using /bin/tokanize.py.

Currently all files are done except the Vinaya files. Waiting for those to be segmented by the SC team.
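
As a rough illustration of the conversion step described above, here is a minimal Python sketch. It is not the project's /bin/tokanize.py, whose internals are not shown in this thread. It assumes that a Bilara root file is a JSON object mapping segment IDs to text strings, and it uses plain lowercasing, punctuation removal and whitespace splitting as a stand-in for whatever the real script does. The function name tokenize_segments and the sample segment IDs are hypothetical.

```python
import json
import re
import unicodedata

# Anything that is neither a word character nor whitespace is treated as
# punctuation in this sketch (an assumption, not the project's actual rule).
PUNCTUATION = re.compile(r"[^\w\s]")


def tokenize_segments(bilara_json: str) -> dict[str, list[str]]:
    """Map each segment ID to a list of lowercased, punctuation-free tokens."""
    segments = json.loads(bilara_json)
    tokenized = {}
    for segment_id, text in segments.items():
        # NFC keeps letters such as "ā" or "ṭ" as single code points so that
        # the punctuation pattern does not strip their diacritics.
        normalized = unicodedata.normalize("NFC", text.lower())
        tokenized[segment_id] = PUNCTUATION.sub(" ", normalized).split()
    return tokenized


# Hypothetical two-segment input in the assumed Bilara layout.
sample = json.dumps({
    "dhp1:1": "Manopubbaṅgamā dhammā, ",
    "dhp1:2": "manoseṭṭhā manomayā; ",
})
print(tokenize_segments(sample))
```

Keeping the segment IDs as keys means that any match found on the tokens can still be traced back to the original SC segment.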

sebastian-nehrdich commented 5 years ago

Recently I had the opportunity to take a class on reading Pāli and Sanskrit in Kyōto again (after a break of some years from Pāli), and a few things come to mind at this point:

ayya-vimala commented 5 years ago

however, the disadvantage with Pāli is that we don't have many structurally close parallels

I would not say that. The Dhammapadas and Udānavargas are very closely related. Making a list of all of those is still on my to-do list (they are in the parallels tables on SC), and there are also other texts that have some close parallels.

For instance, in Pali:

Manopubbaṅgamā dhammā, manoseṭṭhā manomayā; Manasā ce paduṭṭhena, bhāsati vā karoti vā; Tato naṃ dukkhamanveti, cakkaṃva vahato padaṃ.

Sanskrit:

manaḥpūrvaṅgamā dharmā manaḥśreṣṭhā manojavāḥ । manasā hi praduṣṭena bhāṣate vā karoti vā । tatas taṁ duḥkham anveti cakraṁ vā vahataḥ padam ।।

Prakrit:

manopūrvvaṁgamā dhammā manośreṣṭhā manojavā । manasā ca praduṣṭena bhāṣate vā karoti vā । tato naṁ dukham anneti cakram vā vahato padaṁ ।।

Gandhari:

maṇopuvagama dhama maṇośeṭha maṇojava maṇasa hi praduṭheṇa bhaṣadi va karodi va tado ṇa duhu amedi cako va vahaṇe pathi ◦

So that's just the first verse. You don't have to be fluent in these languages to see that they are very closely related! There might not be hundreds of texts in total but there is a lot anyway.
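
To put a rough number on that visual impression, here is a minimal Python sketch that strips diacritics and compares the opening words of each version with the standard library's difflib.SequenceMatcher. This is only a character-level illustration, not the matching method used by BuddhaNexus, and the helper names are made up for this example.

```python
import unicodedata
from difflib import SequenceMatcher


def strip_diacritics(text: str) -> str:
    """Lowercase the text and drop combining marks (ā -> a, ṭ -> t, ...)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after stripping diacritics."""
    return SequenceMatcher(None, strip_diacritics(a), strip_diacritics(b)).ratio()


# Opening words of the verse in each language, copied from the comment above.
opening = {
    "pali": "Manopubbaṅgamā dhammā",
    "sanskrit": "manaḥpūrvaṅgamā dharmā",
    "prakrit": "manopūrvvaṁgamā dhammā",
    "gandhari": "maṇopuvagama dhama",
}

for language, text in opening.items():
    score = similarity(opening["pali"], text)
    print(f"pali vs {language}: {score:.2f}")
```

A score of 1.00 means the stripped strings are identical; lower scores reflect spelling and sound differences rather than differences in meaning.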

In addition, we have a Pali-Sanskrit dictionary that lists over 4000 words. But you can also see some regular correspondences between the languages anyway, for instance dharma-dhamma, karma-kamma, etc.: Sanskrit 'rm' regularly becomes 'mm' in Pali.
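
The 'rm' to 'mm' correspondence mentioned above can be sketched as a tiny rewrite rule in Python. This encodes only the single pattern named in this comment; the rule table and the function name are hypothetical, and a real Sanskrit-to-Pali mapping would need many more rules and would still have to be checked against the dictionary.

```python
# Sound correspondences as (Sanskrit cluster, Pali cluster) pairs.
# Only the one rule from the comment above is included here.
SANSKRIT_TO_PALI_RULES = [("rm", "mm")]


def pali_candidate(sanskrit_word: str) -> str:
    """Apply each rewrite rule in order to guess a Pali spelling."""
    word = sanskrit_word
    for sanskrit_cluster, pali_cluster in SANSKRIT_TO_PALI_RULES:
        word = word.replace(sanskrit_cluster, pali_cluster)
    return word


for sanskrit_word in ("dharma", "karma"):
    print(sanskrit_word, "->", pali_candidate(sanskrit_word))
```

The output is only a candidate spelling: words that the rules do not cover are returned unchanged.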

However, right now I think we should concentrate on just getting Pali-Pali done and deal with Pali-Sanskrit later on. I think this issue is just about Pali-Pali and about tokenizing the files (i.e. tokenizing the segmented SC files for the neural network), which is basically on my plate. You already have everything you need to do some runs with that and pick out mistakes. The only thing that still needs to be tokenized is the rest of the Vinaya files, and they have to come in segmented form from SC first.

ayya-vimala commented 5 years ago

This is a bit of an ongoing issue, but the latest files have just been produced.

sebastian-nehrdich commented 5 years ago

I just tried to run your newest set of files and my Python installation is segfaulting at the moment. I assume this is happening because I updated the nvidia-driver in the last few days and the fragile CUDA/NVIDIA/TensorFlow stack is now broken. Is it OK for you to wait with the new Pāli until November? I guess I will have to do a clean rebuild of my software setup at that point anyway, and after that it should be fine to recalculate the Pāli...

On Sat, Oct 12, 2019 at 8:07 PM Ven. Vimala notifications@github.com wrote:

This is a bit of an ongoing issue but latest files have just been produced.

ayya-vimala commented 5 years ago

I see. Sure, if it is problematic then just leave it for a while.