erc-dharma / odia-mahabharata

ସାରଳା ଭାରତ (Sāraḷā Bhārata)

Various points #1

Open michaelnmmeyer opened 8 months ago

michaelnmmeyer commented 8 months ago

To help identify problems, I created four additional files. malten_full.txt, iast_full.txt and ori_full.txt hold the full text of the respective versions, concatenated into a single file, and without markers. missing_full.txt is a new version of missing_alpha.txt.

There are about 40 fewer lines in malten_full than in iast_full and ori_full, probably because some markers from malten_full were incorrectly transliterated. In malten_full itself, there are quite a few weird character sequences, like "coÔøΩkhaÔøΩ", which are preserved in iast_full but obviously should have been transliterated to something more meaningful.

Concerning the scans: they take up a lot of space, so I suggest we host them somewhere other than in a git repository. I will make PDFs out of them and send you the files.

Now some specific points:

(1) OK for ẏ and ḷ.

(2) We have both "b" and "w" transformed to "ବ". Is this normal? Should I also modify the "iast" version to use only one of these characters?

(3) For "ṛ", there are indeed several cases where it is preceded by a consonant. I found the following:

jṛ
tṛ
hṛ
nṛ
pṛ

Should I replace these with jr̥, etc. in the IAST version, globally? For the other instances of "ṛ", where it represents ଡ଼, should this symbol be treated as a consonant? That is, if a vowel follows, should that vowel be written as an initial vowel or as a vowel mark in Oriya script?

(4) The transliteration issues you spotted in Text_files_problems.txt are harder to tackle: I need more specific transformation rules. For instance, we have pUboGge > pUrbe but also eboGg > ebaM, and I do not know the criteria for deciding whether boGg > rb or boGg > aM.

arlogriffiths commented 8 months ago

Thanks. Can you update the README in light of what you say in the first few paragraphs of the above issue?

(2) Odia script makes no distinction between b and v and there is no graphic distinction between w and v either. I would personally have transliterated all cases of w and b in the source data as v. Please convert all cases of w to v but leave cases of b to be looked at later. (I don't know whether any pattern can be discerned in the typist's choices.)

(3)

jṛ: only one case. It may be a typo for jra, but for now let's make it jr̥.
tṛ: only one case. tṛtīẏa >> tr̥tīẏa.
hṛ: two cases. Replace both by hr̥.
nṛ: three cases. Replace all by nr̥.
pṛ: two cases. Replace both by pr̥. In the second word, also correct pṛthrabīki to pr̥thibīki.
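The replacements under (3) could be scripted roughly as follows. This is only a sketch: the function name `fix_vocalic_r` is made up for illustration, and it assumes that the five combinations listed above are the only places where ṛ stands for vocalic r̥, as the counts in this thread suggest. The text is normalised to NFC first so that ṛ typed as r + combining dot below is caught as well.

```python
import re
import unicodedata

R_DOT_BELOW = "\u1e5b"    # ṛ, precomposed r with dot below
R_RING_BELOW = "r\u0325"  # r̥, r followed by combining ring below


def fix_vocalic_r(text: str) -> str:
    """Rewrite j/t/h/n/p + ṛ as j/t/h/n/p + r̥ (hypothetical helper)."""
    text = unicodedata.normalize("NFC", text)
    # Only the five consonants discussed in the thread are handled;
    # every other ṛ (the one standing for the consonant) is left alone.
    text = re.sub(f"([jthnp]){R_DOT_BELOW}", rf"\1{R_RING_BELOW}", text)
    # Word-specific correction mentioned above: pr̥thrabīki -> pr̥thibīki.
    return text.replace("pr\u0325thrab\u012bki", "pr\u0325thib\u012bki")
```

Since there are fewer than ten cases in total, it is easy to check the result by hand with a diff after running it.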

(4) Could you try to execute the steps I have formulated? I don't see any other way than to eliminate the typist's confusions more or less on a case-by-case basis.

An additional point from my side: for compliance with ISO 15919/DHARMA, please change all cases of ṃ to ṁ in the source txt files.
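The two global character replacements requested in this comment (w to v, and ṃ to ṁ) are simple enough for a translation table. A minimal sketch, assuming the files are UTF-8 and that only lower-case w is affected; the name `apply_simple_replacements` is hypothetical, and which files to run it on is left open here.

```python
import unicodedata

# w -> v, and ṃ (U+1E43, dot below) -> ṁ (U+1E41, dot above).
# b is deliberately not in the table: those cases are to be looked at later.
REPLACEMENTS = str.maketrans({"w": "v", "\u1e43": "\u1e41"})


def apply_simple_replacements(text: str) -> str:
    """Normalise to NFC, then apply the character-level replacements."""
    # NFC makes sure an m followed by a combining dot below is treated as ṃ.
    return unicodedata.normalize("NFC", text).translate(REPLACEMENTS)
```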

michaelnmmeyer commented 8 months ago

OK for everything, except the treatment of ṛ and of clusters of vowels. Can you please correct the following assertions? They are probably incorrect.

Whenever "ṛ" is followed by a vowel, convert it to "ଡ଼" in the Oriya output. But keep "ṛ" in the iso transliteration.

Whenever we have a consonant followed by several vowels (except for consonant + ai/au + something that is not a vowel), all but the first vowel must take their initial form in the Oriya output. We assume that vowels after the first one each hold one character in transliteration: for instance, diai must be interpreted as d, i, a, i, not d, i, ai. In transliteration, each vowel after the first one must be represented in capitals.

arlogriffiths commented 8 months ago
  1. "Whenever "ṛ" is followed by a vowel, convert it to "ଡ଼" in the Oriya output. But keep "ṛ" in the iso transliteration." I think this is correct. Only the above cases jṛ etc. are to be converted to r̥ in DHARMA-ISO. And whenever ṛV occurs, ṛ must be a consonant akṣara and hence be converted to ଡ଼ in Oriya.

  2. I would restate the rest as follows:
     a. All cases of ai and au are ambiguous (whether ai/au or aI/aU is intended) and need to be cleaned up progressively, by identifying frequently occurring words and making the necessary replacements per word, then manually correcting less frequent cases.
     b. All other C+vowel+vowel combinations have to be interpreted as lower-case vowel plus upper-case vowel in DHARMA-ISO (e.g., nei > neI, basāile > basāIle).
     c. All vowel+vowel+vowel combinations must be interpreted as lower-case vowel plus upper-case vowel plus upper-case vowel, because the DHARMA-ISO independent vowels Ai and Au, presumably very rare anyhow, I expect to occur only at word-beginning and always to be followed by a consonant, if they occur at all. By this logic, diai has to be interpreted as diAI = ଦିଅଇ.
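To make the rules under 2 concrete, here is a rough sketch of how they could be applied mechanically. It follows the worked examples given above (nei > neI, basāile > basāIle, diai > diAI) and leaves a bare ai or au after a consonant untouched, since those are the ambiguous cases of 2a. The vowel inventory (a ā i ī u ū e o, without r̥) and the name `mark_independent_vowels` are assumptions made for this illustration.

```python
import re
import unicodedata

# Assumed vowel inventory: a ā i ī u ū e o (vocalic r̥ is left out here).
VOWELS = "a\u0101i\u012bu\u016beo"

# A run of two or more lower-case vowels directly after a consonant letter,
# i.e. after any letter that is not itself one of the vowels above.
CLUSTER = re.compile(rf"(?<=[^\W\d_{VOWELS}])[{VOWELS}]{{2,}}")


def mark_independent_vowels(text: str) -> str:
    """Upper-case every vowel after the first in a post-consonantal run."""
    def convert(match: re.Match) -> str:
        run = match.group(0)
        if run in ("ai", "au"):
            return run  # ambiguous (2a): leave for word-by-word review
        return run[0] + run[1:].upper()

    return CLUSTER.sub(convert, unicodedata.normalize("NFC", text))
```

With this sketch, `mark_independent_vowels("diai")` gives `"diAI"`, while a word such as `"kaila"` comes back unchanged and would still need the per-word treatment described in 2a. Running it twice is harmless, because vowels that are already upper-case are no longer part of a run.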

I hope this restatement is helpful. As I am not very familiar with medieval Oriya, some of my assumptions may be wrong. Before making batch replacements, it may be safest to supply me with a list of all cases of space+a+i and space+a+u so I can try to confirm my assumptions under 2c.
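The requested list of space+a+i and space+a+u cases could be produced along these lines. Again a sketch only: `initial_ai_au` is a made-up name, "space" is taken to mean any whitespace or the start of the text, and each token is kept as it stands up to the next whitespace, so attached punctuation is not stripped.

```python
import re
from collections import Counter

# "ai" or "au" at the start of a token: not preceded by a non-space character.
INITIAL_AI_AU = re.compile(r"(?<!\S)a[iu]\S*")


def initial_ai_au(text: str) -> Counter:
    """Count every whitespace-delimited token that begins with ai or au."""
    return Counter(INITIAL_AI_AU.findall(text))
```

Calling `initial_ai_au(text).most_common()` on the contents of iast_full.txt would give the tokens sorted by frequency, which fits the idea of handling frequent words first and rare ones by hand.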