compliance with TG - Githubissues

arlogriffiths commented 7 months ago

@manufrancis has not so far changed representation of anusvāra with ṃ to ṁ in his editions of Pallava inscriptions and hence these editions are not yet compliant with the DHARMA TG. There may be other points on which he has not brought his transliteration into compliance either. @michaelnmmeyer — Can you make the necessary replacements for him?

manufrancis commented 7 months ago

@michaelnmmeyer Dear Michaël, I'll take care of this.

michaelnmmeyer commented 7 months ago

@manufrancis I can take care of it if you haven't already done so; it's quick to do.

@arlogriffiths Quite a lot of people use ṃ ṛ ṝ ḷ ḹ (probably because their input method system uses these characters). I believe it is safe to globally substitute ṃ ṛ ṝ with ṁ r̥ r̥̄, and maybe ḹ with l̥̄, but substituting ḷ with l̥ should only be done for Sanskrit. Is this correct?

manufrancis commented 7 months ago

@michaelnmmeyer

"change ṃ to ṁ": done for my corpora.
Regarding "I believe it is safe to globally substitute ṃ ṛ ṝ with ṁ r̥ r̥̄, and maybe ḹ with l̥̄, but substituting ḷ with l̥ should only be done for Sanskrit": replacing ḷ with l̥ is not safe even for Sanskrit, as in Grantha script we have both ḷ (for instance in coḷa or in cuḷā = cūḍā) and l̥.

danbalogh commented 7 months ago

Do wait for Arlo's opinion. As far as I can see, all of those substitutions except ḷ are safe. The consonant ḷ also occurs in Sanskrit other than Grantha, for example many of my Eastern Cālukya inscriptions (in names [including the name Cāḷukya itself] as well as in non-standard spelling of Sanskrit words). So that character should be left alone, but since the actual vocalic l̥ occurs less than once in a blue moon, I don't think that should be a problem.

michaelnmmeyer commented 7 months ago

@manufrancis @danbalogh Oops, OK.

arlogriffiths commented 3 months ago

The vocalic l̥ is actually common in Old Javanese inscriptions, as shorthand for the syllable lə. Conversely, the retroflex consonant ḷ does not occur there at all. So I'd expand the scope of Manu's rule "substituting ḷ with l̥ should only be done for Sanskrit" to "substituting ḷ with l̥ should only be done for Sanskrit and Old Javanese".

I agree with all Manu and Dan have said.

arlogriffiths commented 2 months ago

I think I have to correct my answer of last month. Since, as Dan previously pointed out, in some cases ḷ is used in Sanskrit, my reformulation of the rule to "substituting ḷ with l̥ should only be done for Sanskrit and Old Javanese" was wrong.

Cases of ṃ ṛ ṝ may occur in quotations and should not necessarily always be replaced with ṁ r̥ r̥̄. Any such replacements that are made should be limited to text and apparatus nodes of our xml files.

In brief, I think that in some repos known to contain cases of non-compliance with TG, case-by-case replacements can be made with due limitation to specific nodes. But I don't think it's a good idea to implement any global replacement rules applying to all parts of all files. And basically everybody should (be trained to) comply with TG, and cases of non-compliance should gradually be polished away.

What is the status of your work on this issue, @michaelnmmeyer? Can we close it?

michaelnmmeyer commented 2 months ago

Fixing transliteration issues is too complicated to be implemented reliably, thus I will leave it to authors.

arlogriffiths commented 2 months ago

Perhaps you could nevertheless generate per repo a list of occurrences of ṃ ṛ ṝ ḷ ḹ? That will help encoders and PIs to follow up and weed out any cases of non-compliance with TG.

michaelnmmeyer commented 2 months ago

About 1,000 texts (1/3 of our collection) use these characters, so generating a basic list would not be useful. I will try to find something for prioritizing repos and texts to check.

arlogriffiths commented 2 months ago

thanks. for me (and I guess for most team members) it is easy to do multifile search at repo level, but not higher. so if I know that in a given repo, for which I am responsible, there are instances of the offending characters, I can search them and weed them out.

michaelnmmeyer commented 2 months ago

Here is a list. Numbers within brackets represent respectively:

A. Number of occurrences of ṃ ṛ ṝ ḷ ḹ
B. Number of occurrences of ṁ r̥ r̥̄ l̥ l̥̄

We expect to find ṃ ṛ ṝ ḷ ḹ much more rarely than ṁ r̥ r̥̄ l̥ l̥̄, so repositories where this is not the case are more likely to present encoding issues. However, languages are not taken into account, so this can be very wrong (as for tfa-pallava-epigraphy).

[ ] tfa-sii-epigraphy (22447, 2150)
[ ] tfa-pallava-epigraphy (2202, 462)
[ ] tfb-maitraka-epigraphy (1289, 2632)
[ ] tfa-cirkali-epigraphy (760, 11)
[ ] tfa-tamilnadu-epigraphy (740, 5)
[ ] tfb-daksinakosala-epigraphy (582, 1555)
[ ] tfb-telugu-epigraphy (573, 604)
[x] tfb-vengicalukya-epigraphy (548, 9715)
[ ] tfb-somavamsin-epigraphy (518, 827)
[ ] tfb-kalyanacalukya-epigraphy (429, 1011)
[ ] tfb-karnataka-epigraphy (390, 371)
[ ] tfc-khmer-epigraphy (271, 15269)
[ ] tfa-melappaluvur-kilappaluvur-epigraphy (136, 0)
[ ] tfb-bengalcharters-epigraphy (94, 3630)
[ ] tfa-kotumpalur-epigraphy (94, 0)
[ ] tfa-uttiramerur-epigraphy (93, 3)
[ ] tfb-eiad-epigraphy (89, 2760)
[ ] tfa-tiruvavatuturai-TN-epigraphy (79, 14)
[x] tfb-badamicalukya-epigraphy (62, 522)
[x] siddham (50, 7626)
[ ] tfc-nusantara-epigraphy (33, 14752)
[ ] tfa-pandya-epigraphy (25, 8)
[ ] tfa-tamil-outside-TN-epigraphy (22, 0)
[ ] tfc-campa-epigraphy (19, 1295)
[ ] tfa-cempiyan-mahadevi-epigraphy (15, 0)
[x] tfb-bhaumakara-epigraphy (8, 13)
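
For anyone who wants to recompute the two numbers for a folder of their own, here is a minimal Python sketch. It is not the script that produced the list above; the function names are made up for illustration, files are assumed to be UTF-8, and only lowercase forms are counted. The text is normalised to NFC first, so that a decomposed m + dot below is counted as ṃ.

```python
import unicodedata
from pathlib import Path

# Characters to be phased out (precomposed NFC code points).
OLD_CHARS = ("\u1e43", "\u1e5b", "\u1e5d", "\u1e37", "\u1e39")  # ṃ ṛ ṝ ḷ ḹ
ANUSVARA = "\u1e41"      # ṁ
RING_BELOW = "\u0325"    # combining ring below, as in r̥ and l̥


def count_old_and_new(text: str) -> tuple[int, int]:
    """Return (A, B) for one string, matching lowercase forms only."""
    nfc = unicodedata.normalize("NFC", text)
    a = sum(nfc.count(ch) for ch in OLD_CHARS)
    # Counting r̥ and l̥ also covers r̥̄ and l̥̄, which begin with the same
    # two code points, so the long vowels are not counted twice.
    b = (nfc.count(ANUSVARA)
         + nfc.count("r" + RING_BELOW)
         + nfc.count("l" + RING_BELOW))
    return a, b


def count_in_directory(root: str) -> tuple[int, int]:
    """Sum (A, B) over every .xml file below `root` (assumed UTF-8)."""
    total_a = total_b = 0
    for path in Path(root).rglob("*.xml"):
        a, b = count_old_and_new(path.read_text(encoding="utf-8"))
        total_a += a
        total_b += b
    return total_a, total_b
```

Like the list above, this counts raw characters anywhere in the file and ignores language and XML structure, so a high A only flags a folder for review.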

danbalogh commented 2 months ago

I just want to add that silently normalising quoted transliteration to our transliteration scheme, even in block quotes, is acceptable to me. I'm not saying that we should do it, but if we did, then batch-replacing ṃ, ṝ and ḹ with ṁ, r̥̄ and l̥̄ would become an option. Alternatively, instances of ṃ, ṝ and ḹ that are not children of a <q> element could probably still be safely batch-replaced. Then the only ones to be checked on a case-by-case basis would be ṛ (almost always incorrect for r̥, but very rarely correctly denoting the NIA retroflex flap consonant) and ḷ (in Old Javanese and perhaps some related corpora, most likely incorrect for l̥; in some corpora [Dravidian language or Dravidian-influenced Indo-Aryan] usually correctly denoting the retroflex glide but very rarely erroneous for l̥; in all other corpora, presumably always erroneous for l̥).
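
To make the distinction concrete, the three replacements treated here as unambiguous could be sketched in Python as below. This is only an illustration: `replace_unambiguous` is a hypothetical name, it works on a plain string and knows nothing about <q> elements, and it deliberately leaves ṛ and ḷ untouched because those need case-by-case review. Note that it also normalises the whole string to NFC, which may change the encoding (not the appearance) of other characters.

```python
import unicodedata

# Only the three characters treated above as safe to batch-replace.
SAFE_MAP = str.maketrans({
    "\u1e43": "\u1e41",          # ṃ -> ṁ
    "\u1e5d": "r\u0325\u0304",   # ṝ -> r̥̄ (r + ring below + macron)
    "\u1e39": "l\u0325\u0304",   # ḹ -> l̥̄ (l + ring below + macron)
})


def replace_unambiguous(text: str) -> str:
    """Replace ṃ, ṝ and ḹ; leave ṛ and ḷ for manual review."""
    # NFC first, so that decomposed input (e.g. m + combining dot below)
    # is matched as well.
    return unicodedata.normalize("NFC", text).translate(SAFE_MAP)
```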

danbalogh commented 2 months ago

I have checked my own subcorpora (Vengi, Badami and Siddham). I've found:

two instances of ṃ wrongly used instead of ṁ, corrected
two more instances of ṃ in cited Mahābhārata text, also corrected (normalised to our transliteration)
two instances of ṛ in the verbatim replication of a previous editor's uninterpretable transliteration (he probably used ṛ to denote the sound we transliterate as ḻ), which I'm retaining
no instances of ṛ ṝ ḹ
hundreds of ḷ, which I've skimmed, and all seem to be correct for the retroflex glide.

arlogriffiths commented 2 months ago

Thanks. You can presumably also help by doing the same check and clean-up for maitraka, daksinakosala, telugu and bhaumakara. Ryosuke does not seem to be receiving GitHub notifications, so perhaps you can step in for bengalcharters too. Please ask Samana if she can take care of everything that is Kannada-related.

@amandinebricout : can you do the same kind of check and clean-up as Dan has described above for your somavamsin files? @manufrancis : tfa is your cup of tea.

I'll take care of everything that's tfc plus tfb-eiad.

danbalogh commented 2 months ago

A slight snag in this is that back in the early days we told people in the EGD that if they have difficulty producing r̥ on their keyboards, they can use ṛ instead, and this would be converted automatically. It seems, after looking at the repositories, that a lot of people have availed themselves of this option. I think we can be pretty sure that none of these people have also used ṛ for the NIA retroflex flap, but I think they are in a better position to decide this themselves. So I've done the checking and replacement for bhaumakara (which seems to be just a single file) and I'll start on the telugu, which I guess we can't expect Jens to solve, so I'll take a look and try to sort it out myself. For the others, I'll post here some instructions for how I would do this and ask the relevant people to do it themselves, with my help if they need it.

danbalogh commented 2 months ago

One way to check for the suspect characters in your texts is as follows. @michaelnmmeyer may be able to suggest a better one, but this seems to work.

Do a fresh git pull of your repository.
In Oxygen, open one (any) of the XML files in the folder that contains your editions.
Press shift+ctrl+h (probably cmd+ctrl+h on a Mac, but if that doesn't work, experiment or use "Find/Replace in Files" in the Find menu) to open the dialogue box for multifile search.
Fill this as follows:

under "Text to find", copy and paste: [ṃṛṝḷḹ]
Tick the box for Regular expression and untick all other boxes
under "Restrict to XPath", copy and paste: //text()[not(ancestor::quote or ancestor::q)]

Among the radio buttons under Scope, select "Current file directory". (If your files are in more than one folder, you'll have to repeat this for each folder.)
Click "Find all"

This will give you a result list with all occurrences of any of these characters in any of your files, except those within a <q> or <quote> element and those in XML comments. Skim the result list to see if there are a few dozen results or hundreds. If there are hundreds, see which character is dominant. It will probably be ṛ if you have used it instead of r̥. Also, if ḷ for the retroflex glide is frequent in your texts, then there may be a lot of legitimate instances of ḷ. So you will want to repeat the above search in a more specific way.

To search for any specific character, such as ṛ, just put that under Text to find (you don't need the square brackets in this case and you can, but don't have to, untick Regular expression). If you are sure that ṛ is only used in your texts for the sonant and not for the NIA retroflex flap, then you can go ahead and batch replace it with r̥: put the latter under Replace with and click Replace all. Oxygen will offer you a preview option so you can look at the results in a popup window before approving the batch replace, which cannot be undone. Checking the preview carefully is a good idea for two reasons. One is to make sure that you didn't mistype something in the search or replace box, so check that the expected character is being replaced with the correct one. The other is to check for instances where you may have, for example, copied a previous editor's ambiguous transliteration in an apparatus reading, which you want to preserve exactly as is, and not replace with the DHARMA system. Instead of using Replace all, you can also click Find all, then double-click any item in the result list, upon which the editor will open the relevant file and take you to the locus so you can correct it if needed.

If you have not used ṛ for the sonant, or you have already replaced it with r̥, then make your decision about ḷ. If you are sure that all instances of ḷ in your files are used for the sonant, then you can batch replace them with l̥. If you are sure that all instances are legitimate (i.e. they mean the retroflex glide), then you can leave them as they are. But if there may be both legitimate (retroflex glide) and illegitimate (sonant) instances, then you must unfortunately search for ḷ in all your files and review the result list manually.

Once ṛ and ḷ are both in order, you are left with just the characters ṃ, ṝ and ḹ. You can batch replace these with ṁ, r̥̄ and l̥̄ respectively, but it's probably a good idea to ask for a preview and at least skim the results before approving the change.
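
For those who prefer a script to the Oxygen dialogue, the same search (suspect characters in text nodes that have no <q> or <quote> ancestor) can be approximated in Python. This is a rough sketch under some assumptions: it uses only the standard library, matches elements by local name so that the TEI namespace does not matter, and `find_suspects` is a name invented for this example. It only reports; it does not change any file.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

SUSPECT = re.compile("[\u1e43\u1e5b\u1e5d\u1e37\u1e39]")  # [ṃṛṝḷḹ]
SKIP = {"q", "quote"}


def local_name(tag: str) -> str:
    """Strip a '{namespace}' prefix from an element tag, if present."""
    return tag.rsplit("}", 1)[-1]


def find_suspects(elem, inside_quote=False):
    """Yield (parent element name, character) for each suspect character
    found in text outside <q>/<quote>. XML comments are not visited,
    because the default parser drops them."""
    inside = inside_quote or local_name(elem.tag) in SKIP
    name = local_name(elem.tag)
    if not inside and elem.text:
        for m in SUSPECT.finditer(unicodedata.normalize("NFC", elem.text)):
            yield name, m.group()
    for child in elem:
        yield from find_suspects(child, inside)
        # Text following a child belongs to `elem`, not to the child.
        if not inside and child.tail:
            for m in SUSPECT.finditer(unicodedata.normalize("NFC", child.tail)):
                yield name, m.group()
```

To run it on a file, parse it with `ET.parse(path).getroot()` and pass the root element to `find_suspects`; on a string, use `ET.fromstring(...)`.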

danbalogh commented 2 months ago

The Telugu is done, except for one case where ṛ may be a typo for ṟ or an exact reproduction of a previous editor's reading, but certainly not a substitute for r̥ (I've left an XML comment there), and a couple of instances where Jens reproduces a published translation and does so exactly, with the same transliteration system used in the published translation. Our guidelines say that transliteration should be silently normalised when reproducing complete translations, but I don't have the capacity to check through all the translations in Jens's files and do this, and changing just ṛ to r̥ would only make them inconsistent, so I did nothing to those.

erc-dharma / project-documentation

compliance with TG #251