chchch / upama

A PHP library for comparing two or more Sanskrit TEI XML files and generating an apparatus with variants
GNU General Public License v2.0

handling odd glyphs #10

Closed wujastyk closed 2 years ago

wujastyk commented 3 years ago

Hi, Charles. I'm working out how to represent some non-standard Newa glyphs. Specifically, "Siddhi" and "Newa gap filler", U+1144A and U+1144E. See the section "Unicode representation" at the end of this blog post.

In my TEI header, I have,

<encodingDesc>
            <charDecl>
                <glyph xml:id="thaayjaayekaa">
                    <localProp name="thaayjaayekaa" value="NEWA GAP FILLER"/>
                    <mapping type="standard">&#x1144E;</mapping>
                    <mapping type="PUA">U+1144E </mapping>
                    <note><ref target="https://unicode.org/charts/PDF/U11400.pdf"></ref></note>
                </glyph>
            </charDecl>
            <charDecl>
                <glyph xml:id="siddhi">
                    <localProp name="siddhi" value="SIDDHI"/>
                    <mapping type="standard">&#x1144A;</mapping>
                    <mapping type="PUA">U+1144A </mapping>
                    <note><ref target="https://unicode.org/charts/PDF/U11400.pdf"></ref></note>
                </glyph>
            </charDecl>
        </encodingDesc>

So, in the text of the MS transcription, if I insert the character simply like this: &#x1144E; then it shows up correctly in the display at Saktumiva.

The problem is that when I collate that MS, the whole line on which the character appears gets ignored or treated as a missing line (with an "omit"). I can't find a way round this, either by using <g> or the character itself (not the &#x1144E; form).

Is it because I'm inserting a Newa character into an IAST file? That seems the wrong thing to do anyway, but how do I manage this?

While I'm failing to get glyph working, I'm just doing this, which at least gives some sensible output

<choice><abbr type="contraction">1</abbr><expan><ex>siddhir astu</ex></expan></choice>

chchch commented 3 years ago

There shouldn't be a problem with inserting Unicode characters from the Newa block. What about copy-and-pasting the character from here: https://en.wikipedia.org/wiki/Newa_(Unicode_block) ?

That was an interesting blog post (as well as the one by Birgit Kellner). I think maybe you should use a different mapping for the filler character that you describe. I looked recently at a fragment of the Aṣṭasāhasrikā which seems to contain older versions of both the filler character you describe, and the (possibly more recent) filler character described in the Newa Unicode block:

https://tst-project.github.io/mss/Sanscrit_1438.xml

See folio 1v. I've provisionally transcribed the "older" filler character with ? and the "newer" filler character with -.
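On the point that Newa-block characters are ordinary Unicode: here is a minimal Python sketch (standard library only, not part of upama, and the `<l>` wrapper is just an illustrative element) showing that the numeric character reference `&#x1144E;` and the literal character parse to the same string.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same line, once with a numeric character reference and once with the literal character.
with_reference = ET.fromstring("<l>a&#x1144E;b</l>").text
with_literal = ET.fromstring("<l>a\U0001144Eb</l>").text

same = with_reference == with_literal          # both parse to the same three code points
name = unicodedata.name(with_reference[1])     # Unicode name of the middle character
utf8_length = len(with_reference[1].encode("utf-8"))
print(same, name, utf8_length)
```

The character lies outside the Basic Multilingual Plane, so it takes four bytes in UTF-8. That may matter for tools that work on bytes or UTF-16 code units rather than code points.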

chchch commented 3 years ago

Ok, so I came up with this solution:

tl;dr — you can use <g ref="#newa-gap-filler"/> and <g ref="#newa-old-gap-filler"/> to get the two different kinds of gap filler characters. sample: image

more details:

wujastyk commented 3 years ago

A thing of beauty:

image

wujastyk commented 3 years ago

Can we have newa-placeholder-mark too?

image

And "siddham" or "siddhi": Bhattarai pp. 102-104

And puṣpikā (Bhaṭṭarai passim), and space fillers (Bhattarai, pp. 222, 225, 228, et passim)

chchch commented 3 years ago

Hmm, so now I'm thinking maybe we should refine our approach so that we can talk about similar signs across scripts. For example, maybe we should do something like <g type="line-filler" rendition="#newa-gap-filler"/> in case we, hypothetically, want to collate these signs across different manuscript cultures in the future. This is especially because the palaeographic situation we're working with, with all these varieties of Nepālākṣarā, (proto-)Bengali, etc., doesn't map neatly onto the Unicode blocks of "Newa" and "Bengali". For example, we have a variety of "siddham" signs in manuscripts, but in Unicode we have U+1144A NEWA SIDDHI and U+0980 BENGALI ANJI. But in our manuscripts, there really isn't this division between Newa and Bengali.

Maybe we can do something like:

<g type="siddham" rendition="#newa-siddhi"/> <g type="siddham" rendition="#beng-anji"/>

Or we could keep @ref, like so:

<g type="siddham" ref="#newa-siddhi"/> <g type="siddham" ref="#beng-anji"/>

Although, again, in these 11-15th c. North Indic manuscripts we're working with, it doesn't really make sense to differentiate between a "Newa Siddhi" and a "Bengali Anji". Anyway, what do you think?
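To make the function/appearance split concrete, here is a rough Python sketch. It assumes the @type plus @ref encoding proposed above. The helper `sign_key` and the two sample lines are hypothetical, and this is not how upama itself collates.

```python
import xml.etree.ElementTree as ET

def sign_key(g, by="type"):
    """Return the comparison key for a <g> element: its function (@type) or its appearance (@ref)."""
    return g.get(by, "")

# Two hypothetical witnesses that open with differently shaped "siddham" signs.
witness_a = ET.fromstring('<l><g type="siddham" ref="#newa-siddhi"/> namaḥ</l>')
witness_b = ET.fromstring('<l><g type="siddham" ref="#beng-anji"/> namaḥ</l>')

g_a, g_b = witness_a.find("g"), witness_b.find("g")
same_function = sign_key(g_a, "type") == sign_key(g_b, "type")
same_appearance = sign_key(g_a, "ref") == sign_key(g_b, "ref")
print(same_function, same_appearance)
```

Comparing on @type would treat the two signs as the same reading, while comparing on @ref would report a variant, so the choice can be deferred to collation time.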

wujastyk commented 3 years ago

Yes, I think you're right. That will generalize the approach. I also wonder about rendering in different scripts, the old "unicode isn't glyphs" issue. While I love the Newa fillers in the middle of IAST text, it probably isn't quite the right approach. Each writing system should have its own glyph for the same underlying Unicode code-point. And you already define many of these for Roman transliteration in your Saktumiva Transcription conventions https://saktumiva.org/wiki/transcription; I'd like to go on using those and maybe expand the list just slightly. I like your @type @rendition idea. When there's an IAST convention, we could say

a¦rthaḥ

Or am I muddling things? I think I'm muddling things. Why say "newa" in a line of IAST?

chchch commented 3 years ago

I think it's actually the situation itself which is quite muddy. As you wrote in your blog post, it's important to think about the function of a character, and it seems like, for the moment, what we're calling newa-gap-filler and newa-old-gap-filler have the same function, but it's hard to be sure unless we do more serious palaeography. So it makes sense to preserve both the function of the character (as we understand it currently) as well as its appearance (because they are so different).

Incidentally, I think the "broken daṇḍa" line filler ( ¦ ) actually has a different function from the "newa-gap-filler". The "broken daṇḍa" (and other similar-looking signs) is generally used at the end of a line or before a stringhole space. But the "newa-gap-filler" is never used this way, although as Birgit mentioned, there's no consensus about what it means. So I think we should have two different conventions: one for an "(end-of-)line filler", and another for a "gap filler", equivalent to, for example, Devanāgarī U+A8F9 ( ꣹ ).

Here are some options on how we might represent this:

chchch commented 3 years ago

Just as a quick follow-up regarding the "broken daṇḍa" line filler — I searched through all of my Dravyasamuddeśa transcriptions, and the "broken daṇḍa" has been used exclusively at the end of a line. It's quite nice to be able to do these searches!
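A search of this kind can be sketched in a few lines of Python. This is only an illustration under assumptions: it treats a transcription as a plain string with `<lb/>` marking line beginnings, and the sample text and the function name `count_broken_dandas` are made up.

```python
import re

# ¦ followed (optionally after whitespace) by a line-beginning element or the end of the text.
LINE_FINAL = re.compile(r"¦(?=\s*(?:<lb\b[^>]*/>|$))")

def count_broken_dandas(text):
    """Return (total, line_final) counts of the broken daṇḍa ¦ in a transcription string."""
    return text.count("¦"), len(LINE_FINAL.findall(text))

sample = 'dravyaṃ ca ¦<lb n="2"/>bhāvaś ca ¦<lb n="3"/>iti'
total, line_final = count_broken_dandas(sample)
print(total, line_final)
```

If the two counts differ for a file, at least one ¦ sits somewhere other than the end of a line and is worth a look.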

wujastyk commented 3 years ago

Your suggestion,

~ for the "normal" gap-filler as defined in the Newa Unicode block, and ~ for the "old-style" one (or whatever transliteration we decide upon)

Seems as good as any to me. Could you fix it so that the ~ doesn't display in Saktumiva?

Do we want the newa-old-gap-filler to appear in the collation apparatus?

chchch commented 3 years ago

Just waiting for @ppasedach to chime in! He's working on similar issues: https://github.com/ppasedach/ratnakara-tei/issues/19

ppasedach commented 3 years ago

Sorry for the late response to this.

One very important manuscript for the Haravijaya is Jaisalmer 408, Jaina Devanagari, 12th century, serving also as an example in Bidur Bhattarai's book (HVM).

For the longer spaces at the beginnings and ends of lines, this manuscript distinguishes between the left and right side: https://github.com/ppasedach/ratnakara-tei/issues/19 . They are the first and second ones there. So far we transcribe both with ꣹. They would probably be good candidates for <space>, as they actually fill up a bigger space.

There are two fillers for very short gaps, as before a string hole or at the end of a line written up to its end. I do not understand the difference between the two. Einecke only has the daṇḍa with the stroke on the bottom right side, another one is like a narrow U. I've seen corrections from one to the other. The daṇḍa with the stroke we transcribe as ¦, for the other one we still use a temporary placeholder. Both are probably good candidates for <g>, as they don't take much space.

There are further signs, used for marking areas deemed not suitable for writing etc.

ppasedach commented 3 years ago

Selection_177 The avagraha-like sign is marking an area not suitable for writing. The text is complete here. Probably one wants <space> here. I am not sure if one wants anything to graphically represent the space-fillers here, or just specify the dimension of the space left free.

ppasedach commented 3 years ago

Another manuscript, 19th century Devanāgarī, from RORI, Jodhpur, has many word-boundary markers.

Selection_178

They could be graphically represented in Devanāgarī as a daṇḍa and daṇḍa-avagraha combination (where a word starts with the inherent a attached to the last consonant cluster of the preceding word) under the line, maybe even simply subscript. But I'm not sure how to best express it in TEI, with <g>, <metamark>, or some other way.

wujastyk commented 3 years ago

So what was the siddhānta on this topic of space-fillers above? <g> or <space>? Or are both options going to work in Saktumiva? If it's a choice, I tend to <g>.

chchch commented 3 years ago

So <g> with @ref is already implemented at the moment... I guess we can stick with that until something comes up!

For NEWA SIDDHI and BENGALI ANJI, maybe we can just use the corresponding Unicode characters for now, since they don't seem to have equivalents in other scripts.

wujastyk commented 3 years ago

Sounds good to me. Thanks!

NB @chakrabortydeepro

ankleb commented 3 years ago

sorry, I somehow cannot figure out the final solutions to some questions discussed here:

  1. what do we do about the broken daṇḍa-s? just type ¦ ? or do we do smth like ?

  2. what about the word-boundary markers that @ppasedach mentioned? Something quite similar occurs in our NAK 5-333 all the time: Screenshot 2021-11-09 at 03 09 22 or Screenshot 2021-11-09 at 03 14 00.

  3. Also, there is a similar sign (looks more or less like a comma to me) that is super common in NAK 5-333. Have no statistics at hand, but I guess it's used most of the time to mark off pāda-s comma sep pada

ankleb commented 3 years ago

from the point of how the signs look, 2 (word-boundary) and 3 (pāda etc. markers) look more or less identical to me

ankleb commented 3 years ago

sorry... forgot the syntax in my question about broken daṇḍas. the question was if we type ¦ or smth like <g type="gap-filler" rendition="#newa-gap-filler"/> after all?

wujastyk commented 3 years ago

Dear @ankleb, For the filler that we've been discussing, please use <g ref="#newa-gap-filler"/> and <g ref="#newa-old-gap-filler"/> as described by Charles in the July 21 comment above.

wujastyk commented 3 years ago

Question 1 in the comment above:

The use of the broken daṇḍa ¦ is documented at Saktumiva. It's an end-of-line filler. A bit like the use of a hyphen in Latin-script text. Bhaṭṭarai (p.9) calls it "line-filler sign or hyphen sign used before the string-holes or at the end of the line on the folio"

wujastyk commented 3 years ago

Question 2 in the comment above:

If I understand your question, use <gap/>. You can add information thus:

<gap extent="1" unit="char" reason="insertion in the line above"/>

Thus, for the first folio you show,

sapta<gap extent="5" unit="char"/>parṇṇāni
pad<gap extent="5" unit="char"/>dma

But in these specific cases, we wouldn't actually use <gap/>. In the Suśruta project, we are not recording the size of the gaps around string holes because we can't think of a reason they would be interesting. We're just coding the string holes as "column breaks" using <cb/> because it's quite useful to know that for finding your way around a page (along with line begin <lb/> and page begin <pb/>). We say <cb n="1"/>, to mean the first string hole, etc.

wujastyk commented 3 years ago

Question 3 in the comment above:

I'm not sure. It's a question of thinking about its meaning and then looking in the TEI Guidelines for the most appropriate tag. I suggest, at present, just "punctuation" <pc> with some explanation. So you could say,

<pc function="pāda divider">,</pc>

Later, if we learn more, we can update or change the tag as needed, as long as you are reasonably consistent so that search-and-replace works

chchch commented 3 years ago

For end-of-line daṇḍa, yeah, I think we should use the broken bar character for now, which is also what we used in the Cambridge project. I'm not sure if we want to distinguish between a broken daṇḍa and a daṇḍa with a slash through it.

For pāda dividers — yeah, I talked with Peter about it, I couldn't decide between something like:

<note place="above">|</note>

or

<add place="above">|</add>

I guess it depends how you want it to be treated in the end. Like, do you want those daṇḍas to be grouped with the other things you've tagged as <note>, or the other things you've tagged as <add>? For example, when you make a collation, you can decide whether to keep all the <note> tags or ignore them all; ditto with the <add> tags.
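The keep-or-ignore choice can be pictured with a small Python sketch. It is only an illustration of the idea, not upama's implementation, and the sample line and the function name `flatten` are hypothetical.

```python
import xml.etree.ElementTree as ET

def flatten(elem, ignore):
    """Return the text of `elem`, dropping the content of any child whose tag is in `ignore`."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag not in ignore:
            parts.append(flatten(child, ignore))
        # Text after the child's closing tag belongs to the parent, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)

line = ET.fromstring(
    '<l>dharmaḥ<add place="above">|</add> arthaḥ<note place="above">|</note> kāmaḥ</l>'
)
keep_all = flatten(line, ignore=set())
without_notes = flatten(line, ignore={"note"})
without_both = flatten(line, ignore={"note", "add"})
print(keep_all, without_notes, without_both, sep="\n")
```

Whichever tag the dividers are given, they then follow the fate of everything else under that tag when a collation is made.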

ankleb commented 3 years ago

Thanks a lot for clearing up the broken danda issue.

As for the divider marks. I guess the ones that Peter showed in his post and that I have under nr. 2 are indeed some kind of secondary additions (in Peter's case more secondary than in ours I think, but still), so either <note ..> or <add...> should be fine.

However, I think the ones that I have in my question 3 are somewhere between gap fillers and punctuation signs, perhaps, closer to punctuation. I've been transcribing them as , (comma), but I was wondering if there was anything more systematic/ unified?

For example, Graheli in his Nyāyamañjarī-edition uses some kind of half-daṇḍa sign (or whatever it is) to render any kind of punctuation marks other than the daṇḍas. But in his case, I don't think he tries to represent a specific sign in any of his MSS.

Screenshot 2021-11-09 at 21 54 56

Silk (on p. ix) gives a whole set of different punctuation signs that he finds in his MSS and explains how he represents them in his edition:

Screenshot 2021-11-09 at 21 50 36

ankleb commented 3 years ago

kṣamāṃ yāce!!!

so... are we transcribing these guys as ¦ or <space/> or smth else?

Screenshot 2021-11-10 at 01 01 55

chchch commented 3 years ago

I think that broken daṇḍas and daṇḍas with a slash both serve the same function, right? They generally (almost always) appear at the end of a line? We could either transliterate all the "end-of-line" daṇḍas as the broken bar or do something like:

<g rend="daṇḍa with slash">¦</g>

In this case we're kind of indicating that they serve the same function, but there are variations in how they're stylized, I guess.

ankleb commented 3 years ago

Thanks @chchch! The solution you propose seems to give rather exhaustive info about the sign. I think I'll be voting for that when our group has the next round of elections.

chchch commented 3 years ago

Since people seem to be incredibly passionate about daṇḍas, I've modified the Devanāgarī font to include variants for the broken daṇḍa and the daṇḍa with a slash next to it. See here:

https://chchch.github.io/PedanticIndic/

I've also updated the page on transcription conventions with all the <g>s that we have so far:

https://saktumiva.org/wiki/transcription

Maybe one day we should come up with some "canonical" names for these signs, and then we can just search-and-replace them all.

chakrabortydeepro commented 1 year ago

@chchch Hello Charles,

Could you please add two more Sharada characters?

sharada sign siddham [U+111DB] "𑇛" sharada section mark-1 [U+111DE] "𑇞"

I am rendering them as follows:

<g ref="#sharada-sign-siddham"/> <g ref="#sharada-section-mark-1"/>

I added them to the Special Characters list in the Transcription conventions of Saktumiva.

chchch commented 1 year ago

Hmm, we currently have sarada-ekam and sarada-siddhi, which I guess was to be consistent with newa-siddhi... should we just change everything to be consistent with Unicode block names?

wujastyk commented 1 year ago

I'd recommend going with the Unicode block names. I can see this process of adding required chars continuing, so it would be best not to make up lots of private names.
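For what it's worth, names of this kind can be derived mechanically from the Unicode character names, and older references can then be rewritten in bulk. The following Python sketch assumes references of the form ref="#name". The helper names are made up, and the pairing of sarada-siddhi with U+111DB is an assumption for illustration.

```python
import re
import unicodedata

def id_from_unicode_name(char):
    """Build an id-style name from a character's Unicode name, e.g. 'newa-gap-filler'."""
    return unicodedata.name(char).lower().replace(" ", "-")

def rename_refs(xml_text, renames):
    """Rewrite ref="#old" to ref="#new" for every old name listed in `renames`."""
    def swap(match):
        old = match.group(1)
        return 'ref="#%s"' % renames.get(old, old)
    return re.sub(r'ref="#([^"]+)"', swap, xml_text)

# Hypothetical pairing of an older private name with a Unicode-derived one.
renames = {"sarada-siddhi": id_from_unicode_name("\U000111DB")}
updated = rename_refs('<g ref="#sarada-siddhi"/> <g ref="#newa-gap-filler"/>', renames)
print(updated)
```

Names that are not in the mapping pass through unchanged, so the rewrite can be run repeatedly as the list grows.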

chchch commented 1 year ago

Ok, I added:

  • sharada-ekam
  • sharada-sign-siddham
  • sharada-section-mark-1

and in the file definitions.xsl, I marked as deprecated:

  • sarada-ekam
  • sarada-siddhi

chakrabortydeepro commented 1 year ago

Thank you very much!

chakrabortydeepro commented 1 year ago

@chchch Hello Charles, could you also add sharada-section-mark-2 and sharada-continuation-sign? I added them in Transcription Conventions

chchch commented 1 year ago

done!

chakrabortydeepro commented 1 year ago

Thank you very much!
