ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.

Joining Visarga and Svara #88

Closed skmnktl closed 4 months ago

skmnktl commented 6 months ago

इ॒षे त्वो॒,-र्जे त्वा॑, वा॒यव॑ः स्थोपा॒यव॑ः स्थ, दे॒व व सवत प्रार्प॑यतु॒ श्रेष्ठ॑तमाय॒ कर्म॑ण॒, आ प्या॑यदध्वमघ्निया देवभा॒ग-मूर्ज॑स्वती॒ः पय॑स्वतीः प्र॒जवत-रनमी॒वा अ॑य॒क्ष्मा मा व॑ः स्ते॒न ई॑शत॒ माऽघश(ग्म्)॑सो, रु॒द्रस्य॑ हे॒तिः परि॑ वो वृणक्तु, ध्रवा अ॒समिन् गोप॑तौ स्यात ब॒ह्वीर्, यज॑मानस्य प॒शून् पा॑हि ॥ १ (इ॒षे - त्रिच॑त्वारि(ग्म्)शत् )

The accents interfere with how the visargas join. If I remember right, we need to make sure all the accents precede the visarga (fixing the input if they don't) and then add a zero-width joiner (ZWJ, U+200D) between the svara and the visarga.

akprasad commented 6 months ago

Thanks! Can you give me a correct example as well? I tried this in JavaScript:

// a, svarita, ZWJ, visarga
const x = "\u0905\u0951\u200d\u0903";

And the result is अ॑‍ः which seems incorrect still.

akprasad commented 6 months ago

@skmnktl following up here

skmnktl commented 6 months ago

So "aqH" renders as "अ॒ः" on vidyut-lipi, but aksharamukha renders it as "अः॒". I'm not at my computer, but I thought I'd answer your question for now. I can decompose that into Unicode code points later.

skmnktl commented 6 months ago

Vidyut produces:

U+0905 : DEVANAGARI LETTER A
U+0952 : DEVANAGARI STRESS SIGN ANUDATTA {Vedic tone anudatta}
U+0903 : DEVANAGARI SIGN VISARGA

Aksharamukha does:

U+0905 : DEVANAGARI LETTER A
U+0903 : DEVANAGARI SIGN VISARGA
U+0952 : DEVANAGARI STRESS SIGN ANUDATTA {Vedic tone anudatta}

Seems the issue is with the order. Visarga combines with the accent but the accents only combine with vowels. That said, when doing indic, you'd need to reverse the order back though.
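For anyone who wants to reproduce the comparison above, here is a minimal sketch that dumps the code points of each output. The helper name `codePoints` is hypothetical and is not part of vidyut or Aksharamukha; the two strings are simply the sequences listed above.

```javascript
// Hypothetical helper (not a vidyut or Aksharamukha API):
// format each code point of a string as "U+XXXX".
function codePoints(s) {
  return Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
  );
}

// The two sequences reported above.
const vidyutOutput = "\u0905\u0952\u0903";       // a, anudatta, visarga
const aksharamukhaOutput = "\u0905\u0903\u0952"; // a, visarga, anudatta

console.log(codePoints(vidyutOutput).join(" "));
console.log(codePoints(aksharamukhaOutput).join(" "));
```

The two strings contain the same three code points and differ only in the order of the last two.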

akprasad commented 6 months ago

Thanks, this is helpful.

> That said, when doing indic, you'd need to reverse the order back though.

What do you mean by this? I understand "For Devanagari and other Indic scripts, accents should come after the vowel and before the visarga". What should be done for romanizations, if anything?

skmnktl commented 6 months ago

Actually I think it should be:

For indic scripts: vowel+visarga+accent
For roman scripts: vowel+accent+visarga

I just meant above that you'd have to invert the visarga and accent going from roman <=> indic and you'd keep the order the same when going roman => roman or indic => indic.
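The inversion described here could be sketched roughly as follows. This is an illustration under stated assumptions, not the fix that landed in vidyut: the function names are hypothetical, and the accent set is limited to the two marks mentioned in this thread (U+0951, which the snippet earlier in the thread calls svarita, and U+0952 anudatta) with the Devanagari visarga U+0903.

```javascript
// Assumption: only these two accent marks and the Devanagari visarga are handled.
// A real transliterator would need the full set of Vedic accent marks.

// vowel+accent+visarga  ->  vowel+visarga+accent (the order proposed for Indic scripts)
function accentAfterVisarga(s) {
  return s.replace(/([\u0951\u0952])\u0903/g, "\u0903$1");
}

// vowel+visarga+accent  ->  vowel+accent+visarga (the order proposed for Roman schemes,
// applied here to Devanagari code points purely to show the swap)
function accentBeforeVisarga(s) {
  return s.replace(/\u0903([\u0951\u0952])/g, "$1\u0903");
}

const accentFirst = "\u0905\u0952\u0903"; // a, anudatta, visarga
const visargaFirst = accentAfterVisarga(accentFirst);
console.log(visargaFirst === "\u0905\u0903\u0952");
console.log(accentBeforeVisarga(visargaFirst) === accentFirst);
```

Text without an accent next to a visarga passes through unchanged, so the same function can be applied when going indic => indic without affecting already-correct input.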

akprasad commented 6 months ago

ah, I see! Thanks, this is clear enough for me to start preparing a fix.

akprasad commented 4 months ago

I'm working on this now.

akprasad commented 4 months ago

This is fixed locally. Pushing soon.

akprasad commented 4 months ago

Pushed and deployed to the demo.