ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.

vidyut-lipi needs to handle the colon separator in ISO-15919 #103

Status: Open. deepestblue opened this issue 6 months ago

deepestblue commented 6 months ago

One of the many corner cases ISO-15919 supports is using a colon (:) to disambiguate Latin letter clusters. Here are a couple of examples that need support in vidyut-lipi.

./lipi -f devanagari -t iso15919 "अर्शइत्यादयः"
Expected: arśa:ityādayaḥ
Actual: arśaityādayaḥ

./lipi -t devanagari -f iso15919 "arśa:ityādayaḥ"
Expected: अर्शइत्यादयः
Actual: अर्श:इत्यादयः

./lipi -f devanagari -t iso15919 "वाग्हरि"
Expected: vāg:hari
Actual: vāghari

./lipi -t devanagari -f iso15919 "vāg:hari"
Expected: वाग्हरि
Actual: वाग्:हरि

akprasad commented 6 months ago

Thanks. It seems that Aksharamukha supports this, and it also accepts a plain : when it could not be a disambiguator in the current context. indic_transliteration has no support for this, which is not surprising: it is based on a port of Sanscript, and Sanscript's core algorithm is quite simple.

deepestblue commented 6 months ago

SaulabhyaJS also supports it, if you want to take a look: see https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js and search for "separator".

akprasad commented 5 months ago

@deepestblue requesting review of this basic spec:

deepestblue commented 5 months ago

Items 1 and 3 sound right to me. On item 2: Sanskrit doesn't traditionally use Latin punctuation, and even in modern Sanskrit people generally use only the comma, the question mark, and the exclamation mark (I guess because the colon is rare even in English). So instead I'd propose erroring out if Latin input contains a colon outside of these specified contexts.

akprasad commented 5 months ago

Thanks, will proceed.

On erroring out: I'm undecided on the right error-handling policy for this library, since I expect that a lot of library input will be noisy in various ways (mixed content, large content that hasn't been proofread, etc.).

I am considering returning a Result struct in this format, which should be readable to you even though it uses some Rust constructs:

struct Result {
  text: String,
  errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
  // Byte offsets in the input string. `usize` = platform-specific unsigned int, e.g. u64
  start: usize,
  end: usize,
  error: ...
}

Edit: to be specific, I like that this struct returns a best-effort output while also annotating problematic regions of the input text.
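
To make that concrete, here's a rough sketch (not working library code) of how a caller might use such a result, treating the still-undecided error field as opaque:

// Assumes the `Result` / `ErrorSpan` sketch above.
fn report(input: &str, result: &Result) {
    // The best-effort output is always available...
    println!("output: {}", result.text);
    // ...and each problematic region of the input can be recovered
    // by slicing with the span's byte offsets.
    for span in &result.errors {
        println!("problem at bytes {}..{}: {:?}",
                 span.start, span.end, &input[span.start..span.end]);
    }
}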

deepestblue commented 5 months ago

Hmm, my 2 cents is that I'd expect a transliterator like this to be very conservative on input-handling; otherwise round-tripping gets messy, behaviour becomes fuzzy, etc.

I'd propose that un-proofread content isn't a valid scenario.

As for mixed content, my thought is that the content could be marked up appropriately before this library is invoked. For example, in HTML the markup can carry the lang attribute, and the JS that invokes vidyut would call it only on the appropriately marked-up nodes.

akprasad commented 5 months ago

Thanks for your 2c! I agree that conservatism is important and that it's important to flag errors clearly rather than muddling along and producing garbage (or worse, clean data with a few hidden dollops of garbage). Ideally, any transliterator output is lossless across a round trip.

At the same time, I also want to balance this principle with ergonomics. For example, I've encountered scenarios like the following either personally or through friends:

As a user, I prefer that a transliterator return some useful result, especially if I want to spend at most a few seconds on the task. This is why I'm drawn to the approach I sketched above.

I think your mixed content approach will work well for structured documents like HTML, but if (for example) I'm copying a paragraph from a PDF, that structure won't be easily available.

Other potential approaches:

shreevatsa commented 5 months ago

(Responding also to https://github.com/ambuda-org/vidyut/pull/33#issuecomment-1907294132)

I suggest having options for what the transliterator should do with unexpected text. (This is one of the things I'd hope for from a Rust transliterator…) Like {PASS_THROUGH, BEST_EFFORT, ERROR}, say. And/or correspondingly the result from the transliterator can be a sequence of chunks, each of them saying whether it's a "proper" transliterated result, or just a best-guess muddling through, or what.
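
For concreteness, something like this (names purely hypothetical, not an existing vidyut-lipi API):

// How the transliterator should treat text it doesn't understand.
enum UnknownTextPolicy {
    // Copy unknown spans to the output unchanged.
    PassThrough,
    // Guess a best-effort conversion and keep going.
    BestEffort,
    // Stop and report an error at the first unknown span.
    Error,
}

struct Options {
    unknown_text: UnknownTextPolicy,
    // ...other knobs (digit handling, etc.) could live here too
}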

shreevatsa commented 5 months ago

Possible examples of the options I mean:

Even if we expect very few people to use the transliterator "core" function directly, it would be a way of writing down explicitly all the choices that have been made in the convenience wrapper.

shreevatsa commented 5 months ago

Ha, I missed that this discussion was about treating colon as a separator, which is relevant to two of my examples above :)

Also more concretely responding to comment https://github.com/ambuda-org/vidyut/issues/103#issuecomment-1909404083 above, rather than

struct Result {
  text: String,
  errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
  // Byte offsets in the input string. `usize` = platform-specific unsigned int, e.g. u64
  start: usize,
  end: usize,
  error: ...
}

where the consumer has to manually match up the best-effort text with byte offsets, one of the things I'm proposing is something like (may not be working code, treat as pseudocode):

// result: Vec<ResultChunk>

struct ResultChunk {
    text: String,
    kind: ResultKind,
}

enum ResultKind {
    Fine(String), // perfectly fine and unproblematic input for the source and destination scripts: well-understood and will round-trip cleanly
    UnknownPassedThrough(String), // emoji, punctuation, etc: not part of the source and destination scripts, but just passed through
    LikelyInputErrorSilentlyCorrected(String), // e.g. "s" in Devanagari corrected to avagraha
    Separator, // goes with empty text, for input like कइ क्ह to avoid कै ख
    Numeric(String, String), // e.g. ('1234', '१२३४'), so that the user can choose whether to transliterate digits or not.
    UnrepresentableClosestMatch(String), // turning some of the different Tamil `L`s into ल and/or ळ
    Dropped(String), // Accents and chandrabindu or whatever that we know what they are but don't know how to represent in the target script
   // ...
}

or whatever, and the default convenience wrapper would just concatenate all the result chunks' text while the "serious" user could assemble their own different result by looking into the ResultKinds.
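
Something like this for the wrapper (assuming the ResultChunk sketch above):

// The naive wrapper just glues the chunk texts back together; a stricter
// caller can instead match on each chunk's `kind` and decide what to keep,
// warn about, or reject.
fn to_text(chunks: &[ResultChunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}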

(Having these in the result may be even better than having to pre-specify some options e.g. whether to transliterate digits or not. A higher-level UI could say: “I transliterated your text for you, but note the following that I couldn't do anything with, or which you may want to change in your input…”)

(Doing all this may make it slower, but despite the temptation of "it's in Rust, it must be fast", I believe hardly any applications are bottlenecked by transliteration speed in practice; the appeal of Rust here, for me, is more that the types can represent all of this.)

shreevatsa commented 5 months ago

Transliterating from a script to itself (Devanagari to Devanagari, or IAST to IAST) would then be a way of finding all problematic stuff in it :-)

Anyway, I'll stop the flood of comments here; I'm aware that what I'm proposing is likely overengineering :-) The broader point is just a desire for a really conservative/pedantic/lossless transliterator that will never silently corrupt text, no matter what the input is or how many rounds of transliteration, in whatever directions, it undergoes through the library.

akprasad commented 5 months ago

Thank you for the wonderful discussion!

I think error handling is a large enough topic that it deserves its own issue, so I've created #105. Let's continue there so that this issue can remain focused on ISO 15919.

deepestblue commented 4 months ago

A couple of somewhat related issues:

aū should transliterate to अऊ

agḥ should transliterate to अग्ः (I'm not sure there's a use-case for this specific example)

akprasad commented 4 months ago

@deepestblue Thanks for the bug report! I was hoping to transliterate in as few input passes as possible, but I guess a basic convert-to-NFC pass is worth it to avoid headaches elsewhere.

(Edit: fixed by calling to_nfc first.)
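
For the record, a minimal sketch of the underlying problem, assuming the unicode-normalization crate (whether that exact crate is used here is incidental): a decomposed vowel won't match a mapping keyed on the precomposed character until the input is NFC-normalized.

use unicode_normalization::UnicodeNormalization;

fn main() {
    // "a" + "u" + combining macron (U+0304): three code points as typed.
    let decomposed = "au\u{0304}";
    // NFC folds u + U+0304 into the single precomposed "ū" (U+016B),
    // so mapping lookups keyed on the precomposed character succeed.
    let composed: String = decomposed.nfc().collect();
    assert_eq!(composed, "a\u{016B}");
    assert_eq!(composed.chars().count(), 2);
}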

akprasad commented 4 months ago

Returning to the main issue (mainly taking notes for myself) --

I tried to hack around this behavior by enumerating all the cases specifically and adding them to the token map. The blocker there was how to support a:i, since a is an implicit vowel on the preceding consonant. We could explicitly store all the mappings कइ, खइ, etc. to get around this, but that feels gross and unprincipled.

Stepping back, the core logic seems to be something like:

if from.is_abugida() && to.is_alphabet() && to.has_separator() {
  if prev.is_consonant() && cur.is_independent_vowel() {
    // for a:i, a:u
    output += separator;
  } else if TODO {
    // for k:h, g:h, etc.
    output += separator;
  }
}

Maybe we can combine these by hard-coding k:h etc. then using custom code for the vowel-based separator.
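
A rough sketch of that combined check, operating on already-converted Latin fragments (illustrative only, not the actual implementation; the real code would also need to confirm that the following vowel was an independent vowel on the Devanagari side):

fn needs_separator(prev_latin: &str, next_latin: &str) -> bool {
    // Vowel case: "a" followed by an independent "i"/"u" would otherwise be
    // misread as the diphthongs "ai"/"au" (arśa + iti -> arśa:ityādayaḥ).
    let vowel_clash = prev_latin.ends_with('a')
        && (next_latin.starts_with('i') || next_latin.starts_with('u'));

    // Consonant case: a bare consonant followed by "h" would otherwise be
    // misread as an aspirate digraph (vāg + hari -> vāg:hari).
    const AMBIGUOUS_BEFORE_H: &[char] =
        &['k', 'g', 'c', 'j', 'ṭ', 'ḍ', 't', 'd', 'p', 'b', 'ḷ'];
    let consonant_clash = next_latin.starts_with('h')
        && prev_latin
            .chars()
            .last()
            .map_or(false, |c| AMBIGUOUS_BEFORE_H.contains(&c));

    vowel_clash || consonant_clash
}

Against the test cases below, this flags a:i, ka:u, k:ha, ḷ:ha, etc., and leaves a:, a:A, and k:ta alone.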

Tentative test cases:

// positive
"a:i a:u"
"ka:i ka:u"
"k:ha g:ha c:ha j:ha ṭ:ha ḍ:ha t:ha d:ha p:ha b:ha"
"ḷ:ha"

// negative -- colon should be ignored
"a:"
"ka:"
"k:"
"a:A"
"k:ta"

deepestblue commented 4 months ago

Yep, this seems similar to the code in saulabhyaJS near https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js#L352