Decide on and implement an error-handling policy

akprasad commented 5 months ago

Moving this discussion from #103, with some synthesis of comments from @deepestblue and @shreevatsa --

Context

I think that vidyut-lipi can become a foundational library for the Sanskrit ecosystem with a useful life of multiple decades. I think so primarily because the Rust ecosystem enables a nice "write once, run anywhere" workflow where we can focus on a single high-quality implementation then bind that implementation elsewhere (Python, Flutter, WASM, ...) as needed.

Foundational libraries should focus on the needs of power users, who expect precision and control. Accordingly, vidyut-lipi's error handling strategy should expose and model cases that a power user might care about.

Such cases include, but are not limited to:

Lossy transliteration, where the output drops information present in the input. Example: Bengali ba, which can represent either Devanagari ba or va.
Mistaken input that is confused for something else. Examples: : instead of a visarga, S instead of an avagraha.
Malformed input that violates our spec. Examples: malformed Grantha numbers, input text that is not in a Unicode standard form (e.g. combining marks out of order).
Unknown input that cannot be transliterated at all. Examples: text in a different scheme, emojis, some punctuation.

Prior work

The transliterators I know of generally implement a "best effort" strategy and return a single output string. The very best transliterators, like Aksharamukha, do likewise but also expose a variety of scheme-specific options that let users control how transliteration should proceed.

Since prior work doesn't extensively model error conditions, a natural question is: is error handling worthwhile at all, or is it a pedantic distraction?

I think it's worthwhile in some form (e.g. for malformed Grantha numerals), and I imagine that prior transliterators avoid explicit handling both because of time constraints and because they evolved to suit the needs of specific applications. (@virtualvinodh curious on your thoughts here.)

Approaches

Our prior discussion in #103 surfaced a variety of approaches to error handling, including:

best-effort
failing outright on bad input
annotating spans of the input/output for their error status

As suggested by @shreevatsa on #33, I like having a two-tier approach:

a looser high-level API that just gets the user something to work with;
a stricter low-level API that gives the user precise control and insight into the output.
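
One possible shape for the two tiers, as a rough sketch only. The names `transliterate_strict`, `transliterate_loose`, `StrictError`, and `toy_map` are hypothetical and are not vidyut-lipi's API, and the per-character lookup table stands in for a real transliteration model. Here the strict tier reports the first character it cannot map, while the loose tier passes such characters through unchanged.

```rust
use std::collections::HashMap;

/// Hypothetical error type for the strict tier (an assumption, not the library's type).
#[derive(Debug, PartialEq)]
enum StrictError {
    /// The character at this byte offset has no entry in the mapping.
    Unknown { offset: usize, ch: char },
}

/// Strict tier: stops at the first character the mapping does not cover.
fn transliterate_strict(input: &str, map: &HashMap<char, &str>) -> Result<String, StrictError> {
    let mut out = String::new();
    for (offset, ch) in input.char_indices() {
        match map.get(&ch) {
            Some(s) => out.push_str(s),
            None => return Err(StrictError::Unknown { offset, ch }),
        }
    }
    Ok(out)
}

/// Loose tier: best effort, copying unmapped characters to the output as-is.
fn transliterate_loose(input: &str, map: &HashMap<char, &str>) -> String {
    let mut out = String::new();
    for ch in input.chars() {
        match map.get(&ch) {
            Some(s) => out.push_str(s),
            None => out.push(ch),
        }
    }
    out
}

/// Toy two-entry mapping used only for illustration.
fn toy_map() -> HashMap<char, &'static str> {
    HashMap::from([('a', "अ"), ('i', "इ")])
}
```

A caller who only wants something to work with would use the loose function; a caller who needs guarantees would use the strict one and decide what to do with the error.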

Prior discussion

See the comments in #103, especially the comments by @deepestblue and @shreevatsa.

I'd like to start by documenting the error cases that might appear, which will inform a specific error-handling strategy.

akprasad commented 5 months ago

Expanding on the error cases from above and from #103 --

Malformed input.
- Grantha does not use decimal notation, which means some Grantha spans might be malformed. Example: ௧௦.
Lossy transliteration. This is not quite an error condition, but it is a condition the user might want to be aware of and handle explicitly. Examples:
- Devanagari ब and व both map to Bengali ব.
- ITRANS RRi and R^i both map to ऋ.
- Devanagari रृ and र्ऋ both map to HK rR.
Unknown input.
- If transliterating mixed Tamil/Grantha content from Grantha, the user might wish to know which parts of the text are Tamil.
Mistaken input, e.g. confusing : for the visarga. I think vidyut-lipi should not support this case since I don't believe in the robustness principle.
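
The lossy cases above all have the same structure: two distinct source strings share one target string. A minimal sketch of how such collisions could be found in a mapping table follows. `find_collisions` and the flat list of pairs are assumptions made for illustration; they do not reflect how vidyut-lipi stores its mappings.

```rust
use std::collections::BTreeMap;

/// Groups source strings by target and keeps only the targets
/// that are reached from more than one source.
fn find_collisions<'a>(pairs: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut by_target: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(src, dst) in pairs {
        by_target.entry(dst).or_default().push(src);
    }
    // A target with a single source is reversible; drop it.
    by_target.retain(|_, srcs| srcs.len() > 1);
    by_target
}
```

Running this over the Devanagari-to-Bengali pairs (ब, ব), (व, ব), (क, ক) would report ব as the only target with more than one source.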

akprasad commented 4 months ago

Another error class:

Unsupported input
- Devanagari might use e.g. a punctuation mark that has no equivalent in the target script.

So, a tentative list of annotations:

enum Quality {
  // Exact text match, reversible with no loss of information (ignoring NFC/NFD).
  Exact,
  // Exact text match, reversible with no loss of phonetic information but may lose
  // byte information (e.g. ITRANS RRi vs. R^i)
  OneWay,
  // Loses phonetic information.
  Lossy,
  // Malformed or garbled input.
  Malformed,
  // Not found in the input mapping.
  Unknown,
  // Found in the input mapping but not in the output mapping.
  Unsupported,
}

I'm leaning more toward @shreevatsa's approach of returning a list of strings, as opposed to a single string and a list of spans. My reasoning:

Returning spans is workable if the transliteration model is simple, but vidyut-lipi (like Aksharamukha) now makes multiple passes to reshape the input and output text during transliteration.
Reshaping this text while also maintaining span offsets seems possible but very messy. That said, I haven't thought about it much.
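
To make the "list of strings" idea concrete, here is a rough sketch in which each output chunk carries its own annotation, so no offsets need to be maintained. `Chunk`, `annotate`, and `join` are hypothetical names, the `Quality` enum is trimmed to two of the tentative variants, and the single-pass character lookup is a simplification of the multi-pass model described above.

```rust
/// Two of the tentative `Quality` variants, repeated so the sketch is self-contained.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Quality {
    Exact,
    Unknown,
}

/// Hypothetical output unit: a piece of output text plus its annotation.
#[derive(Debug, PartialEq)]
struct Chunk {
    text: String,
    quality: Quality,
}

/// Maps each character through `lookup`, merging neighbors with the same quality.
fn annotate(input: &str, lookup: impl Fn(char) -> Option<&'static str>) -> Vec<Chunk> {
    let mut chunks: Vec<Chunk> = Vec::new();
    for ch in input.chars() {
        let (text, quality) = match lookup(ch) {
            Some(s) => (s.to_string(), Quality::Exact),
            // Unmapped characters are kept verbatim and flagged.
            None => (ch.to_string(), Quality::Unknown),
        };
        if let Some(last) = chunks.last_mut().filter(|c| c.quality == quality) {
            last.text.push_str(&text);
        } else {
            chunks.push(Chunk { text, quality });
        }
    }
    chunks
}

/// Best-effort view: concatenates the chunks and ignores the annotations.
fn join(chunks: &[Chunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}
```

With this shape, a looser API could simply call `join` on the result, while a stricter caller inspects each chunk's `quality`.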

shreevatsa commented 4 months ago

I looked over all the kinds of errors related to input encountered/reported from the Sanskrit metres web app, and most of them fall into the categories already mentioned above:

Unknown characters in input (possibly already supported): ळ, U+0901 DEVANAGARI SIGN CANDRABINDU, ॐ, ISO 15919 ē ō r̥ r̥̄ l̥ l̥̄ ṁ, UPADHMANIYA and Jihvamuliya (treat as visarga?), फ़ (and क़ख़ग़ज़फ़ड़ढ़)
Devanagari input that happens to have short e/o for long e/o, or : for visarga, or S for avagraha: common user error
ZWNJ/ZWSP in the input, Unicode NFC normalization (combining characters), or NFD (to be able to handle unknown composed characters like ḿ at least partially)
When certain characters in the input are not recognized, warn more prominently.
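
As an illustration of the ZWNJ/ZWSP case, a small standard-library-only sketch that locates these characters so a caller can strip or report them. `find_zero_width` is a hypothetical helper, not an existing function in any of the libraries mentioned. NFC/NFD normalization is not covered here, since the Rust standard library does not provide it.

```rust
/// Returns the byte offset and code point of every ZWSP (U+200B)
/// or ZWNJ (U+200C) found in `input`.
fn find_zero_width(input: &str) -> Vec<(usize, char)> {
    input
        .char_indices()
        .filter(|&(_, ch)| matches!(ch, '\u{200B}' | '\u{200C}'))
        .collect()
}
```

The offsets are byte offsets into the original string, which is what Rust string slicing expects.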

> I don't believe in the robustness principle

I don't think this is a matter of belief unfortunately :) I remember Mark Pilgrim used to say “Postel's Law has no exceptions” (now can only find this online). Among other things, more tolerant applications will eventually win more users. If vidyut-lipi is intended to be a foundational library for other applications to use, then consider that some of those applications may want to go the "guess what the user intends" route (and offer an "are you sure? / did you mean…?" message or correct automatically). So I think it would be useful for vidyut-lipi to mark those parts of the input (like : and S) as "suspicious" at least, so that the application can decide whether to silently correct, warn or educate the user, fail, or do something else.

akprasad commented 4 months ago

Thanks for the extra context and examples!

> I don't think this is a matter of belief unfortunately :)

If you'll permit me to digress :) --

I agree that a user-facing application should degrade gracefully and smooth over well-intentioned user guesses that don't conform to the spec. I think the application layer is the right place to handle this, not the library layer:

I think annotating this text at the more generic Unknown level is enough to let callers decide how to handle this text according to the transliteration policy they want to follow. To me, Suspicious is semantically almost the same as Unknown.
Best-effort work that guesses at the user's intent grows indefinitely: : for visarga, S for avagraha, then | for danda, aa or ii in Harvard-Kyoto, 'Ri^' in ITRANS, etc. There is no end to how smart a program can be at guessing user intent, which muddies the promise we can offer users as a core library. I would rather make a clear promise.
Over a long enough time horizon, the implementation becomes the spec for the wider ecosystem and complicates efforts to standardize. I don't want a core library to encourage long-term deviations from a clear standard.

My general view of the robustness principle is consistent with the essay here, particularly section 3.

shreevatsa commented 4 months ago

Yeah it wouldn't be good to do any sort of guesswork in the core library itself: as long as an application can achieve this using the library (either preprocessing the input, or postprocessing things like "Unknown" in the output, or passing to the library a list of characters that must be treated specially), it should be fine. Just saying it would be good to design the library so that such usage is possible.
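
A rough sketch of the preprocessing option, done entirely on the application side. `Fix` and `apply_lookalike_fixes` are hypothetical names and not part of vidyut-lipi; the lookalike table is supplied by the application, so any guesswork stays out of the core library.

```rust
/// Hypothetical record of one substitution, so the application can warn the user.
#[derive(Debug, PartialEq)]
struct Fix {
    /// Byte offset of the replaced character in the original input.
    offset: usize,
    from: char,
    to: char,
}

/// Replaces each character listed in `lookalikes` and records what was changed.
fn apply_lookalike_fixes(input: &str, lookalikes: &[(char, char)]) -> (String, Vec<Fix>) {
    let mut out = String::with_capacity(input.len());
    let mut fixes = Vec::new();
    for (offset, ch) in input.char_indices() {
        match lookalikes.iter().find(|&&(from, _)| from == ch) {
            Some(&(from, to)) => {
                out.push(to);
                fixes.push(Fix { offset, from, to });
            }
            None => out.push(ch),
        }
    }
    (out, fixes)
}

/// Example table for Devanagari input: ':' for visarga (U+0903), 'S' for avagraha (U+093D).
const DEVANAGARI_LOOKALIKES: [(char, char); 2] = [(':', '\u{0903}'), ('S', '\u{093D}')];
```

The application can then pass the corrected string to the library and use the recorded fixes to correct silently, show a "did you mean…?" message, or reject the input.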
