Open akprasad opened 10 months ago
Expanding on the error cases from above and from #103 --
Malformed input.
Lossy transliteration. This is not quite an error condition, but it is a condition the user might want to be aware of and handle explicitly. Examples:
RRi and R^i both map to ऋ
.rR
Unknown input.
Mistaken input, e.g. confusing : for the visarga. I think vidyut-lipi should not support this case since I don't believe in the robustness principle.
Another error class:
Unsupported input
So, a tentative list of annotations:
enum Quality {
    // Exact text match, reversible with no loss of information (ignoring NFC/NFD).
    Exact,
    // Exact text match, reversible with no loss of phonetic information but may lose
    // byte information (e.g. ITRANS RRi vs. R^i)
    OneWay,
    // Loses phonetic information.
    Lossy,
    // Malformed or garbled input.
    Malformed,
    // Not found in the input mapping.
    Unknown,
    // Found in the input mapping but not in the output mapping.
    Unsupported,
}
I'm leaning more toward @shreevatsa's approach of returning a list of strings, as opposed to a single string and a list of spans. My reasoning:
Returning spans is workable if the transliteration model is simple, but vidyut-lipi (like Aksharamukha) now makes multiple passes to reshape the input and output text during transliteration.
Reshaping this text while also maintaining span offsets seems possible but very messy. That said, I haven't thought about it much.
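To make the trade-off concrete, here is a minimal sketch of the "list of strings" shape, assuming each piece of output carries its own annotation. The names `Chunk`, `reshape`, and `join` are hypothetical and are not vidyut-lipi's API; the point is only that a reshaping pass can rewrite each chunk's text without recomputing any offsets.

```rust
// Hypothetical sketch: none of these names are vidyut-lipi's actual API.

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Quality {
    Exact,
    OneWay,
    Lossy,
    Malformed,
    Unknown,
    Unsupported,
}

/// One piece of output text plus the annotation for how it was produced.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Chunk {
    text: String,
    quality: Quality,
}

/// A reshaping pass rewrites text chunk by chunk. Each annotation stays
/// attached to its own chunk, so there are no byte offsets to recompute.
fn reshape(chunks: Vec<Chunk>, pass: impl Fn(&str) -> String) -> Vec<Chunk> {
    chunks
        .into_iter()
        .map(|c| Chunk { text: pass(&c.text), quality: c.quality })
        .collect()
}

/// Callers that only want "best effort" output can join the chunks.
fn join(chunks: &[Chunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}

fn main() {
    let chunks = vec![
        Chunk { text: "rAma".to_string(), quality: Quality::Exact },
        Chunk { text: ":".to_string(), quality: Quality::Unknown },
    ];
    // A toy pass that changes the byte length of the text.
    let reshaped = reshape(chunks, |s| s.replace("A", "aa"));
    assert_eq!(join(&reshaped), "raama:");
    assert_eq!(reshaped[1].quality, Quality::Unknown);
}
```

With a single string plus spans, the same toy pass would shift every offset after the first replacement, and each pass would have to patch the span list by hand.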
I looked over all the kinds of errors related to input encountered/reported from the Sanskrit metres web app, and most of them fall into the categories already mentioned above:
: for visarga, or S for avagraha: common user error
ḿ
at least partially)
I don't believe in the robustness principle
I don't think this is a matter of belief unfortunately :) I remember Mark Pilgrim used to say “Postel's Law has no exceptions” (now can only find this online); among other things, eventually more tolerant applications will win more users. If vidyut-lipi is intended to be a foundational library for other applications to use, then consider that some of those applications may want to go the "guess what the user intends" route (and offer an "are you sure? / did you mean…?" message or correct automatically). So I think it would be useful for vidyut-lipi to mark those parts of the input (like : and S) as "suspicious" at least, so that the application can decide whether to silently correct, or warn or educate the user, or fail or whatever.
Thanks for the extra context and examples!
I don't think this is a matter of belief unfortunately :)
If you'll permit me to digress :) --
I agree that a user-facing application should degrade gracefully and smooth over well-intentioned user guesses that don't conform to the spec. I think the application layer is the right place to handle this, not the library layer:
Unknown level is enough to let callers decide how to handle this text according to the transliteration policy they want to follow. To me, Suspicious is semantically almost the same as Unknown.
: for visarga, S for avagraha, then | for danda, aa or ii in Harvard-Kyoto, 'Ri^' in ITRANS, etc. There is no end to how smart a program can be at guessing user intent, which muddies the promise we can offer users as a core library. I would rather make a clear promise.
My general view of the robustness principle is consistent with the essay here, particularly section 3.
Yeah it wouldn't be good to do any sort of guesswork in the core library itself: as long as an application can achieve this using the library (either preprocessing the input, or postprocessing things like "Unknown" in the output, or passing to the library a list of characters that must be treated specially), it should be fine. Just saying it would be good to design the library so that such usage is possible.
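As a rough illustration of the "preprocessing the input" option, an application could apply its own substitutions before calling the library and keep a record of what it changed. The helper below is hypothetical application-layer code, not part of vidyut-lipi, and the policy of mapping `:` to `H` (the Harvard-Kyoto visarga) is an assumed example of one application's choice.

```rust
/// Hypothetical application-layer helper; not part of vidyut-lipi.
/// Applies caller-chosen substitutions and reports each one it made, so the
/// application can correct silently, warn the user, or reject the input.
fn apply_policy(input: &str, fixes: &[(char, &str)]) -> (String, Vec<String>) {
    let mut out = String::with_capacity(input.len());
    let mut notes = Vec::new();
    for (i, ch) in input.char_indices() {
        match fixes.iter().find(|(from, _)| *from == ch) {
            Some((from, to)) => {
                out.push_str(to);
                notes.push(format!("byte {}: replaced '{}' with \"{}\"", i, from, to));
            }
            None => out.push(ch),
        }
    }
    (out, notes)
}

fn main() {
    // Assumed policy for Harvard-Kyoto input: treat ':' as a mistyped visarga.
    let fixes = [(':', "H")];
    let (text, notes) = apply_policy("rAma:", &fixes);
    assert_eq!(text, "rAmaH");
    assert_eq!(notes.len(), 1);
}
```

The guessing lives entirely in the caller's `fixes` table, so the core library can keep a strict contract while a user-facing app remains free to be lenient.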
Moving this discussion from #103, with some synthesis of comments from @deepestblue and @shreevatsa --
Context
I think that vidyut-lipi can become a foundational library for the Sanskrit ecosystem with a useful life of multiple decades. I think so primarily because the Rust ecosystem enables a nice "write once, run anywhere" workflow where we can focus on a single high-quality implementation, then bind that implementation elsewhere (Python, Flutter, WASM, ...) as needed.
Foundational libraries should focus on the needs of power users, who expect precision and control. Accordingly, vidyut-lipi's error handling strategy should expose and model cases that a power user might care about.
Such cases include, but are not limited to:
: instead of a visarga, s instead of an avagraha.
Prior work
The transliterators I know of generally implement a "best effort" strategy and return a single output string. The very best transliterators, like Aksharamukha, do likewise but also expose a variety of scheme-specific options that let users control how transliteration should proceed.
Since prior work doesn't extensively model error conditions, a natural question is: is error handling worthwhile at all, or is it a pedantic distraction?
I think it's worthwhile in some form (e.g. for malformed Grantha numerals), and I imagine that prior transliterators avoid explicit handling both because of time constraints and because they evolved to suit the needs of specific applications. (@virtualvinodh curious on your thoughts here.)
Approaches
Our prior discussion in #103 surfaced a variety of approaches to error handling, including:
As suggested by @shreevatsa on #33, I like having a two-tier approach:
Prior discussion
See the comments in #103, especially the comments by @deepestblue and @shreevatsa.
I'd like to start by documenting the error cases that might appear, which will inform a specific error-handling strategy.