ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.

Decide on and implement an error-handling policy #105

Open akprasad opened 10 months ago

akprasad commented 10 months ago

Moving this discussion from #103, with some synthesis of comments from @deepestblue and @shreevatsa --

Context

I think that vidyut-lipi can become a foundational library for the Sanskrit ecosystem with a useful life of multiple decades. I think so primarily because the Rust ecosystem enables a nice "write once, run anywhere" workflow: we can focus on a single high-quality implementation and then bind that implementation elsewhere (Python, Flutter, WASM, ...) as needed.

Foundational libraries should focus on the needs of power users, who expect precision and control. Accordingly, vidyut-lipi's error handling strategy should expose and model cases that a power user might care about.

Such cases include, but are not limited to:

Prior work

The transliterators I know of generally implement a "best effort" strategy and return a single output string. The very best transliterators, like Aksharamukha, do likewise but also expose a variety of scheme-specific options that let users control how transliteration should proceed.

Since prior work doesn't extensively model error conditions, a natural question is: is error handling worthwhile at all, or is it a pedantic distraction?

I think it's worthwhile in some form (e.g. for malformed Grantha numerals), and I imagine that prior transliterators avoid explicit handling both because of time constraints and because they evolved to suit the needs of specific applications. (@virtualvinodh curious on your thoughts here.)

Approaches

Our prior discussion in #103 surfaced a variety of approaches to error handling, including:

As suggested by @shreevatsa on #33, I like having a two-tier approach:

Prior discussion

See the comments in #103, especially the comments by @deepestblue and @shreevatsa.

I'd like to start by documenting the error cases that might appear, which will inform a specific error-handling strategy.

akprasad commented 10 months ago

Expanding on the error cases from above and from #103 --

akprasad commented 9 months ago

Another error class:

So, a tentative list of annotations:

```rust
enum Quality {
    // Exact text match, reversible with no loss of information (ignoring NFC/NFD).
    Exact,
    // Exact text match, reversible with no loss of phonetic information but may lose
    // byte information (e.g. ITRANS RRi vs. R^i)
    OneWay,
    // Loses phonetic information.
    Lossy,
    // Malformed or garbled input.
    Malformed,
    // Not found in the input mapping.
    Unknown,
    // Found in the input mapping but not in the output mapping.
    Unsupported,
}
```
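To make these annotations concrete, here is a minimal sketch of how annotated output chunks might be consumed. Everything beyond the `Quality` variant names is an assumption for illustration: the `Chunk` struct, `is_reversible`, and the two helper functions are hypothetical and are not part of vidyut-lipi's API.

```rust
/// Copy of the tentative `Quality` annotations, with derives added for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Quality {
    Exact,
    OneWay,
    Lossy,
    Malformed,
    Unknown,
    Unsupported,
}

impl Quality {
    /// True if the chunk keeps at least its phonetic content (hypothetical helper).
    pub fn is_reversible(self) -> bool {
        matches!(self, Quality::Exact | Quality::OneWay)
    }
}

/// Hypothetical: one piece of transliterated output plus its annotation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Chunk {
    pub text: String,
    pub quality: Quality,
}

/// Lenient consumer: concatenate every chunk and ignore the annotations.
pub fn to_plain_string(chunks: &[Chunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}

/// Strict consumer: return the joined string only if every chunk is
/// reversible; otherwise return the chunks that are not.
pub fn to_strict_string(chunks: &[Chunk]) -> Result<String, Vec<&Chunk>> {
    let problems: Vec<&Chunk> = chunks
        .iter()
        .filter(|c| !c.quality.is_reversible())
        .collect();
    if problems.is_empty() {
        Ok(to_plain_string(chunks))
    } else {
        Err(problems)
    }
}
```

A caller who only wants a string can use the lenient helper, while a power user can inspect the returned chunks and decide what to do with each `Unknown` or `Lossy` piece.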

I'm leaning more toward @shreevatsa's approach of returning a list of strings, as opposed to a single string and a list of spans. My reasoning:

shreevatsa commented 9 months ago

I looked over all the kinds of errors related to input encountered/reported from the Sanskrit metres web app, and most of them fall into the categories already mentioned above:

> I don't believe in the robustness principle

I don't think this is a matter of belief unfortunately :) I remember Mark Pilgrim used to say “Postel's Law has no exceptions” (now can only find this online). Among other things, more tolerant applications will eventually win more users. If vidyut-lipi is intended to be a foundational library for other applications to use, then consider that some of those applications may want to go the "guess what the user intends" route (and offer an "are you sure? / did you mean…?" message or correct automatically). So I think it would be useful for vidyut-lipi to at least mark those parts of the input (like `:` and `S`) as "suspicious", so that the application can decide whether to silently correct, warn or educate the user, fail, or do something else.
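As a rough sketch of what "marking as suspicious" could look like, the function below only reports where caller-specified characters occur and makes no correction itself. The `Suspicious` struct and `find_suspicious` function are hypothetical names invented for this example, not an existing vidyut-lipi API.

```rust
/// Hypothetical: a character in the input that the caller asked to have flagged.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Suspicious {
    /// Byte offset of the character within the input string.
    pub byte_offset: usize,
    /// The flagged character itself.
    pub ch: char,
}

/// Report every occurrence of a character from `suspects` in `input`.
/// The input is left untouched; deciding what to do is up to the caller.
pub fn find_suspicious(input: &str, suspects: &[char]) -> Vec<Suspicious> {
    input
        .char_indices()
        .filter(|(_, ch)| suspects.contains(ch))
        .map(|(byte_offset, ch)| Suspicious { byte_offset, ch })
        .collect()
}
```

For example, `find_suspicious("राम:", &[':', 'S'])` reports the ASCII colon at byte offset 9 (each of the three preceding Devanagari code points takes three bytes in UTF-8), and the application can then choose to correct it, warn, or fail.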

akprasad commented 9 months ago

Thanks for the extra context and examples!

> I don't think this is a matter of belief unfortunately :)

If you'll permit me to digress :) --

I agree that a user-facing application should degrade gracefully and smooth over well-intentioned user guesses that don't conform to the spec. I think the application layer is the right place to handle this, not the library layer:

My general view of the robustness principle is consistent with the essay here, particularly section 3.

shreevatsa commented 9 months ago

Yeah it wouldn't be good to do any sort of guesswork in the core library itself: as long as an application can achieve this using the library (either preprocessing the input, or postprocessing things like "Unknown" in the output, or passing to the library a list of characters that must be treated specially), it should be fine. Just saying it would be good to design the library so that such usage is possible.
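A minimal sketch of the preprocessing option, kept entirely at the application layer. The `preprocess` function is a hypothetical name, and the replacement table in the usage note is an assumption about what one particular app might want for Devanagari input, not behavior the library defines.

```rust
/// Hypothetical application-level step: rewrite selected characters before
/// handing the text to the transliteration library. `replacements` is a list
/// of (from, to) pairs supplied by the application.
pub fn preprocess(input: &str, replacements: &[(char, char)]) -> String {
    input
        .chars()
        .map(|ch| {
            replacements
                .iter()
                .find(|(from, _)| *from == ch)
                .map(|(_, to)| *to)
                .unwrap_or(ch) // characters without a rule pass through unchanged
        })
        .collect()
}
```

An app that accepts Devanagari input might call `preprocess(text, &[(':', 'ः'), ('S', 'ऽ')])`, assuming it wants to read an ASCII colon as a visarga and `S` as an avagraha. The same table would corrupt Latin-script input, which is one reason to keep this kind of guesswork out of the core library.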