Thanks. It seems that Aksharamukha supports this, and also supports a plain `:` if it could not be a disambiguator in the current context. indic_transliteration has no support for this, which is not surprising because it is based on a Sanscript port, and Sanscript's core algorithm is quite simple.
SaulabhyaJS also supports it, if you want to take a look: https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js (search for separator).
@deepestblue requesting review of this basic spec: `a:i`, `a:u`, `k:h`, `g:h`, `c:h`, `j:h`, `ṭ:h`, `ḍ:h`, `t:h`, `d:h`, `p:h`, and `b:h` (for Sanskrit; other languages might need support for other clusters).

1 and 3 sound right to me. On item 2: given that Sanskrit doesn't traditionally use Latin punctuation, and even in modern Sanskrit people generally use only the comma, the question mark, and the exclamation mark (I guess because the colon is rare even in English), I'd instead propose erroring out if Latin input contains a colon outside of these specified contexts.
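For illustration only (using forms that appear in the test cases later in this thread), this is the ambiguity the separator resolves:

```text
kha  → ख    (aspirated consonant)
k:ha → क्ह   (क् followed by ह)
ai   → ऐ    (diphthong)
a:i  → अइ   (अ followed by इ)
```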
Thanks, will proceed.
On erroring out: I'm undecided on the right error-handling policy for this library, since I expect that a lot of library input will be noisy in various ways (mixed content, large content that hasn't been proofread, etc.)
I am considering returning a `Result` struct in this format, which should be readable to you even though it uses some Rust constructs:

```rust
struct Result {
    text: String,
    errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
    // Byte offsets in the input string. `usize` = platform-specific unsigned int, e.g. u64
    start: usize,
    end: usize,
    error: ...
}
```
Edit: to be specific, I like that this struct returns a best-effort output while also annotating problematic regions of the input text.
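As a sketch of how a caller might consume this shape (the `transliterate` entry point below is hypothetical, not an actual API):

```rust
// Hypothetical usage sketch for the Result/ErrorSpan structs above.
fn transliterate_and_report(input: &str) {
    let result = transliterate(input); // hypothetical function returning `Result`
    println!("{}", result.text); // best-effort output
    for span in &result.errors {
        // The byte offsets point back into the input, so the caller can show
        // exactly which region was problematic.
        eprintln!(
            "could not cleanly transliterate bytes {}..{}: {:?}",
            span.start,
            span.end,
            &input[span.start..span.end]
        );
    }
}
```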
Hmm, my 2 cents is that I'd expect a transliterator like this to be very conservative on input-handling; otherwise round-tripping gets messy, behaviour becomes fuzzy, etc.
I'd propose that un-proofread content isn't a valid scenario.
As for mixed content, my thought is that the content could be marked up appropriately outside of invoking this library. Say, in HTML, the markup can contain the `lang` attribute, and the JS that invokes vidyut would invoke it only for the appropriately marked-up nodes.
Thanks for your 2c! I agree that conservatism is important and that it's important to flag errors clearly rather than muddling along and producing garbage (or worse, clean data with a few hidden dollops of garbage). Ideally, any transliterator output is lossless across a round trip.
At the same time, I also want to balance this principle with ergonomics. For example, I've encountered scenarios like the following, either personally or through friends:

- A user sees a Kannada web document they can't read (full site, forum comment, etc.) and wants to transliterate it to Devanagari.
- A user has the raw data for a text from sanskritdocuments.org, GRETIL, etc. and wants to convert it to Telugu.
- A user has a very long text file produced by Devanagari OCR and wants to convert it to ISO 15919 for easier proofreading.
As a user, I prefer that a transliterator return some useful result, especially if I want to spend at most a few seconds on the task. This is why I'm drawn to the approach I sketched above.
I think your mixed content approach will work well for structured documents like HTML, but if (for example) I'm copying a paragraph from a PDF, that structure won't be easily available.
Other potential approaches:

- a `transliterate_strict` function that errors out early
- a mode option (`Strict`, `Permissive`)
- returning `Result<String>` (see `std::result`) and including the best-effort text in the error condition

(Responding also to https://github.com/ambuda-org/vidyut/pull/33#issuecomment-1907294132 )
I suggest having options for what the transliterator should do with unexpected text. (This is one of the things I'd hope for from a Rust transliterator…) Like {PASS_THROUGH, BEST_EFFORT, ERROR}, say. And/or correspondingly the result from the transliterator can be a sequence of chunks, each of them saying whether it's a "proper" transliterated result, or just a best-guess muddling through, or what.
There can be a "core" transliterator function that is very strict/conservative/pedantic and makes no choices / has no opinions of its own, all of them exposed through options that must be set.
Then there can be convenience wrapper functions for different use-cases (like the "I just want to get something useful" ones mentioned above, and the other use-case that @deepestblue and I are advocating for, of “If I run my text through this transliterator, I'd want to be very sure that if it cannot round-trip back I'd know right away; I don't want to lose any information silently and find out days later”).
Possible examples of the options I mean: a `colon_strategy` (a field of the options struct parameter?).

Even if we expect very few people to use the transliterator "core" function directly, it would be a way of writing down explicitly all the choices that have been made in the convenience wrapper.
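A sketch of what such options might look like (names are illustrative only, not vidyut-lipi's API):

```rust
// Illustrative only: one way to write down the policies discussed above.
enum UnknownTextPolicy {
    PassThrough, // copy unrecognized input to the output unchanged
    BestEffort,  // guess, and flag the guess in the result
    Error,       // refuse and report
}

enum ColonStrategy {
    PassThrough, // treat a stray `:` as ordinary punctuation
    Error,       // reject Latin input with a `:` outside the disambiguation contexts
}

struct TransliterationOptions {
    unknown_text: UnknownTextPolicy,
    colon_strategy: ColonStrategy,
}
```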
Ha, I missed that this discussion was about treating colon as a separator, which is relevant to two of my examples above :)
Also, more concretely responding to comment https://github.com/ambuda-org/vidyut/issues/103#issuecomment-1909404083 above: rather than

```rust
struct Result {
    text: String,
    errors: Vec<ErrorSpan>,
}

struct ErrorSpan {
    // Byte offsets in the input string. `usize` = platform-specific unsigned int, e.g. u64
    start: usize,
    end: usize,
    error: ...
}
```

where the consumer has to manually match up the best-effort text with byte offsets, one of the things I'm proposing is something like (may not be working code, treat as pseudocode):
```rust
// result: Vec<ResultChunk>
struct ResultChunk {
    text: String,
    kind: ResultKind,
}

enum ResultKind {
    Fine(String), // perfectly fine and unproblematic input for the source and destination scripts: well-understood and will round-trip cleanly
    UnknownPassedThrough(String), // emoji, punctuation, etc.: not part of the source and destination scripts, but just passed through
    LikelyInputErrorSilentlyCorrected(String), // e.g. "s" in Devanagari corrected to avagraha
    Separator, // goes with empty text, for input like कइ क्ह to avoid कै ख
    Numeric(String, String), // e.g. ('1234', '१२३४'), so that the user can choose whether to transliterate digits or not
    UnrepresentableClosestMatch(String), // turning some of the different Tamil `L`s into ल and/or ळ
    Dropped(String), // accents and chandrabindu or whatever that we know what they are but don't know how to represent in the target script
    // ...
}
```
or whatever, and the default convenience wrapper would just concatenate all the result chunks' `text`, while the "serious" user could assemble their own different result by looking into the `ResultKind`s.
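For example, the concatenating wrapper could be as small as this (a sketch against the `ResultChunk` type above):

```rust
// Default convenience wrapper: just join the chunks' best-effort text.
fn concat_chunks(chunks: &[ResultChunk]) -> String {
    chunks.iter().map(|c| c.text.as_str()).collect()
}
```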
(Having these in the result may be even better than having to pre-specify some options e.g. whether to transliterate digits or not. A higher-level UI could say: “I transliterated your text for you, but note the following that I couldn't do anything with, or which you may want to change in your input…”)
(Doing all this may make it slower but despite the temptation of "it's in Rust, it must be fast" I believe hardly any applications are bottlenecked by transliteration speed in practice, and the appeal of Rust here for me is more in the types being able to represent all this.)
Transliterating from a script to itself (Devanagari to Devanagari, or IAST to IAST) would then be a way of finding all problematic stuff in it :-)
Anyway I'll stop the flood of comments here; aware that what I'm proposing is likely overengineering :-) The broader point is just a desire for a really conservative/pedantic/lossless transliterator which will never silently corrupt text no matter what the input is or how many rounds of transliteration in whatever directions it undergoes using the library.
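To make that round-trip property concrete, it could be stated as a test along these lines (a sketch; `transliterate` and `Scheme` here are stand-ins, not the actual API):

```rust
// Property sketch: Devanagari → ISO 15919 → Devanagari should reproduce the
// original text exactly for any input the transliterator claims to handle.
fn assert_round_trip(text: &str) {
    let iso = transliterate(text, Scheme::Devanagari, Scheme::Iso15919);
    let back = transliterate(&iso, Scheme::Iso15919, Scheme::Devanagari);
    assert_eq!(back, text);
}
```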
Thank you for the wonderful discussion!
I think error handling is a large enough topic that it deserves its own issue, so I've created #105. Let's continue there so that this issue can remain focused on ISO 15919.
A couple of sorta related issues:

- `aū` should transliterate to अऊ
- `agḥ` should transliterate to अग्ः (I'm not sure there's a use-case for this specific example)
@deepestblue Thanks for the bug report! I was hoping to transliterate in as few input passes as possible, but I guess a basic convert-to-NFC pass is worth it to avoid headaches elsewhere. (Edit: fixed by calling `to_nfc` first.)
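A minimal sketch of that kind of NFC pre-pass, assuming the unicode-normalization crate (the library's actual `to_nfc` helper may be implemented differently):

```rust
use unicode_normalization::UnicodeNormalization;

// Compose the input into NFC before tokenizing.
fn to_nfc(s: &str) -> String {
    s.nfc().collect()
}

fn main() {
    // "ū" typed as `u` + combining macron (U+0304) composes to U+016B under NFC,
    // so the tokenizer sees a single long vowel instead of a stray combining mark.
    assert_eq!(to_nfc("au\u{0304}"), "a\u{016B}");
}
```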
Returning to the main issue (mainly taking notes for myself) --
I tried to hack around this behavior by enumerating all cases specifically and adding them to the token map. The blocker there was how to support `a:i`, since the `a` is an implicit vowel on the preceding consonant. We could explicitly store all mappings (कइ, खइ, etc.) to get around this, but this feels gross and unprincipled.
Stepping back, the core logic seems to be something like:
```rust
if from.is_abugida() && to.is_alphabet() && to.has_separator() {
    if prev.is_consonant() && cur.is_independent_vowel() {
        // for a:i, a:u
        output += separator;
    } else if TODO {
        output += separator;
    }
}
```
Maybe we can combine these by hard-coding `k:h` etc. and then using custom code for the vowel-based separator.
Tentative test cases:
```text
// positive
"a:i a:u"
"ka:i ka:u"
"k:ha g:ha c:ha j:ha ṭ:ha ḍ:ha t:ha d:ha p:ha b:ha"
"ḷ:ha"

// negative -- colon should be ignored
"a:"
"ka:"
"k:"
"a:A"
"k:ta"
```
Yep, this seems similar to the code in saulabhyaJS near https://github.com/deepestblue/saulabhyaJS/blob/main/src/saulabhya.js#L352
Returning to this now in anticipation of a vidyut-py release. I'll start by adding basic support, then expand over time.
Implemented locally. Here's the test case:

```rust
// Consonants
assert_two_way_pairwise(&[
(
Iso15919,
"k:ha g:ha c:ha j:ha ṭ:ha ḍ:ha t:ha d:ha p:ha b:ha",
),
(Slp1, "kha gha cha jha wha qha tha dha pha bha"),
(Devanagari, "क्ह ग्ह च्ह ज्ह ट्ह ड्ह त्ह द्ह प्ह ब्ह"),
(Kannada, "ಕ್ಹ ಗ್ಹ ಚ್ಹ ಜ್ಹ ಟ್ಹ ಡ್ಹ ತ್ಹ ದ್ಹ ಪ್ಹ ಬ್ಹ"),
]);
// Consonants with marks
assert_two_way_pairwise(&[
(
Iso15919,
"k:hā g:hā c:hā j:hā ṭ:hā ḍ:hā t:hā d:hā p:hā b:hā",
),
(Slp1, "khA ghA chA jhA whA qhA thA dhA phA bhA"),
(Devanagari, "क्हा ग्हा च्हा ज्हा ट्हा ड्हा त्हा द्हा प्हा ब्हा"),
(Kannada, "ಕ್ಹಾ ಗ್ಹಾ ಚ್ಹಾ ಜ್ಹಾ ಟ್ಹಾ ಡ್ಹಾ ತ್ಹಾ ದ್ಹಾ ಪ್ಹಾ ಬ್ಹಾ"),
]);
// Consonants with viramas
assert_two_way_pairwise(&[
(Iso15919, "k:h g:h c:h j:h ṭ:h ḍ:h t:h d:h p:h b:h"),
(Slp1, "kh gh ch jh wh qh th dh ph bh"),
(Devanagari, "क्ह् ग्ह् च्ह् ज्ह् ट्ह् ड्ह् त्ह् द्ह् प्ह् ब्ह्"),
(Kannada, "ಕ್ಹ್ ಗ್ಹ್ ಚ್ಹ್ ಜ್ಹ್ ಟ್ಹ್ ಡ್ಹ್ ತ್ಹ್ ದ್ಹ್ ಪ್ಹ್ ಬ್ಹ್"),
]);
// Vowels
assert_two_way_pairwise(&[
(Iso15919, "a:i a:u ka:i ka:u"),
(Slp1, "ai au kai kau"),
(Devanagari, "अइ अउ कइ कउ"),
(Kannada, "ಅಇ ಅಉ ಕಇ ಕಉ"),
]);
// Regular colons -- ignore
// TODO: what's the best policy for handling these?
assert_two_way_pairwise(&[
(Iso15919, "a: ka: k: a:ā k:ta"),
(Slp1, "a: ka: k: a:A k:ta"),
(Devanagari, "अ: क: क्: अ:आ क्:त"),
(Kannada, "ಅ: ಕ: ಕ್: ಅ:ಆ ಕ್:ತ"),
]);
```
For now, I avoided the issue of how to handle erroneous colons.
I'll close this issue once the changes are pushed.
Pushed.
One of the many corner cases ISO 15919 supports is using a `:` to disambiguate Latin letter clusters. Here are a couple of examples that need support in vidyut-lipi.

```text
./lipi -f devanagari -t iso15919 "अर्शइत्यादयः"
Expected: arśa:ityādayaḥ
Actual:   arśaityādayaḥ

./lipi -t devanagari -f iso15919 "arśa:ityādayaḥ"
Expected: अर्शइत्यादयः
Actual:   अर्श:इत्यादयः

./lipi -f devanagari -t iso15919 "वाग्हरि"
Expected: vāg:hari
Actual:   vāghari

./lipi -t devanagari -f iso15919 "vāg:hari"
Expected: वाग्हरि
Actual:   वाग्:हरि
```