ambuda-org / vidyut

Infrastructure for Sanskrit software. For Python bindings, see `vidyut-py`.

ISO 15924 codes (like the ICU lib for C++/Java) for schemes #108

Closed mediabuff closed 4 months ago

mediabuff commented 4 months ago

vidyut-lipi uses a custom enum type for script/scheme IDs. It would be useful to change these enum values to ISO codes, which would help with interop with other languages. For Roman schemes, you could define numbers outside the ISO range.

These codes are also used by the Unicode ICU libraries in C++, Java, and .NET.

https://en.wikipedia.org/wiki/ISO_15924

akprasad commented 4 months ago

Thank you for filing this issue!

To make sure I follow, are you suggesting something like this?

enum Scheme {
  Bengali = 325,
  Devanagari = 315,
  ...
}

Can you give a short example of how this helps interop? For example, something like cxx can generate C++ enums on our behalf, so a C++ caller could use the generated enum directly instead of passing codes.
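
For reference, here's a minimal sketch of what I mean by that (the `transliterate` function and the variant list are purely illustrative, not actual vidyut-lipi API):

#[cxx::bridge]
mod ffi {
    // A shared enum: cxx generates a corresponding C++ type, so a C++
    // caller can write `ffi::Scheme::Devanagari` instead of a raw number.
    enum Scheme {
        Devanagari,
        Bengali,
    }

    extern "Rust" {
        // Hypothetical function, shown only to illustrate the bridge.
        fn transliterate(input: &str, from: Scheme, to: Scheme) -> String;
    }
}

// Rust-side stub backing the bridged function.
fn transliterate(input: &str, _from: ffi::Scheme, _to: ffi::Scheme) -> String {
    input.to_string()
}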

akprasad commented 4 months ago

As a separate discussion, I've been thinking about adding some basic level of ISO support, perhaps starting with:

impl Scheme {
  pub fn to_iso_code(&self) -> &str { unimplemented!() }

  pub fn from_iso_code(code: impl AsRef<str>) -> Result<Self> { unimplemented!() }

  pub fn to_iso_number(&self) -> u32 { unimplemented!() }

  pub fn from_iso_number(num: u32) -> Result<Self> { unimplemented!() }
}
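
For example (sketch only; "Deva" and 315 are Devanagari's ISO 15924 code and number, and this assumes `Scheme` derives `Debug` and `PartialEq`):

let scheme = Scheme::from_iso_code("Deva").expect("known code");
assert_eq!(scheme, Scheme::Devanagari);
assert_eq!(scheme.to_iso_number(), 315);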

But I'll stop there to avoid flooding this thread.

mediabuff commented 4 months ago

Thanks for your prompt response. Firstly, congratulations and a big thank you for your efforts on the Ambuda project.

To make sure I follow, are you suggesting something like this?

enum Scheme {
  Bengali = 325,
  Devanagari = 315,
  ...
}

Yes. A small correction to my request: I actually meant ICU script codes along with the ISO four-letter script tags and full names. I think Aksharamukha uses these 4-letter tags.

https://github.com/unicode-org/icu/blob/abcb80fd536b4505d8f74209aca71656a7aa54e7/icu4c/source/common/unicode/uscript.h#L9

Can you give a short example of how this helps interop? For example, something like cxx can generate C++ enums on our behalf, so a C++ caller could use the generated Enum directly instead of passing codes.

  1. Why reinvent codes when a recognized state of the art exists? ICU for Unicode is widely used on all platforms, including core OS platforms.
  2. ICU is working on a reimplementation in Rust. The library has some fantastic algorithms for NLP; in fact, their transliteration is a generic rules-based engine, something that vidyut-lipi should plan for going forward.
  3. On the interop front, I am working on a Rust-based REST API server built on vidyut-lipi, to be consumed by desktop clients written in C++, C#, etc. These programs already use various NLP libraries, including ICU. The REST API would send Vidyut script codes as numbers. With de facto ICU codes on the client, I would not need a mapping layer for either the codes or the script names and tags. (A rough sketch of the kind of payload I mean follows this list.)
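
For illustration only, the shape of payload I have in mind (field names are hypothetical; the numeric code would come from whatever accessor vidyut-lipi ends up exposing):

use serde::Serialize;

/// Hypothetical response body for a transliteration endpoint.
#[derive(Serialize)]
struct TransliterateResponse {
    /// Numeric script code of the output, e.g. ICU's 10 for Devanagari.
    script_code: u32,
    /// The transliterated text.
    text: String,
}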

Btw, I have reimplemented the entire Ambuda in .NET Core and Blazor (both browser and desktop clients). One day, when I get time, I will try to make it open source.

akprasad commented 4 months ago

Thanks for the extra details!

one day - when I get time - will try to make it open source.

Very cool! Would love to see it whenever it's ready.

I think Aksharamukha uses these 4-letter tags.

Aksharamukha uses a mix of argument handling strategies and supports language codes, language names, and its own custom format. For details, see here.

In fact, their transliteration is a rules based generic engine.

Do you mean ICU transforms? They do look very interesting!

One constraint I want to follow with vidyut-lipi is to have a very light WebAssembly build that could be used client side, ideally <100KB gzipped (but the smaller the better). So if we can find a lightweight way to either use or rewrite that system, I'm open to it. Otherwise I think it's too heavy. (vidyut-lipi avoids using regexes for the same reason.)

With de facto ICU codes on the client, I would not need a mapping layer for either the codes or the script names and tags.

I find this point convincing, especially since these are standard codes used across a wide ecosystem of tools.


Given that we will support ISO/ICU codes and tags, the next question is how to do so. Here's my current view, but all of this is open for discussion:

For enum names, I am inclined to keep the names as-is rather than rename them to ISO tags.

For enum values, I am inclined not to assign explicit values (i.e. I want to leave the discriminant unspecified), so the numeric mapping stays out of the enum definition itself.

So for these reasons I'm inclined toward having helper functions on impl Scheme. This would still give you a clean API while preserving some flexibility in the crate.
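
Concretely, I'm imagining something along these lines (just a sketch; 325 and 315 are the ISO 15924 numbers quoted above, and the real mapping would cover every variant):

enum Scheme {
    Bengali,
    Devanagari,
    // ...
}

impl Scheme {
    /// Sketch of a helper method; no discriminants are baked into the enum itself.
    pub fn to_iso_number(&self) -> u32 {
        match self {
            Scheme::Bengali => 325,
            Scheme::Devanagari => 315,
        }
    }
}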

That said, this is just my perspective and I might be missing something.

mediabuff commented 4 months ago

This may be of interest: the header file from my ICU-based transliteration library for Indic scripts. I have used ICU codes where applicable, and for Roman (they call it Latin!!) schemes, a range above ICU starting at 2000.

Attached UScript.h UScripts.h.txt

#pragma once

using namespace System;

namespace Indic::Lekhya::ICU::Net {

// Modelled after https://github.com/NightOwl888/ICU4N/blob/37df14cb6335354bb354be195d72bc9a7e857809/src/ICU4N/Globalization/UScript.cs#L6
public enum class UScriptCode {
    Common = 0, Inherited = 1, Arabic = 2, Armenian = 3,
    // ...
    Lao = 24, Latin = 25, Malayalam = 26,
    // ...
    NagMundari = 199,

    // Aliases
    Ucas = CanadianAboriginal,
    Sindhi = Khudawadi,
    Mandaean = Mandaic,
    Meroitic = MeroiticHieroglyphs,
    Phonetic_Pollard = Miao,

    // ...
    ahom = Ahom, assamese = Latin + 2000, balinese = Balinese, bengali = Bengali,
    bhaiksuki = Bhaiksuki, brahmi = Brahmi, brahmi_tamil = Latin + 2001, burmese = Latin + 2002,
    // ...
    warang_citi = WarangCiti, zanbazar_square = Latin + 2019,
    Latin_avestan = Avestan, Latin_baraha = Latin + 1000, Latin_cyrillic = Cyrillic,
    Latin_hk = Latin + 1001, Latin_hk_dravidian = Latin + 1002, Latin_iast = Latin + 1003,
    Latin_iast_iso_m = Latin + 1004, Latin_iso = Latin, Latin_iso_vedic = Latin + 1005,
    Latin_itrans = Latin + 1006, Latin_itrans_dravidian = Latin + 1007, Latin_itrans_lowercase = Latin + 1008,
    Latin_kolkata_v2 = Latin + 1009, Latin_mahajani = Mahajani, Latin_multani = Multani,
    Latin_optitrans = Latin + 1010, Latin_optitrans_dravidian = Latin + 1011, Latin_persian_old = Latin + 1012,
    Latin_slp1 = Latin + 1013, Latin_slp1_accented = Latin + 1014, Latin_titus = Latin + 1015,
    Latin_velthuis = Latin + 1016, Latin_wx = Latin + 1017,
};

}

akprasad commented 4 months ago

Thanks for the example!

To clarify, do the numeric values in UScriptCode have any meaning or standard outside of ICU4N, or are they just the convention of a single program? That is, why not use script tags everywhere and ignore these numeric codes altogether?

mediabuff commented 4 months ago

1) The numerics have no meaning outside of the ICU ecosystem. They are the same across the ICU C++, Java, and .NET libraries and other bindings. That said, a mini/custom version of ICU is built into all major OSs - Windows, iOS, etc. Their native APIs might use the same codes (I have not tested this). I know Windows exposes direct binary access (an ABI) to its built-in ICU. Tags and names are mostly ISO.

2) Enum names (aka tags) are compile-time only in C++ (unlike C# and other languages that carry runtime metadata). Thus, for runtime interop in compiled languages like C++, enums are just numbers. It takes a bit of effort in C++ to get enum names compiled into the binary (some kind of macro expansion helper). Bit of a pain.

3) Also note the APIs:

    public ref class UScripts abstract sealed {
    public:
        static UScriptCode scriptForCodePoint(UInt32 codePoint);
        static String^ scriptForCodePoint2(UInt32 codePoint);
        static UScriptCode GetScript(int codepoint);
        static String^ GetScriptName(int codepoint);
        static String^ GetName(UScriptCode scriptCode);
        static String^ GetShortName(UScriptCode scriptCode);
        static UScriptCode GetCodeFromName(String^ name);
    };

4) Dual APIs - raw numeric script codes and typed enums. Internally the enums are just cast to int. Codepoints are Unicode scalar values.

5) I also use a bunch of integer range algorithms for Unicode ranges - e.g. building sets and unions of code points.

akprasad commented 4 months ago

Thanks for the extra context.

Given all of this, I propose this API:

pub fn from_icu_code(code: u32) -> Result<Self> { todo!() }
pub fn to_icu_code(&self) -> Option<u32> { todo!() }

Can you prepare a PR? To save some boilerplate, I've started off with this macro:

macro_rules! icu_codes {
    ($( $variant:ident => $code:literal ),* $(,)?) => {
        impl Scheme {
            /// Parses the given ICU `code` and returns the best match.
            ///
            /// Codes are defined according to the [`UScript.cs`][codes] file in ICU4N.
            ///
            /// [codes]: https://github.com/NightOwl888/ICU4N/blob/tree/src/ICU4N/Globalization/UScript.cs
            pub fn from_icu_code(code: u32) -> Result<Self> {
                let ret = match code {
                    $(
                        $code => Scheme::$variant,
                    )*
                    _ => return Err(LipiError::ParseError),
                };
                Ok(ret)
            }

            /// Converts the given scheme to its ICU code, if one exists.
            pub fn to_icu_code(&self) -> Option<u32> {
                let ret = match self {
                    $(
                        Scheme::$variant => $code,
                    )*
                    _ => return None,
                };
                Some(ret)
            }
        }
    };
}

Then the part for you to focus on is defining the codes:

icu_codes!(
    Devanagari => 10,
    // *** add others here ***
);
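
Once the codes are filled in, call sites would look something like this (10 is ICU's value for Devanagari; this assumes `Scheme` derives `Debug` and `PartialEq`):

assert_eq!(Scheme::from_icu_code(10).unwrap(), Scheme::Devanagari);
assert_eq!(Scheme::Devanagari.to_icu_code(), Some(10));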

Result and LipiError are from a new errors.rs module:

use std::fmt;

/// A `std::result::Result` specialized to `LipiError`.
pub type Result<T> = std::result::Result<T, LipiError>;

/// Models the error states of `vidyut-lipi`.
#[derive(Copy, Clone, Debug)]
pub enum LipiError {
    /// Could not parse an input value.
    ParseError,
}

impl fmt::Display for LipiError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        use LipiError::*;

        match self {
            ParseError => write!(f, "parse error"),
        }
    }
}

mediabuff commented 4 months ago

Thanks for your quick action on this. This will take some time for me, as I am a newbie to Rust (Ambuda is motivating me to learn fast). Additionally, I don't have a contributor workflow set up with GitHub, let alone on a Rust project. In due course, I will certainly be a contributor.

akprasad commented 4 months ago

Sounds good! In that case I'll get something ready. It'll be conservative, but it's better than nothing, and we can iterate on it.

akprasad commented 4 months ago

Hm, your reference file seems to be missing many scripts, including Dogra and Nandinagari. How should these be defined?

mediabuff commented 4 months ago

Did you look at the attachment UScripts.h.txt? It has Dogra = 178. It should be above in one of my replies.

akprasad commented 4 months ago

Thanks, I saw it earlier but forgot it when diving back in. I see many of the values there are also corroborated by the ICU4X implementation here. I'll follow what ICU4X does and aim to avoid custom codes.

akprasad commented 4 months ago

Sorry, two more questions:

  1. It looks like ICU4N already supplies functions like GetCode, GetCodeFromName, etc. for turning an ISO 4-letter code into a numeric code. The ICU4X Rust port also has a similar function.

    Are these functions not workable for you? I want to make sure that a separate ICU method is a meaningful addition to the ISO-code methods I'm working on. If the ISO methods can work instead, it's better to keep the API surface small.

  2. Does your API depend on using separate codes for HK, IAST, etc.? This is not what ICU does (I think it would map these all to 25, the code for Latin), so if you depend on this behavior, I think it's best handled at the application level (i.e. your code) and not in this library. But I'm happy to provide boilerplate so you don't have to write it yourself here.

mediabuff commented 4 months ago

ICU4N and ICU4X can work for a subset of the scripts. The challenge is the numerous Latin/Roman schemes for Indic scripts. ICU4N/ICU4X assume ISO transliteration - that is the only transliteration their libraries support. That is one of the reasons I extended their library with my own in my C++/.NET implementation.

Does your API depend on using separate codes for HK, IAST, etc.?

Yes, so that it works with applications that are aware of Indic transliteration schemes. That is why I extended the ICU numeric range - albeit with custom IDs of my own.

I think it's OK not to have this in the Vidyut library, as long as it maps to the same codes for ICU-recognized scripts in the core APIs.

akprasad commented 4 months ago

I think it's OK not to have this in the Vidyut library, as long as it maps to the same codes for ICU-recognized scripts in the core APIs.

This is implemented in the latest local build. Pushing soon.

akprasad commented 4 months ago

Pushed. Scheme now has the following methods:

fn iso_15924_code(&self) -> &str
fn iso_15924_numeric_code(&self) -> u16
fn icu_numeric_code(&self) -> u16
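
For example (illustrative values: "Deva" and 315 are the ISO 15924 code and number for Devanagari, and 10 is its ICU code):

let s = Scheme::Devanagari;
assert_eq!(s.iso_15924_code(), "Deva");
assert_eq!(s.iso_15924_numeric_code(), 315);
assert_eq!(s.icu_numeric_code(), 10);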

Thanks for filing this issue! Please open a new one if these methods are insufficient.