Rust-based icu_collator bindings

foxbenjaminfox commented 1 year ago

As a first step towards #9, I've added a very first draft of what Rust bindings might look like. I've implemented a NIF function along the lines of what is described there; so far it duplicates what is currently implemented with the C bindings.

I understand from the description in #9 that this might not be entirely what we want: you envision exposing all the CollatorOptions to Elixir as they are, rather than simply switching between primary and the default tertiary strength based on an :insensitive option. Is this correct? Be that as it may, as a very first step I wanted to replicate the existing setup in order to establish parity. We can then easily change what options it accepts to better align with icu_collator, or however you think best.
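To make the current behaviour concrete, here is a minimal sketch of the option mapping described above. The `Casing` and `Strength` enums and the `strength_for` function are hypothetical stand-ins written for illustration; they are not the types from icu_collator (whose own strength enum has more levels than the two shown here) and not necessarily how the PR implements it.

```rust
// Hypothetical stand-in for the collation strength levels this draft uses.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strength {
    Primary,  // base letters only: case differences are ignored
    Tertiary, // the default: case differences are significant
}

// Hypothetical decoded form of the Elixir `casing:` option.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Casing {
    Sensitive,
    Insensitive,
}

// Map the option to a strength, falling back to the tertiary default
// when the option is absent.
fn strength_for(casing: Option<Casing>) -> Strength {
    match casing {
        Some(Casing::Insensitive) => Strength::Primary,
        Some(Casing::Sensitive) | None => Strength::Tertiary,
    }
}
```

Exposing the full set of collator options later would presumably replace this two-way switch with a decoder for each option, which is why it is kept as a single small function here.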

For now, I've put the internal API in an intended-to-be-private module named Cldr.Collation.Nif. You can rearrange the code on the Elixir side as you please, but it's probably best that the private NIF API get an internal module of its own, whatever name we might choose to give it.

With what is implemented so far, we can match the functionality found in the existing tests:

assert Cldr.Collation.Nif.sort("en", ["AAAA", "AAAa"], %{casing: :insensitive}) == ["AAAA", "AAAa"]
assert Cldr.Collation.Nif.sort("en", ["AAAA", "AAAa"], %{casing: :sensitive}) == ["AAAa", "AAAA"]
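Why the two assertions differ may not be obvious, so here is a toy model of the two strength levels in plain Rust. It is an assumption-laden illustration only: it uses simple case folding instead of real collation data, it is not what icu_collator does internally, and the function names are made up. At primary strength the two strings compare equal, so a stable sort leaves them in input order. At tertiary strength the case difference breaks the tie, with lowercase sorting first.

```rust
use std::cmp::Ordering;

// Toy primary-level comparison: case is ignored entirely.
fn primary_cmp(a: &str, b: &str) -> Ordering {
    a.to_lowercase().cmp(&b.to_lowercase())
}

// Toy tertiary-level comparison: compare at primary level first, then
// break ties on the first position where the case differs, placing
// lowercase before uppercase.
fn tertiary_cmp(a: &str, b: &str) -> Ordering {
    primary_cmp(a, b).then_with(|| {
        for (x, y) in a.chars().zip(b.chars()) {
            match (x.is_lowercase(), y.is_lowercase()) {
                (true, false) => return Ordering::Less,
                (false, true) => return Ordering::Greater,
                _ => {}
            }
        }
        Ordering::Equal
    })
}

// Stable sort, so strings that compare equal keep their input order.
fn sort(mut words: Vec<&str>, case_sensitive: bool) -> Vec<&str> {
    if case_sensitive {
        words.sort_by(|a, b| tertiary_cmp(a, b));
    } else {
        words.sort_by(|a, b| primary_cmp(a, b));
    }
    words
}
```

With this model, `sort(vec!["AAAA", "AAAa"], false)` keeps the input order and `sort(vec!["AAAA", "AAAa"], true)` moves "AAAa" to the front, mirroring the two Elixir assertions.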

That said, I've punted on the main challenge here: loading the CLDR data. Rust's icu_provider gives a plethora of ways to create a DataProvider: at runtime, at compile time, and from all sorts of formats. What do we want to do here? I'm not sure what the C version does, or if we want to match it.

The current implementation uses the icu_testdata crate, which contains stable data intended for use in testing, but isn't intended for use outside of that.

kipcole9 commented 1 year ago

Wow, that's amazing! And I've no argument with anything you said above (duplicate current functionality first, private NIF module). Yes, most definitely I would like to expose the collator options somewhere down the track.

The data loading part surprised me, I guess because I didn't think it through. The current C-based version only implements the DUCET, so there is no data to be loaded. And for some reason I thought icu_collator was self-contained on that front, but clearly not.

I'd welcome any thoughts you have, and I'll do some more reading. Getting into another, different, data management domain wouldn't be much fun for you and me. And it wouldn't be fun for anyone using the library either.

foxbenjaminfox commented 1 year ago

I was kind of hoping you'd already have an answer for how you want to handle this question, given that you must have already settled on something or other for all the existing Elixir CLDR libraries. This hasn't really got anything to do with collation per se: every type of localization has this issue. If some obscure language uses an unusual pluralization rule, then on the one hand you want to support that language (after all, that's what CLDR is for), but on the other hand you might not want to require every single use of the library to bring along the relatively large amount of data needed to cover every possible case for every single language. Then again, that's probably overstating it, since I see that you do bundle pluralization rules in with the library. So maybe that's fine?

The Rust icu crate (of which the icu_collator crate is part) takes the approach of requiring the user to pass in a DataProvider to the relevant functions, thus allowing the user of the crate to decide on the strategy. As the icu_provider crate's documentation says:

Unicode's experience with ICU4X's parent projects, ICU4C and ICU4J, led the team to realize that data management is the most critical aspect of deploying internationalization, and that it requires a high level of customization for the needs of the platform it is embedded in. As a result ICU4X comes with a selection of providers that should allow for ICU4X to naturally fit into different business and technological needs of customers.

Alright, fair enough. But for our purposes we don't need the full extent of this customization. Doing something similar to what I see is done with the pluralization rules I linked to above, that is to say bundling data with the library, seems reasonable to me. If I understand the code of the cldr library correctly, you've got a custom setup for compiling the plural rules. icu_provider can do the same sort of thing in Rust if we want, without us needing to roll it ourselves. In this guide they explain multiple ways of doing it: we could compile the data into the .so at build time, load the data from Elixir's priv directory, or use one of a number of other strategies.
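To make the trade-off between those two strategies concrete, here is a rough sketch in plain Rust of what "the caller chooses the data source" looks like. Everything in it is hypothetical: the `CollationData` trait, the `Bundled` and `FromDir` types, and the `<locale>.bin` file naming are invented for illustration and are not the icu_provider API, which has its own provider types and data formats.

```rust
use std::collections::HashMap;
use std::fs;
use std::path::PathBuf;

// Hypothetical abstraction over "where collation data comes from".
trait CollationData {
    fn load(&self, locale: &str) -> Option<Vec<u8>>;
}

// Strategy 1: data compiled into the shared library. In a real build the
// blobs might come from something like `include_bytes!`; here they are
// simply static slices supplied by the caller.
struct Bundled {
    blobs: HashMap<&'static str, &'static [u8]>,
}

impl CollationData for Bundled {
    fn load(&self, locale: &str) -> Option<Vec<u8>> {
        self.blobs.get(locale).map(|bytes| bytes.to_vec())
    }
}

// Strategy 2: data read at runtime from a directory, for example the
// application's priv directory passed in from Elixir.
struct FromDir {
    root: PathBuf,
}

impl CollationData for FromDir {
    fn load(&self, locale: &str) -> Option<Vec<u8>> {
        // Assumed layout: one file per locale named "<locale>.bin".
        fs::read(self.root.join(format!("{}.bin", locale))).ok()
    }
}

// Code that needs the data is written against the trait only, so the
// loading strategy can change without touching it.
fn describe(provider: &dyn CollationData, locale: &str) -> String {
    match provider.load(locale) {
        Some(bytes) => format!("{}: {} bytes", locale, bytes.len()),
        None => format!("{}: no data", locale),
    }
}
```

The bundled variant makes the .so larger but needs no file handling at runtime; the directory variant keeps the binary small but requires the Elixir side to ship the files and pass their location to the NIF.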

I could just settle on one of these and do it, but ideally you'd have some overarching approach for how the Elixir CLDR libraries do this sort of thing, and I'd like to follow it as much as possible.

elixir-cldr / cldr_collation

Rust-based icu_collator bindings #10