WebAssembly / WASI

WebAssembly System Interface

Request: wasi-icu #590

Open oovm opened 8 months ago

oovm commented 8 months ago

I had some difficulties writing features such as string iterators under utf8 and wtf16, parsers with unicode character properties, and time-related formatters, and I realized that I needed some standard interfaces related to internationalization.

Can wasi standardize the International Components for Unicode interfaces?

Advantage

A host implementation can greatly reduce the size of the wasm binary: the guest does not need to embed a huge dictionary, there is no loading time, and performance is better. The host can also keep up with follow-up standards and move to a new version of ICU without the distributed binary being updated.

These advantages are not available with sdk embedding or guest embedding.

Moreover, some of the rules ICU implements are very complex, and non-experts are likely to implement them incorrectly. It is costly for every language to implement them individually.

Disadvantages

Not quite in line with WebAssembly System Interfaces, but in line with WebAssembly Standard Interfaces.

Related

kaizhu256 commented 8 months ago

+1

it would allow sqlite to easily extend regexp-replace in webassembly (https://github.com/sqlite/sqlite/blob/master/ext/icu/icu.c)

sunfishcode commented 8 months ago

I wonder how feasible it would be to adapt ICU4X's language-bindings system to (semi-)automatically produce a Wit API.

devsnek commented 8 months ago

@sffc i think we discussed something like this at one point... do you think diplomat could be up to the task?

sffc commented 8 months ago

It makes sense to have bindings to system libraries in order to reduce binary size. ICU4X can of course serve as a polyfill when a platform API is unavailable. For example, Android, Windows, and iOS (and maybe others) have standard APIs that can be wrapped for a subset of i18n functionality without adding any additional dependencies.

On the specifics:

> string iterators under utf8 and wtf16

Many modern programming languages give you this for free. In Rust, UTF-8 iteration is built in, and for UTF-16 you can use the lightweight utf16_iter crate. In C++, I like to use the macros in ICU4C's utf8.h or utf16.h, which do not require any runtime library dependencies (you only need to include them at build time).
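As a rough illustration of the Rust side of this point, here is a minimal sketch that uses only the standard library. It swaps in `char::decode_utf16` from `std` rather than the utf16_iter crate mentioned above, and the function names `scalars_utf8` and `scalars_utf16_lossy` are made up for this example.

```rust
/// Collects the Unicode scalar values of a UTF-8 string.
/// `str::chars` is part of the standard library, so no ICU data is needed.
fn scalars_utf8(s: &str) -> Vec<char> {
    s.chars().collect()
}

/// Decodes potentially ill-formed UTF-16 (for example WTF-16 style input
/// that may contain unpaired surrogates). Each unpaired surrogate is
/// replaced with U+FFFD instead of aborting the iteration.
fn scalars_utf16_lossy(units: &[u16]) -> Vec<char> {
    char::decode_utf16(units.iter().copied())
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect()
}
```

Calling `scalars_utf16_lossy` on the code units of a well-formed string yields the same scalar values as `scalars_utf8` on the original `&str`. Replacing unpaired surrogates is only one possible policy; a parser that must round-trip its input would need to keep them.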

> parsers with unicode character properties

What are you trying to parse? If you're talking about regular expressions, that is an interesting topic requiring further evaluation, because you'll need not only ICU4* for the properties but also a regex engine.

> time-related formatters

This is of course ICU4*'s core competency.

Other features not mentioned here are Collator and Normalizer. These are smaller, data-heavy APIs that might be good starting points.

When doing API design, I encourage using the ECMA-402 API surface.
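To make the shape of such an API concrete, here is a hedged sketch of what a guest-facing collation interface might look like if it followed the `compare` method of ECMA-402's `Intl.Collator`. Everything here is hypothetical: the `Collator` trait, `CodePointCollator`, and `sort_with` are illustrative names and not part of any existing WASI proposal or of ICU4X. The placeholder implementation compares by code point, which is not locale-aware collation; it only stands in for whatever the host would provide.

```rust
use std::cmp::Ordering;

/// Hypothetical interface, loosely modelled on `Intl.Collator.prototype.compare`.
/// A host-provided implementation would sit behind this boundary.
trait Collator {
    fn compare(&self, a: &str, b: &str) -> Ordering;
}

/// Placeholder standing in for a host-backed collator.
/// It orders strings by Unicode code point, which is NOT real collation.
struct CodePointCollator;

impl Collator for CodePointCollator {
    fn compare(&self, a: &str, b: &str) -> Ordering {
        a.chars().cmp(b.chars())
    }
}

/// Guest code depends only on the trait, so the data-heavy implementation
/// can be swapped without rebuilding the guest logic.
fn sort_with(collator: &dyn Collator, items: &mut [&str]) {
    items.sort_by(|a, b| collator.compare(a, b));
}
```

With the placeholder, sorting `["b", "a", "Z"]` puts `"Z"` first because uppercase letters have lower code points; a real locale-aware collator would typically order these differently, which is the kind of behavior the host data would supply.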

sunfishcode commented 8 months ago

Thanks! The main thing this needs now is for some people to volunteer to be champions, who can put together an API proposal, and ideally also a prototype implementation.

guybedford commented 8 months ago

Just to note: having a standard interface here, available either as a component or as a host API, would be directly useful for JS component runtimes like ComponentizeJS and Fastly's JS Compute Runtime to support ECMA-402.