dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[wasm][globalization][icu] Tracking issue for HybridGlobalization: Web API + ICU #79989

Closed ilonatommy closed 1 month ago

ilonatommy commented 1 year ago

The task is to remove as much data from the ICU files as possible and to replace the ICU4C functions that use this data with platform-native functions; in the case of WASM, with Web APIs. Because we cannot get rid of the ICU data file completely (some functionality is not easily replaceable), we will keep loading icudt.dat in a reduced form. This mode will be called HybridGlobalization and will be switched off by default. Users can switch it on by setting the MSBuild property <HybridGlobalization> to true. PoC branch is here: https://github.com/dotnet/runtime/compare/main...ilonatommy:runtime:icu-platform-native.
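As an illustration of delegating to a platform-native function, string comparison could be routed to the browser's `Intl.Collator` instead of ICU collation data. This is only a minimal sketch: the helper name `compareWithWebApi` and the way the sensitivity option is exposed are assumptions made for this example, not the runtime's actual implementation.

```typescript
// Hypothetical helper (not taken from dotnet/runtime): compares two strings
// using the Web API Intl.Collator rather than ICU collation data.
type Sensitivity = "base" | "accent" | "case" | "variant";

function compareWithWebApi(
    a: string,
    b: string,
    locale?: string,
    sensitivity: Sensitivity = "variant"
): number {
    // "base" ignores case and accents, "accent" ignores case only,
    // "case" ignores accents only, "variant" ignores neither.
    const collator = new Intl.Collator(locale, { sensitivity });
    const result = collator.compare(a, b);
    // Clamp to -1, 0 or 1, the shape a managed Compare caller would expect.
    return result < 0 ? -1 : result > 0 ? 1 : 0;
}
```

A call such as `compareWithWebApi("a", "A", "en", "base")` treats the two strings as equal, while the default `"variant"` sensitivity distinguishes them.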

1) Removing collations for WASM

2) Removing normalization for WASM: Removed from planned Hybrid features. Savings from normalization removal on WASM are ~60kB. The removal breaks public APIs: string.Normalize, string.IsNormalized, IdnMapping.GetAscii, IdnMapping.GetUnicode. Normalize/IsNormalized were successfully replaced in https://github.com/dotnet/runtime/pull/85510. For the GetAscii/GetUnicode replacement, the Invariant implementation enhanced by a normalization step was used, see branch https://github.com/ilonatommy/runtime/tree/idn-mapping. The mapping still lacked detection of disallowed/ignored/mapped characters and would need access to the MappingTables of the current Unicode version, e.g. to detect incorrect inputs and throw. The mapping table for one Unicode version weighs ~900kB in plain text. Even if we compressed it, we would still need to update it with every Unicode version. Given the development time needed for a correct implementation and the need to keep the mapping tables, the chances of a real size reduction are too small to justify removing normalization data from ICU.

3) Investigate implications of removing further data batches, e.g. check the effect of removing all collations, coll_ucadata, locales_tree etc.

4) (optional) Enhancement of collations by manual workarounds:

5) (optional) Consider failing the build when a function that is not supported in HybridGlobalization mode is used
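For item 2 above, the Web API counterpart of Normalize/IsNormalized is `String.prototype.normalize`. The sketch below shows the idea only; the helper names are hypothetical and this is not necessarily how the linked PR implements it.

```typescript
// Hypothetical helpers (names are illustrative): normalization through the
// Web API instead of ICU normalization data.
type NormalizationForm = "NFC" | "NFD" | "NFKC" | "NFKD";

function normalizeWithWebApi(value: string, form: NormalizationForm = "NFC"): string {
    // String.prototype.normalize accepts exactly these four form names.
    return value.normalize(form);
}

function isNormalizedWithWebApi(value: string, form: NormalizationForm = "NFC"): boolean {
    // A string is normalized in a given form if normalizing leaves it unchanged.
    return value.normalize(form) === value;
}
```

For example, `normalizeWithWebApi("e\u0301", "NFC")` composes the letter and the combining acute accent into the single code point U+00E9.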

Tracking issues:

https://github.com/dotnet/runtime/issues/101912
https://github.com/dotnet/runtime/issues/102305
https://github.com/dotnet/runtime/issues/102373
https://github.com/dotnet/runtime/issues/95921
https://github.com/dotnet/runtime/issues/95795
https://github.com/dotnet/runtime/issues/95623

ghost commented 1 year ago

Tagging subscribers to this area: @dotnet/area-system-globalization. See info in area-owners.md if you want to be subscribed.

Issue Details
The task is to remove as much data from ICU files as possible and exchange ICU4C functions that are using this data with platform native functions - in the case of WASM with Web API. Because we are not able to get rid of ICU datafile completely (some functionalities are not easily replaceable) we will keep loading `icudt.dat` in a reduced form. This mode will be called `HybridGlobalization` and will be by default switched off. User can switch it on by setting MSBuild's `<HybridGlobalization>` to true. PoC branch is here: https://github.com/ilonatommy/runtime/tree/icu-platform-native.

1) Removing `collations/standard` for WASM
- [ ] Prepare `icudt_wasm.dat` and corresponding sharded datafiles without `collations/standard`, enable setting `HybridGlobalization` and write WBT checking if the new file got loaded instead of the old one

2) Removing `normalization` for WASM
- [ ] Update `icudt_wasm.dat` and corresponding sharded datafiles
- [ ] Implement Punycode, might be using [this algorithm](www.npmjs.com/package/punycode).
- [ ] Use normalization from the PoC branch.

....
Author: ilonatommy
Assignees: ilonatommy, mkhamoyan
Labels: `area-System.Globalization`
Milestone: 8.0.0
ghost commented 1 year ago

Tagging subscribers to 'arch-wasm': @lewing. See info in area-owners.md if you want to be subscribed.

Labels: `arch-wasm`, `area-System.Globalization`
Milestone: 8.0.0
ilonatommy commented 1 month ago

Closing, the planned work for HybridGlobalization was completed.