dahosek / finl_unicode

Unicode support for the finl project
Apache License 2.0

Unicode 16.0.0 #20

Kijewski closed 4 weeks ago

Kijewski commented 1 month ago

Unicode 16.0.0 was released on 2024-09-10. I tested whether updating the line linked below and running generate-sources was all that needed to be done to update to a new version, but the generated tests contain a failing assertion.

https://github.com/dahosek/finl_unicode/blob/e7f53deaa9413edc459cf01db8816a36691b814f/generate-sources/src/main.rs#L12

--- STDERR:              finl_unicode data::grapheme_test::standard_grapheme_test ---
thread 'data::grapheme_test::standard_grapheme_test' panicked at src/grapheme_clusters.rs:463:9:
assertion `left == right` failed: Lengths did not match on Grapheme Cluster
      ÷ [0.2] DEVANAGARI LETTER KA (ConjunctLinkingScripts_LinkingConsonant) × [9.0] DEVANAGARI SIGN VIRAMA (Extend_ConjunctLinkingScripts_ConjunctLinker_ExtCccZwj) × [9.3] DEVANAGARI LETTER TA (ConjunctLinkingScripts_LinkingConsonant) ÷ [0.3]
    Output: ["क\u{94d}", "त"]
    Expected: ["क\u{94d}त"]
  left: 2
 right: 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

The changelog of Unicode 16.0.0 contains this text, which might be relevant:

There has also been a change to the Grapheme_Cluster_Break property data, extending the use of GCB=V to apply to certain non-Hangul vowels, and in particular for Kirat Rai vowels. This change finesses the behavior of the segmentation of grapheme cluster breaks in such cases, while respecting normalization requirements and canonical equivalence. Implementations should take note that GCB=V and HST=V are no longer coextensive. See UAX #29 for details.

dahosek commented 1 month ago

Unicode 15.1.0, somewhat ironically given its version number, changed the rules for clustering in Indic scripts. I’m working on an update to the code and will push it together with the update for 16.0.0 sometime in the near future.
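
For anyone following along, here is a rough sketch of the kind of rule the failing test exercises (GB9c in UAX #29: no break between a linking consonant, a conjunct linker, and the next linking consonant). This is not the finl_unicode implementation. The names `InCb`, `incb`, `gb9c_joins`, and `segment` are hypothetical, the property table is a toy covering only the code points from the test case plus ZWJ, and every other segmentation rule is ignored.

```rust
// Toy stand-in for the Indic_Conjunct_Break (InCB) property.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum InCb {
    Consonant,
    Linker,
    Extend,
    None,
}

// ASSUMPTION: a hand-written table for illustration only. A real
// implementation would generate this from the Unicode data files.
fn incb(c: char) -> InCb {
    match c {
        '\u{0915}' | '\u{0924}' => InCb::Consonant, // DEVANAGARI LETTER KA, TA
        '\u{094D}' => InCb::Linker,                 // DEVANAGARI SIGN VIRAMA
        '\u{200D}' => InCb::Extend,                 // ZERO WIDTH JOINER
        _ => InCb::None,
    }
}

// GB9c-style check: `next` must be a consonant, and scanning backwards we
// must pass only Extend/Linker characters (at least one Linker) before
// reaching another consonant.
fn gb9c_joins(before: &[char], next: char) -> bool {
    if incb(next) != InCb::Consonant {
        return false;
    }
    let mut seen_linker = false;
    for &c in before.iter().rev() {
        match incb(c) {
            InCb::Linker => seen_linker = true,
            InCb::Extend => {}
            InCb::Consonant => return seen_linker,
            InCb::None => return false,
        }
    }
    false
}

// Deliberately naive segmenter: a character joins the previous cluster if it
// is a Linker/Extend (standing in for "do not break before Extend") or if the
// GB9c-style check holds. Everything else starts a new cluster.
fn segment(s: &str) -> Vec<String> {
    let chars: Vec<char> = s.chars().collect();
    let mut clusters: Vec<String> = Vec::new();
    for (i, &c) in chars.iter().enumerate() {
        let attach = i > 0
            && (matches!(incb(c), InCb::Linker | InCb::Extend)
                || gb9c_joins(&chars[..i], c));
        if attach {
            // Safe: i > 0 means at least one cluster already exists.
            clusters.last_mut().unwrap().push(c);
        } else {
            clusters.push(c.to_string());
        }
    }
    clusters
}
```

With this sketch, `segment("\u{0915}\u{094D}\u{0924}")` yields the single cluster the test expects, whereas a segmenter without the GB9c-style check would stop after the virama and produce the two-cluster output shown in the failure above.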

dahosek commented 4 weeks ago

Fixed with new release