BreakIterator.GetCharacterInstance() - results differ from ICU4J

michaelhkay commented 4 months ago

I'm getting different results for a number of tests (9 out of 919, so not too bad...). The test names relate to XPath 4.0 tests in

https://github.com/qt4cg/qt4tests/blob/master/fn/graphemes.xml

graphemes-1172 Input: "a🏿👶" (U+0061 U+1F3FF U+1F476) Java result: 2 strings, "a🏿", "👶" (U+0061 U+1F3FF ; U+1F476) C# result: 3 separate single-codepoint strings.

graphemes-1173 Input: "a🏿👶‍🛑" (U+0061 U+1F3FF U+1F476 U+200D U+1F6D1) Java result: 2 strings, "a🏿", "👶‍🛑" (U+0061 U+1F3FF ; U+1F476 U+200D U+1F6D1) C# result: 3 strings of lengths 1, 1, 3 code points respectively.
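For anyone wanting to reproduce the comparison, here is a minimal sketch that splits a string into grapheme clusters and prints the code points of each. It uses the JDK's java.text.BreakIterator as a stand-in so that it runs without dependencies. That is an assumption made for illustration: the class name GraphemeDump and its helpers are hypothetical, and the JDK's boundaries vary by Java version and are not guaranteed to match ICU4J. ICU4J's com.ibm.icu.text.BreakIterator exposes the same getCharacterInstance(), setText(), first() and next() calls, so swapping the import should be enough to test ICU4J itself.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class GraphemeDump {

    // Split text at the boundaries reported by the character break iterator.
    static List<String> clusters(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    // Render a string as "U+XXXX U+XXXX ..." so invisible characters are easy to see.
    static String codePoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // graphemes-1172 input: U+0061 U+1F3FF U+1F476
        String input = "a\uD83C\uDFFF\uD83D\uDC76";
        for (String cluster : clusters(input)) {
            System.out.println(codePoints(cluster));
        }
    }
}
```

Each output line is one cluster, so the number of lines can be compared directly between implementations.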

Other failures, not specifically analysed (happy to send the results if needed):

graphemes-1180 graphemes-1181 graphemes-1182 graphemes-1183 graphemes-1184 graphemes-1185 graphemes-1189

It's possible of course that it's a Unicode version issue.

NightOwl888 commented 4 months ago

Thanks for the report.

You are correct, it could be related to the version of the resource data. There are differences in the char.txt file between 60.1 and 75.1:

https://github.com/unicode-org/icu/blob/release-60-1/icu4c/source/data/brkitr/rules/char.txt https://github.com/unicode-org/icu/blob/release-75-1/icu4c/source/data/brkitr/rules/char.txt

I believe the crux of the issue is this change:

60.1

# Emoji defintions

$E_Base      = [\p{Grapheme_Cluster_Break = EB}];
$E_Modifier  = [\p{Grapheme_Cluster_Break = EM}];

# Data for Extended Pictographic scraped from CLDR common/properties/ExtendedPictographic.txt, r13267
$Extended_Pict = [\U0001F774-\U0001F77F\U00002700-\U00002701\U00002703-\U00002704\U0000270E\U00002710-\U00002711\U00002765-\U00002767\U0001F030-\U0001F093\U0001F094-\U0001F09F\U0001F10D-\U0001F10F\U0001F12F\U0001F16C-\U0001F16F\U0001F1AD-\U0001F1E5\U0001F260-\U0001F265\U0001F203-\U0001F20F\U0001F23C-\U0001F23F\U0001F249-\U0001F24F\U0001F252-\U0001F25F\U0001F266-\U0001F2FF\U0001F7D5-\U0001F7FF\U0001F000-\U0001F003\U0001F005-\U0001F02B\U0001F02C-\U0001F02F\U0001F322-\U0001F323\U0001F394-\U0001F395\U0001F398\U0001F39C-\U0001F39D\U0001F3F1-\U0001F3F2\U0001F3F6\U0001F4FE\U0001F53E-\U0001F548\U0001F54F\U0001F568-\U0001F56E\U0001F571-\U0001F572\U0001F57B-\U0001F586\U0001F588-\U0001F589\U0001F58E-\U0001F58F\U0001F591-\U0001F594\U0001F597-\U0001F5A3\U0001F5A6-\U0001F5A7\U0001F5A9-\U0001F5B0\U0001F5B3-\U0001F5BB\U0001F5BD-\U0001F5C1\U0001F5C5-\U0001F5D0\U0001F5D4-\U0001F5DB\U0001F5DF-\U0001F5E0\U0001F5E2\U0001F5E4-\U0001F5E7\U0001F5E9-\U0001F5EE\U0001F5F0-\U0001F5F2\U0001F5F4-\U0001F5F9\U00002605\U00002607-\U0000260D\U0000260F-\U00002610\U00002612\U00002616-\U00002617\U00002619-\U0000261C\U0000261E-\U0000261F\U00002621\U00002624-\U00002625\U00002627-\U00002629\U0000262B-\U0000262D\U00002630-\U00002637\U0000263B-\U00002647\U00002654-\U0000265F\U00002661-\U00002662\U00002664\U00002667\U00002669-\U0000267A\U0000267C-\U0000267E\U00002680-\U00002691\U00002695\U00002698\U0000269A\U0000269D-\U0000269F\U000026A2-\U000026A9\U000026AC-\U000026AF\U000026B2-\U000026BC\U000026BF-\U000026C3\U000026C6-\U000026C7\U000026C9-\U000026CD\U000026D0\U000026D2\U000026D5-\U000026E8\U000026EB-\U000026EF\U000026F6\U000026FB-\U000026FC\U000026FE-\U000026FF\U00002388\U0001FA00-\U0001FFFD\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F0AF-\U0001F0B0\U0001F0C0\U0001F0D0\U0001F0F6-\U0001F0FF\U0001F80C-\U0001F80F\U0001F848-\U0001F84F\U0001F85A-\U0001F85F\U0001F888-\U0001F88F\U0001F8AE-\U0001F8FF\U0001F900-\U0001F90B\U0001F91F\U0001F928-\U0001F92F\U0001F931-\U0001F9
32\U0001F94C\U0001F95F-\U0001F96B\U0001F992-\U0001F997\U0001F9D0-\U0001F9E6\U0001F90C-\U0001F90F\U0001F93F\U0001F94D-\U0001F94F\U0001F96C-\U0001F97F\U0001F998-\U0001F9BF\U0001F9C1-\U0001F9CF\U0001F9E7-\U0001F9FF\U0001F6C6-\U0001F6CA\U0001F6D3-\U0001F6D4\U0001F6E6-\U0001F6E8\U0001F6EA\U0001F6F1-\U0001F6F2\U0001F6F7-\U0001F6F8\U0001F6D5-\U0001F6DF\U0001F6ED-\U0001F6EF\U0001F6F9-\U0001F6FF];
$E_Base_GAZ  = [\p{Grapheme_Cluster_Break = EBG}];
$EmojiNRK    = [[\p{Emoji}] - [\p{Grapheme_Cluster_Break = Regional_Indicator}*\u00230-9©®™〰〽]];

75.1

# Emoji definitions

$Extended_Pict = [:ExtPict:];

RuleBasedBreakIterator supports two data formats:

Plain text (string).
Binary .brk. This must be in big endian format (ICU4C uses little endian). The compiled file can be located in the maven package (available at Maven Central) by unzipping it and navigating to com\ibm\icu\impl\data\icudt75b\brkitr.

Unfortunately, the binary files are not portable across ICU versions. But it is possible to create one by importing the rules as text and then exporting the binary file.
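To check which byte order and format version a given .brk file carries before trying to load it, something like the following sketch may help. It reads the common ICU data header and reports the endianness flag, the data format tag and the format version. The field offsets are taken from the header layout in ICU's udata.h as I understand it (magic bytes 0xDA 0x27 at offsets 2 and 3, isBigEndian at offset 8, dataFormat at 12, formatVersion at 16); treat them as assumptions and verify against the ICU sources. The class name IcuHeaderInfo is hypothetical and not part of ICU4J or ICU4N.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class IcuHeaderInfo {

    // Describe the first 24 bytes of an ICU data file (assumed layout, see lead-in).
    static String describe(byte[] h) {
        if (h.length < 24 || (h[2] & 0xFF) != 0xDA || (h[3] & 0xFF) != 0x27) {
            throw new IllegalArgumentException("Not an ICU data header");
        }
        boolean bigEndian = h[8] != 0;                       // UDataInfo.isBigEndian
        String dataFormat = new String(h, 12, 4, StandardCharsets.US_ASCII).trim();
        String formatVersion = (h[16] & 0xFF) + "." + (h[17] & 0xFF) + "."
                + (h[18] & 0xFF) + "." + (h[19] & 0xFF);     // UDataInfo.formatVersion
        return dataFormat + " " + (bigEndian ? "big-endian" : "little-endian")
                + " formatVersion=" + formatVersion;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .brk file, e.g. one extracted from the ICU4J jar
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(describe(in.readNBytes(24)));
        }
    }
}
```

A file that reports little-endian, or a format version that differs from what the 60.1 reader expects, would need to be regenerated from the text rules as described in the steps that follow.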

Step 1 - Modify

ICU 60.1 doesn't support the property [:ExtPict:] directly. So, the easiest way to fix that would be to use UnicodeSet in the latest version of ICU4J to generate a pattern string that will work in 60.1.

UnicodeSet set = new UnicodeSet("[:ExtPict:]");
String pattern = set.toPattern(true);

Use the result to replace the $Extended_Pict variable.
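The replacement can be done by hand, but if you want to script it, a minimal sketch might look like this. It assumes the $Extended_Pict assignment sits on a single line ending in a semicolon, as it does in the 60.1 excerpt above. The class and method names are hypothetical. Matcher.quoteReplacement matters here because the generated pattern contains backslashes and the variable name contains a dollar sign, both of which are special in regex replacement strings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtendedPictPatch {

    // Matches the whole single-line "$Extended_Pict = ...;" assignment.
    private static final Pattern ASSIGNMENT =
            Pattern.compile("(?m)^\\$Extended_Pict\\s*=.*;[ \\t]*$");

    // Returns the rules text with the $Extended_Pict definition swapped for newSetPattern.
    static String replaceExtendedPict(String rules, String newSetPattern) {
        Matcher m = ASSIGNMENT.matcher(rules);
        if (!m.find()) {
            throw new IllegalArgumentException("No $Extended_Pict assignment found");
        }
        String replacement = "$Extended_Pict = " + newSetPattern + ";";
        // quoteReplacement keeps '\' and '$' literal in the replacement text
        return m.replaceFirst(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Tiny stand-in for char.txt; the real input would be the 60.1 rules file
        String rules = "# Emoji defintions\n"
                + "$Extended_Pict = [\\U0001F774-\\U0001F77F];\n"
                + "$E_Base_GAZ  = [\\p{Grapheme_Cluster_Break = EBG}];\n";
        // Stand-in for the string produced by set.toPattern(true)
        String generated = "[\\u00A9\\u00AE]";
        System.out.print(replaceExtendedPict(rules, generated));
    }
}
```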

I am not sure whether any other changes are required. Consult the Break Rules docs.

Note that loading the rules via string to test them can be done using the new RuleBasedBreakIterator(string) constructor.

Step 2 - Compile

Once you have updated the char.txt file from 60.1 with the pattern string, the next step is to compile it to a .brk file on either ICU4N or ICU4J 60.1. The format is identical. I am showing the code in C#.


// This is the modified char.txt file to compile
string rules = System.IO.File.ReadAllText(@"path/to/char.txt", System.Text.Encoding.UTF8);

// This is the modified char.brk file that will be output
using (var binaryOutputStream = new System.IO.FileStream(@"path/to/custom-char.brk", System.IO.FileMode.CreateNew, System.IO.FileAccess.ReadWrite))
{
    RuleBasedBreakIterator.CompileRules(rules, binaryOutputStream);
}

Step 3 - Load

Once you have a .brk file, you can either:

Add the .brk file to a custom NuGet .nupkg file. This requires compiling the file into the root ICU4N.dll satellite assembly with the al.exe tool, which is only supported on Windows, and then packing it along with the rest of the standard resource files. The process can be simplified by using a modified version of the ICU4JResourceConverter (which is not yet documented and not yet available in binary form) together with some build bits from the ICU4N.csproj file.
Add the .brk file to the data folder as described in the README. This is not ideal for wide distribution because the data files are not shadow copied on Internet Information Services, which breaks X-copy deployment while the files are locked by the OS.
Load the .brk file manually using RuleBasedBreakIterator. The code for this is below.


// This will read the modified char.brk file
using var rulesInputStream = new System.IO.FileStream(@"path/to/custom-char.brk", System.IO.FileMode.Open, System.IO.FileAccess.Read);
BreakIterator breakIterator = RuleBasedBreakIterator.GetInstanceFromCompiledRules(rulesInputStream);

Of course, another option could be to simply accept there will be differences between versions.

TIP: You can use the UnicodeSet to check differences between characters in different versions of Unicode. Of course, you will need to check on a later version of ICU to see what changed since ICU 60.1.

import com.ibm.icu.text.UnicodeSet;

public class UnicodeSetDifference {

    public static void main(String[] args) {
        // Unicode versions corresponding to ICU4J 60.1 and 75.1
        String unicodeVersion60 = "10.0";
        String unicodeVersion75 = "15.1";

        // Initialize UnicodeSets for the specified Unicode versions
        UnicodeSet set60 = new UnicodeSet("[\\p{age=" + unicodeVersion60 + "}]");
        UnicodeSet set75 = new UnicodeSet("[\\p{age=" + unicodeVersion75 + "}]");

        // Create a copy of set75 to find the difference
        UnicodeSet difference = new UnicodeSet(set75);

        // Remove all characters that are in set60 from the difference set
        difference.removeAll(set60);

        // Print the difference
        System.out.println("Characters in ICU4J 75.1 that are not in ICU4J 60.1: " + difference);
    }
}

michaelhkay commented 4 months ago

Many thanks for your analysis. I'm assuming this will be fixed in a future release? Happy to do nothing until that happens.

NightOwl888 commented 4 months ago

Yes, these differences will go away when ICU4N is upgraded. I think we can consider this closed, but feel free to reopen if there is a related bug.

michaelhkay commented 2 weeks ago

After upgrading to 60.1.0-alpha-436, only one of these tests is now failing, namely graphemes-1172.

NightOwl888 / ICU4N