Closed michaelhkay closed 4 months ago
Thanks for the report.
You are correct, it could have something to do with the version of the resources. There are differences in the char.txt file between 60.1 and 75.1:
https://github.com/unicode-org/icu/blob/release-60-1/icu4c/source/data/brkitr/rules/char.txt
https://github.com/unicode-org/icu/blob/release-75-1/icu4c/source/data/brkitr/rules/char.txt
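I believe the crux of the issue is this change, where the hard-coded emoji definitions in 60.1 (first block) were replaced in 75.1 by a single property reference (last two lines):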
# Emoji defintions
$E_Base = [\p{Grapheme_Cluster_Break = EB}];
$E_Modifier = [\p{Grapheme_Cluster_Break = EM}];
# Data for Extended Pictographic scraped from CLDR common/properties/ExtendedPictographic.txt, r13267
$Extended_Pict = [\U0001F774-\U0001F77F\U00002700-\U00002701\U00002703-\U00002704\U0000270E\U00002710-\U00002711\U00002765-\U00002767\U0001F030-\U0001F093\U0001F094-\U0001F09F\U0001F10D-\U0001F10F\U0001F12F\U0001F16C-\U0001F16F\U0001F1AD-\U0001F1E5\U0001F260-\U0001F265\U0001F203-\U0001F20F\U0001F23C-\U0001F23F\U0001F249-\U0001F24F\U0001F252-\U0001F25F\U0001F266-\U0001F2FF\U0001F7D5-\U0001F7FF\U0001F000-\U0001F003\U0001F005-\U0001F02B\U0001F02C-\U0001F02F\U0001F322-\U0001F323\U0001F394-\U0001F395\U0001F398\U0001F39C-\U0001F39D\U0001F3F1-\U0001F3F2\U0001F3F6\U0001F4FE\U0001F53E-\U0001F548\U0001F54F\U0001F568-\U0001F56E\U0001F571-\U0001F572\U0001F57B-\U0001F586\U0001F588-\U0001F589\U0001F58E-\U0001F58F\U0001F591-\U0001F594\U0001F597-\U0001F5A3\U0001F5A6-\U0001F5A7\U0001F5A9-\U0001F5B0\U0001F5B3-\U0001F5BB\U0001F5BD-\U0001F5C1\U0001F5C5-\U0001F5D0\U0001F5D4-\U0001F5DB\U0001F5DF-\U0001F5E0\U0001F5E2\U0001F5E4-\U0001F5E7\U0001F5E9-\U0001F5EE\U0001F5F0-\U0001F5F2\U0001F5F4-\U0001F5F9\U00002605\U00002607-\U0000260D\U0000260F-\U00002610\U00002612\U00002616-\U00002617\U00002619-\U0000261C\U0000261E-\U0000261F\U00002621\U00002624-\U00002625\U00002627-\U00002629\U0000262B-\U0000262D\U00002630-\U00002637\U0000263B-\U00002647\U00002654-\U0000265F\U00002661-\U00002662\U00002664\U00002667\U00002669-\U0000267A\U0000267C-\U0000267E\U00002680-\U00002691\U00002695\U00002698\U0000269A\U0000269D-\U0000269F\U000026A2-\U000026A9\U000026AC-\U000026AF\U000026B2-\U000026BC\U000026BF-\U000026C3\U000026C6-\U000026C7\U000026C9-\U000026CD\U000026D0\U000026D2\U000026D5-\U000026E8\U000026EB-\U000026EF\U000026F6\U000026FB-\U000026FC\U000026FE-\U000026FF\U00002388\U0001FA00-\U0001FFFD\U0001F0A0-\U0001F0AE\U0001F0B1-\U0001F0BF\U0001F0C1-\U0001F0CF\U0001F0D1-\U0001F0F5\U0001F0AF-\U0001F0B0\U0001F0C0\U0001F0D0\U0001F0F6-\U0001F0FF\U0001F80C-\U0001F80F\U0001F848-\U0001F84F\U0001F85A-\U0001F85F\U0001F888-\U0001F88F\U0001F8AE-\U0001F8FF\U0001F900-\U0001F90B\U0001F91F\U0001F928-\U0001F92F\U0001F931-\U0001F932\U0001F94C\U0001F95F-\U0001F96B\U0001F992-\U0001F997\U0001F9D0-\U0001F9E6\U0001F90C-\U0001F90F\U0001F93F\U0001F94D-\U0001F94F\U0001F96C-\U0001F97F\U0001F998-\U0001F9BF\U0001F9C1-\U0001F9CF\U0001F9E7-\U0001F9FF\U0001F6C6-\U0001F6CA\U0001F6D3-\U0001F6D4\U0001F6E6-\U0001F6E8\U0001F6EA\U0001F6F1-\U0001F6F2\U0001F6F7-\U0001F6F8\U0001F6D5-\U0001F6DF\U0001F6ED-\U0001F6EF\U0001F6F9-\U0001F6FF];
$E_Base_GAZ = [\p{Grapheme_Cluster_Break = EBG}];
$EmojiNRK = [[\p{Emoji}] - [\p{Grapheme_Cluster_Break = Regional_Indicator}*\u00230-9©®™〰〽]];
# Emoji definitions
$Extended_Pict = [:ExtPict:];
There are 2 data formats that are supported by RuleBasedBreakIterator: rules as text (such as char.txt) and compiled binary .brk files.
A .brk file must be in big endian format (ICU4C uses little endian). The compiled file can be located in the Maven package (available at Maven Central) by unzipping it and navigating to com\ibm\icu\impl\data\icudt75b\brkitr. Unfortunately, the binary files are not portable across ICU versions, but it is possible to create one by importing the rules as text and then exporting the binary file.
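Before trying a .brk file taken from another distribution, it can help to look at its header to see the byte order and format version it was built with. The following is only a rough sketch using plain Java. The byte offsets are an assumption based on the common ICU binary data header layout (magic bytes 0xDA 0x27, then an endianness flag, a four-byte data format tag and a four-byte format version); they are not taken from this thread and should be checked against the ICU sources. The class name BrkHeaderCheck and the sample header values are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class BrkHeaderCheck {
    // ASSUMED layout of the ICU data header (verify against the ICU sources):
    // [0-1] header size, [2-3] magic 0xDA 0x27, [4-7] info size + reserved,
    // [8] isBigEndian, [9] charset family, [10] sizeof(UChar), [11] reserved,
    // [12-15] data format tag, [16-19] format version, [20-23] data version.
    private static void requireIcuHeader(byte[] b) {
        if (b.length < 24 || (b[2] & 0xFF) != 0xDA || (b[3] & 0xFF) != 0x27) {
            throw new IllegalArgumentException("Not an ICU binary data file");
        }
    }

    /** True when the header flags the file as big endian. */
    public static boolean isBigEndian(byte[] b) {
        requireIcuHeader(b);
        return b[8] == 1;
    }

    /** Four-character data format tag, read as ASCII. */
    public static String dataFormat(byte[] b) {
        requireIcuHeader(b);
        return new String(b, 12, 4, StandardCharsets.US_ASCII);
    }

    /** Format version as a dotted string, one number per byte. */
    public static String formatVersion(byte[] b) {
        requireIcuHeader(b);
        return (b[16] & 0xFF) + "." + (b[17] & 0xFF) + "."
             + (b[18] & 0xFF) + "." + (b[19] & 0xFF);
    }

    /** Synthetic header with made-up values, used when no file is given. */
    public static byte[] sampleHeader() {
        byte[] h = new byte[24];
        h[2] = (byte) 0xDA;
        h[3] = (byte) 0x27;
        h[8] = 1; // big endian
        byte[] tag = "Brk ".getBytes(StandardCharsets.US_ASCII); // assumed tag
        System.arraycopy(tag, 0, h, 12, 4);
        h[16] = 9; // hypothetical format version 9.0.0.0
        return h;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : sampleHeader();
        System.out.println("big endian:     " + isBigEndian(bytes));
        System.out.println("data format:    " + dataFormat(bytes));
        System.out.println("format version: " + formatVersion(bytes));
    }
}
```

Run it with the path to a .brk file as the only argument. A file that reports little endian, or a format version different from the one your ICU release writes, is unlikely to load.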
ICU 60.1 doesn't support the property [:ExtPict:] directly. So, the easiest way to fix that would be to use UnicodeSet in the latest version of ICU4J to generate a pattern string that will work in 60.1.
UnicodeSet set = new UnicodeSet("[:ExtPict:]");
String pattern = set.toPattern(true);
Use the result to replace the $Extended_Pict variable.
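Swapping the definition by hand is easy to get wrong because the line is several thousand characters long. As a rough sketch, the replacement can be scripted with plain Java. The class name ExtendedPictPatch and the short set pattern in main are placeholders, not the real output of toPattern, and the sketch assumes the definition sits on a single line ending in a semicolon, as it does in the 60.1 char.txt.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtendedPictPatch {
    // Matches a single-line "$Extended_Pict = ...;" definition.
    private static final Pattern DEFINITION =
            Pattern.compile("^\\$Extended_Pict\\s*=.*;[ \\t]*$", Pattern.MULTILINE);

    /** Returns the rules text with the $Extended_Pict definition replaced. */
    public static String patch(String rules, String setPattern) {
        Matcher m = DEFINITION.matcher(rules);
        if (!m.find()) {
            throw new IllegalArgumentException("No $Extended_Pict definition found");
        }
        String replacement = "$Extended_Pict = " + setPattern + ";";
        // quoteReplacement keeps the '$' and '\' characters literal.
        return m.replaceFirst(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String rules = "# Emoji defintions\n"
                + "$Extended_Pict = [\\U0001F774-\\U0001F77F];\n"
                + "$E_Base_GAZ = [\\p{Grapheme_Cluster_Break = EBG}];\n";
        // Placeholder set; in practice this is the string from toPattern(true).
        String generated = "[\\u00A9\\u00AE]";
        System.out.print(patch(rules, generated));
    }
}
```

The call to Matcher.quoteReplacement matters here: without it, the backslashes and the dollar sign in the rule text would be interpreted as escape and group references by replaceFirst.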
I am not sure whether any other changes are required. Consult the Break Rules docs.
Note that loading the rules via string to test them can be done using the new RuleBasedBreakIterator(string) constructor.
Once you have updated the char.txt file from 60.1 with the pattern string, the next step is to compile the rules to a .brk file on either ICU4N or ICU4J 60.1. The format is identical. I am showing the code in C#.
// This is the modified char.txt file to compile
string rules = System.IO.File.ReadAllText(@"path/to/char.txt", System.Text.Encoding.UTF8);
// This is the modified char.brk file that will be output
using (var binaryOutputStream = new System.IO.FileStream(@"path/to/custom-char.brk", System.IO.FileMode.CreateNew, System.IO.FileAccess.ReadWrite))
{
    RuleBasedBreakIterator.CompileRules(rules, binaryOutputStream);
}
Once you have a .brk file, you can either:
- Add the .brk file to a custom NuGet .nupkg file. This will require compiling the file into the root ICU4N.dll satellite assembly using the al.exe tool, which is only supported on Windows, then packing it along with the rest of the standard resource files. This can be simplified by using a modified version of the ICU4JResourceConverter (which is not yet documented and not yet available in binary form) and some build bits from the ICU4N.csproj file.
- Add the .brk file to the data folder as described in the README. This is not ideal for wide distribution because the data files are not shadow copied on Internet Information Services, which breaks X-copy deployment while the files are locked by the OS.
- Load the .brk file manually using RuleBasedBreakIterator. The code for this is below.
// This will read the modified char.brk file
using var rulesInputStream = new System.IO.FileStream(@"path/to/custom-char.brk", System.IO.FileMode.Open, System.IO.FileAccess.Read);
BreakIterator breakIterator = RuleBasedBreakIterator.GetInstanceFromCompiledRules(rulesInputStream);
Of course, another option could be to simply accept there will be differences between versions.
TIP: You can use UnicodeSet to check differences between characters in different versions of Unicode. Of course, you will need to check on a later version of ICU to see what changed since ICU 60.1.
import com.ibm.icu.text.UnicodeSet;

public class UnicodeSetDifference {
    public static void main(String[] args) {
        // Unicode versions corresponding to ICU4J 60.1 and 75.1
        String unicodeVersion60 = "10.0";
        String unicodeVersion75 = "15.1";

        // Initialize UnicodeSets for the specified Unicode versions
        UnicodeSet set60 = new UnicodeSet("[\\p{age=" + unicodeVersion60 + "}]");
        UnicodeSet set75 = new UnicodeSet("[\\p{age=" + unicodeVersion75 + "}]");

        // Create a copy of set75 to find the difference
        UnicodeSet difference = new UnicodeSet(set75);

        // Remove all characters that are in set60 from the difference set
        difference.removeAll(set60);

        // Print the difference
        System.out.println("Characters in ICU4J 75.1 that are not in ICU4J 60.1: " + difference);
    }
}
Many thanks for your analysis. I'm assuming this will be fixed in a future release? Happy to do nothing until that happens.
Yes, these differences will go away when ICU4N is upgraded. I think we can consider this closed, but feel free to reopen if there is a related bug.
After upgrading to 60.1.0-alpha-436, only one of these tests is now failing, namely graphemes-1172,
I'm getting different results for a number of tests (9 out of 919, so not too bad...). The test names relate to XPath 4.0 tests in
https://github.com/qt4cg/qt4tests/blob/master/fn/graphemes.xml
graphemes-1172 Input: "a🏿👶" (U+0061 U+1F3FF U+1F476) Java result: 2 strings, "a🏿", "👶" (U+0061 U+1F3FF ; U+1F476) C# result: 3 separate single-codepoint strings.
graphemes-1173 Input: "a🏿👶‍🛑" (U+0061 U+1F3FF U+1F476 U+200D U+1F6D1) Java result: 2 strings, "a🏿", "👶‍🛑" (U+0061 U+1F3FF ; U+1F476 U+200D U+1F6D1) C# result: 3 strings of lengths 1, 1, 3 respectively
Other failures, not specifically analysed (happy to send the results if needed):
graphemes-1180 graphemes-1181 graphemes-1182 graphemes-1183 graphemes-1184 graphemes-1185 graphemes-1189
It's possible of course that it's a Unicode version issue.
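When comparing segmentation results across platforms, it helps to print each returned string as code points, since emoji modifiers and U+200D are hard to tell apart on screen. The following is a minimal sketch in plain Java (no ICU needed) that produces the U+XXXX notation used for graphemes-1172 above. The class name CodePointDump is made up for illustration.

```java
import java.util.stream.Collectors;

public class CodePointDump {
    /** Formats a string as space-separated U+XXXX code points. */
    public static String describe(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Input of graphemes-1172, built from code points to avoid
        // depending on the source file encoding.
        String input = new String(new int[] {0x61, 0x1F3FF, 0x1F476}, 0, 3);
        System.out.println(describe(input));
        System.out.println(input.codePointCount(0, input.length())
                + " code points, " + input.length() + " UTF-16 units");
    }
}
```

Calling describe on each segment returned by the break iterator on both platforms gives output that can be diffed directly.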