cc @GrabYourPitchforks
Might https://github.com/dotnet/runtime/pull/328 solve your problem?
StringInfo.cs and TextElementEnumerator.cs: rewritten to use the new "extended grapheme cluster" logic in TextSegmentationUtility. Aside from changing the implementation to be UAX29-compliant, there's one additional behavioral…
Take a look at the StringInfo type. Those docs are a little outdated; we recently updated it (as of .NET 5.0+) to iterate over extended grapheme clusters in a UAX#29-compliant manner.
We also recently updated the existing StrReverse routine to understand grapheme clusters when reversing strings.
See also: https://github.com/dotnet/runtime/issues/30955, https://github.com/dotnet/runtime/issues/19423
An important note is that these APIs operate only on the concept of extended grapheme clusters, which per UAX#29 are not linguistic or culture-aware. Once you have broken down a string into its component grapheme clusters, you could use other APIs like string.Normalize to convert them into a uniform representation. Though if your scenario is primarily comparing two strings rather than converting one string into another representation, the easiest way to compare them would be to use CompareInfo.Compare or string.Equals(..., StringComparison.InvariantCulture). Those will perform a linguistic comparison, such as treating "ae" and "æ" as equivalent.
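For a concrete illustration, here's a quick sketch of how those building blocks fit together against the .NET 5.0+ behavior described above (the strings and variable names are just example data):

```csharp
using System;
using System.Globalization;

string precomposed = "caf\u00E9";   // "café" with a precomposed é (U+00E9)
string combining   = "cafe\u0301";  // "café" built from 'e' + combining acute accent (U+0301)

// StringInfo enumerates extended grapheme clusters ("text elements").
TextElementEnumerator te = StringInfo.GetTextElementEnumerator(combining);
while (te.MoveNext())
    Console.WriteLine(te.GetTextElement());   // c, a, f, é — four clusters, not five chars

// A linguistic comparison treats the two encodings as the same text.
bool equal = string.Equals(precomposed, combining, StringComparison.InvariantCulture); // true
```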
Given this, is there a particular set of APIs you still believe to be lacking?
To clarify my above comments: you'll need to be on a nightly build to see the recent changes to StringInfo and StrReverse. If you're on .NET Core 3.1 or earlier, or if you're on any version of .NET Framework, you'll continue to see the old behavior.
I'll just implement it in my own libraries.
Thank you for your time.
Ok. I'll close this for now, but please reopen if you find that there's a missing building block that the framework can provide.
Quick aside: I've been putting this together on mobile during bits of downtime. I'll add in the formal API changes after.
Teal Deer
Add a new type and APIs to support working with text on a grapheme basis, rather than the byte, code unit, or scalar value basis that is currently supported.
Background
I develop and maintain a set of libraries for text processing and manipulation, a parser declaration framework, and stream extensions.
Problems
There are a number of problems that programmers working with text face and might not even be aware of. I will elaborate on a few of them.
Palindromes
Programmatically determining whether a string is a palindrome is immensely more complicated than nearly every programmer believes. It's a complex enough matter that I've written an entire article solely on its complexities.
Simply reversing and comparing only addresses a very small subset of palindromes.
You can address a decently large subset of palindromes by properly cleaning up the string beforehand, or by manually managing two cursors from each end.
But note that I said subset. I was surprised when I found this problem, but in retrospect I understand why it happened. Consider the contrived palindrome "café éfac!" It's obviously a palindrome. Reversing it may or may not work, however. If 'é' is precomposed, the whole thing works fine. However, if it's a combining sequence, it fails spectacularly. Array reversal doesn't take into consideration the ordered nature of Unicode string interpretation, and you wind up with combining marks on the wrong letter!
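To make that concrete, here's a minimal sketch of the failure and of a grapheme-aware alternative (the variable names and the List-based reversal are purely illustrative):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

string s = "cafe\u0301";   // "café" with 'e' + combining acute accent (U+0301)

// Naive char-array reversal detaches the combining mark from its base letter.
string naive = new string(s.Reverse().ToArray());   // U+0301 now comes first, orphaned

// Grapheme-aware reversal keeps each cluster intact.
var clusters = new List<string>();
TextElementEnumerator te = StringInfo.GetTextElementEnumerator(s);
while (te.MoveNext())
    clusters.Add(te.GetTextElement());
clusters.Reverse();
string reversed = string.Concat(clusters);           // "éfac" — the accent stays on its 'e'
```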
Edit Distance
Edit-distance algorithms are one of the easiest places to show the problem, because the incorrect results can be shown definitively and quantifiably. As a note, I'm merging string similarity and edit distance under the same name here, as they are both subject to the same issue.
While there are cases where you want edit-distance metrics at the byte, code unit, or scalar value level, oftentimes what people intend to measure is the grapheme-level changes between two pieces of text. This is especially the case with spell-check or plagiarism-detection applications.
Consider the Hamming distance between "café" and "cafe". It's obviously 1, for one substitution. Right? That's certainly the intention. However, if this isn't working off of graphemes, it's possible for the algorithm to return an error or throw an exception. The problem is combining marks. If 'é' is a "Latin lowercase e" plus a "combining acute accent", then the two strings have different lengths according to everything except graphemes, so the distance is undefined.
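A minimal sketch of a grapheme-aware Hamming distance (the Graphemes and HammingDistance helpers are hypothetical, and normalization stands in here for whatever equivalence rule you prefer):

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

static string[] Graphemes(string text)
{
    var clusters = new List<string>();
    TextElementEnumerator te = StringInfo.GetTextElementEnumerator(text);
    while (te.MoveNext())
        clusters.Add(te.GetTextElement());
    return clusters.ToArray();
}

static int HammingDistance(string a, string b)
{
    string[] ga = Graphemes(a), gb = Graphemes(b);
    if (ga.Length != gb.Length)
        throw new ArgumentException("Hamming distance requires inputs of equal grapheme length.");

    int distance = 0;
    for (int i = 0; i < ga.Length; i++)
    {
        // Normalizing each cluster lets precomposed and combining forms compare equal.
        if (ga[i].Normalize() != gb[i].Normalize())
            distance++;
    }
    return distance;
}

// HammingDistance("cafe\u0301", "cafe") == 1 — one substitution, however 'é' was encoded.
```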
Even with a "better" edit-distance algorithm like Damerau-Levenshtein, this problem still shows up. And, even worse, it's harder to detect! If 'é' is precomposed, it will correctly report 1 edit for the one substitution. But if 'é' is a combining sequence, it will still report 1, only for an incorrect insertion rather than the substitution! For certain words you actually begin to see inconsistent results.
You see additional problems related to this when you start to look at typographic ligatures for diphthongs. There isn't a single edit-distance algorithm I'm aware of that handles breaking of typographic ligatures. So while the French "bœef" and English "beef" would correctly be 1 edit apart (the French œ is a true ligature), the variants "phoenix" and "phœnix" should likewise be considered only one edit apart as far as English goes: a single substitution.
Parsing
This is the last one, I promise. My point will be thoroughly made after this.
Text parsing, whether done ad hoc or through string scanners, Regex, parser combinators ("parsecs"), or my own approach, is particularly vulnerable to these kinds of problems. Unless you're interested in the bytes or code units, which you're probably not if you're using these kinds of tools, you will encounter problems.
I'll explain this through how parser combinators work, but it applies to every single one of these approaches.
One of the easiest ways to understand parsing approaches is to go char by char and build up from there. Even if "shotgun" parsers are employed, this problem still shows up. When reading this way, if you're expecting an 'é' and you find an 'e', the parser would correctly fail, right? But what if the very next char is a combining acute accent? It should have succeeded after all. A good parser will handle this, but most don't, and working around it without a good framework is tedious and error-prone.
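As a rough illustration, here's a sketch of a grapheme-aware "expect" step for such a parser (TryExpect is a hypothetical helper; GetNextTextElement and Normalize are existing .NET APIs):

```csharp
using System;
using System.Globalization;

static bool TryExpect(string input, ref int position, string expectedGrapheme)
{
    if (position >= input.Length)
        return false;

    // Read the full grapheme cluster at the cursor, so 'e' + combining acute
    // is consumed as a single unit rather than as two separate chars.
    string actual = StringInfo.GetNextTextElement(input, position);

    // Compare normalized forms so precomposed and combining 'é' both match.
    if (!string.Equals(actual.Normalize(), expectedGrapheme.Normalize(), StringComparison.Ordinal))
        return false;

    position += actual.Length;
    return true;
}

// int pos = 3;
// TryExpect("cafe\u0301", ref pos, "\u00E9")   // true — matches the 'é' spelled as e + U+0301
```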
I've tested a number of these parsing libraries across numerous languages and constantly find the same errors when it comes to handling these issues.
For programming languages and DDLs this is rarely, if ever, an issue. But not all parsing is of computational languages.
Proposed Solution
My idea for solving this is to add a new type and supporting APIs. I've taken to calling this Glyph, although naming is not important and should be debated. The idea is that Glyph will hold the representative scalar values (Rune) or code units (Char), and transform them when a transformation of the entire Glyph is called for. Equality and comparisons for this type would not be referential nor structural, but rather based upon some other means; I believe a table of equivalence would work. This is not meant for high-performance scenarios, but rather for linguistic correctness. Parity with the Char and Rune APIs should be implemented and maintained, such that methods like ToUpper() or IsLetter() exist. Furthermore, a few new APIs should also be provided, such as a "proper" String.Reverse() which reverses the graphemes.
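To give a feel for the shape of the type, here is a rough, purely illustrative sketch; the storage, member names, and the use of normalization as a stand-in for the equivalence table are all assumptions, not a settled design:

```csharp
using System;
using System.Globalization;

public readonly struct Glyph : IEquatable<Glyph>
{
    // The code units making up one extended grapheme cluster.
    private readonly string _cluster;

    public Glyph(string cluster) => _cluster = cluster;

    // Equality is neither referential nor structural; canonical normalization
    // stands in here for the proposed equivalence-table lookup.
    public bool Equals(Glyph other) =>
        string.Equals(_cluster.Normalize(), other._cluster.Normalize(), StringComparison.Ordinal);

    public override bool Equals(object obj) => obj is Glyph g && Equals(g);
    public override int GetHashCode() => _cluster.Normalize().GetHashCode();

    // Parity with Char/Rune-style APIs.
    public Glyph ToUpper(CultureInfo culture) => new Glyph(_cluster.ToUpper(culture));
    public bool IsLetter() => char.IsLetter(_cluster, 0);

    public override string ToString() => _cluster;
}
```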
Formal API Changes
To be filled in
Pitfalls & Considerations
This isn't a simple problem to tackle, so there will be pitfalls that need to be addressed upfront. I highly doubt I've thought of everything. I'll do my best to edit in any other pitfalls people think of if this progresses.
The biggest one I've thought of is that of culture. As I alluded to earlier, there are two types of ligatures, and handling them differs depending on the culture in question. For a true linguistic ligature, the grapheme is considered its own unique phoneme, and replacement with its "decomposed" graphemes is an error. In other words, for French "oe" ≠ "œ". For typographic ligatures, however, these are equivalent. Going with the table approach I described, I believe this could be handled by having a large "invariant" table which is checked last, and smaller "culture" tables which are checked first. When "oe" or "œ" is considered for French, they would both have entries with no equivalence. When they are considered for English, since there are no cultural rules for that, it would fall back to the invariant table, where it would find the equivalence.
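A minimal sketch of that fallback order (the class, method name, and table contents below are hypothetical, purely to illustrate culture-first, invariant-last lookup):

```csharp
using System.Collections.Generic;
using System.Globalization;

static class GraphemeEquivalence
{
    // Invariant table: typographic ligatures are equivalent to their decompositions.
    private static readonly Dictionary<string, string> Invariant = new()
    {
        ["œ"] = "oe",
        ["ﬁ"] = "fi",
    };

    // Culture-specific tables are checked first; an entry mapping to itself means
    // "this is a true linguistic ligature here — do not decompose it."
    private static readonly Dictionary<string, Dictionary<string, string>> PerCulture = new()
    {
        ["fr"] = new() { ["œ"] = "œ" },
    };

    public static string GetEquivalent(string grapheme, CultureInfo culture)
    {
        if (PerCulture.TryGetValue(culture.TwoLetterISOLanguageName, out var table)
            && table.TryGetValue(grapheme, out string cultural))
            return cultural;                       // culture rule wins ("œ" stays "œ" in French)

        return Invariant.TryGetValue(grapheme, out string fallback)
            ? fallback                             // invariant rule ("œ" ≡ "oe" in English)
            : grapheme;                            // no rule: the grapheme is its own equivalent
    }
}
```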
Remarks
In the fifteen years I've been programming, I've focused more on text and language than anything else. When it comes down to factors like performance, ease of use, and Unicode support, .NET has stood out to me as a leader. While it's not flawless, I think it could dominate that space.