NightOwl888 / ICU4N

International Components for Unicode for .NET
Apache License 2.0

Convert CharacterIterator into ICharacterEnumerator and move to J2N #23

Open NightOwl888 opened 4 years ago

NightOwl888 commented 4 years ago

CharacterIterator and the classes that depend on it are the only classes in ICU4N.Support that are still marked public. Ideally, when ICU4N is released, there will be no public-facing ICU4N.Support namespace.

The CharacterIterator class should be converted into an ICharacterEnumerator interface and moved to J2N, and all implementations converted also.

This has been attempted and is partially complete, but because CharacterIterator uses post-increment behavior, it doesn't work very well. It was fairly easy to get J2N.Text.StringCharacterEnumerator working with ICU4N, but not with Lucene.NET.

The approach taken was to make a wrapper class so that an ICharacterEnumerator could be passed in and, behind the scenes, wrapped by a class that implements CharacterIterator (which was made internal). This didn't work in the opposite direction, when converting the rest of the CharacterIterator classes into ICharacterEnumerator instances that can be passed to implementations of the BreakIterator abstract class. I am sure it is possible, but more effort is required to work out how to make it behave correctly. Iterators return a value from each call, whereas enumerators return true/false and the value must then be read from a property; in the case of CharacterIterator, that property must somehow be readable before the call to MoveNext() or MovePrevious().
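To make the mismatch concrete, here is a minimal sketch of the two styles reading the same string. `SimpleCharIterator`, `SimpleCharEnumerator`, and `Demo` are hypothetical names invented for illustration; they are not the actual ICU4N or J2N types, and the real CharacterIterator has more members than shown here.

```csharp
using System;
using System.Text;

// Hypothetical iterator-style reader, loosely modeled on CharacterIterator.
// It is positioned on a valid character as soon as it is constructed.
public sealed class SimpleCharIterator
{
    public const char Done = '\uFFFF'; // sentinel returned once past the end
    private readonly string text;
    private int index; // starts at 0, so Current is readable immediately

    public SimpleCharIterator(string text)
    {
        this.text = text ?? throw new ArgumentNullException(nameof(text));
    }

    public char Current => index < text.Length ? text[index] : Done;

    public char First() { index = 0; return Current; }

    // Advances one position and returns the character found there.
    public char Next()
    {
        if (index < text.Length) index++;
        return Current;
    }
}

// Hypothetical enumerator-style reader.
// It is positioned before the first character until MoveNext() is called.
public sealed class SimpleCharEnumerator
{
    private readonly string text;
    private int index = -1;

    public SimpleCharEnumerator(string text)
    {
        this.text = text ?? throw new ArgumentNullException(nameof(text));
    }

    // Only valid after MoveNext() has returned true.
    public char Current => index >= 0 && index < text.Length
        ? text[index]
        : throw new InvalidOperationException("Enumerator is not positioned on a character.");

    public bool MoveNext()
    {
        if (index < text.Length) index++;
        return index < text.Length;
    }
}

public static class Demo
{
    // Iterator style: each call hands back the value, and a sentinel ends the loop.
    public static string ReadWithIterator(string s)
    {
        var it = new SimpleCharIterator(s);
        var sb = new StringBuilder();
        for (char c = it.First(); c != SimpleCharIterator.Done; c = it.Next())
            sb.Append(c);
        return sb.ToString();
    }

    // Enumerator style: MoveNext() reports success, then Current is read.
    public static string ReadWithEnumerator(string s)
    {
        var e = new SimpleCharEnumerator(s);
        var sb = new StringBuilder();
        while (e.MoveNext())
            sb.Append(e.Current);
        return sb.ToString();
    }
}
```

Both loops visit the same characters, but a freshly constructed `SimpleCharIterator` already exposes `Current`, while a freshly constructed `SimpleCharEnumerator` throws if `Current` is read before `MoveNext()`. Any adapter between the two styles has to bridge that difference in starting position.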

NightOwl888 commented 1 month ago

The current thinking here is not to move this to J2N, but rather to find a way to factor it out. Wrapping a character sequence in a class is the wrong approach in .NET. ReadOnlySpan<char> is a ref struct and may not be used as a class field, so without changing the design our only option is ReadOnlyMemory<char>. The issue with using it is that a ReadOnlySpan<char> cannot be converted to a ReadOnlyMemory<char>: even if the memory the span points to is on the heap, it still must be copied to an array before it can be wrapped.
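A small sketch of that constraint, assuming a hypothetical heap-allocated wrapper named `MemoryTextReader` (not an ICU4N type). The class can hold a `ReadOnlyMemory<char>` field, but a caller that only has a `ReadOnlySpan<char>` has to copy it first:

```csharp
using System;

// Hypothetical class-based wrapper over a character sequence.
public sealed class MemoryTextReader
{
    // Allowed: ReadOnlyMemory<char> is an ordinary struct and can live on the heap.
    private readonly ReadOnlyMemory<char> text;

    // Not allowed: a ReadOnlySpan<char> field in a class is a compile-time error,
    // because ReadOnlySpan<T> is a ref struct.
    // private readonly ReadOnlySpan<char> span;

    public MemoryTextReader(ReadOnlyMemory<char> text)
    {
        this.text = text;
    }

    public int Length => text.Length;

    public char this[int index] => text.Span[index];
}

public static class Demo
{
    // A string can be wrapped without copying.
    public static char FirstChar(string s)
    {
        return new MemoryTextReader(s.AsMemory())[0];
    }

    // A span cannot be turned into ReadOnlyMemory<char>, so it is copied
    // into a new array (one allocation plus a copy) before being wrapped.
    public static char FirstCharFromSpan(ReadOnlySpan<char> span)
    {
        char[] copy = span.ToArray();
        return new MemoryTextReader(copy)[0];
    }
}
```

The copy in `FirstCharFromSpan` happens regardless of where the span's memory actually lives, which is the cost described above.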

We have already started phasing out some of the CharacterIterator subclasses, but some components (such as BreakIterator) either need to be duplicated for use on the stack or will need significant design changes. If the break rules can be separated out into a pluggable heap object, it can be provided to the constructor of a ref struct. Alternatively, this break rules object could be passed into a set of static methods that retrieve the next position, previous position, etc. based on an input position, text, and set of rules.
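As a rough sketch of both alternatives, assume a hypothetical rules object `WhitespaceBreakRules` that treats every transition between whitespace and non-whitespace as a boundary. Real BreakIterator rules are far more involved; `BreakFinder` and `SpanBreakEnumerator` are likewise invented names, not a proposed API.

```csharp
using System;

// Hypothetical pluggable rules object that lives on the heap.
public sealed class WhitespaceBreakRules
{
    public bool IsBoundary(ReadOnlySpan<char> text, int position)
    {
        // The start and end of the text always count as boundaries.
        if (position <= 0 || position >= text.Length) return true;
        return char.IsWhiteSpace(text[position - 1]) != char.IsWhiteSpace(text[position]);
    }
}

// Static-method variant: no iterator state, just rules + text + position.
public static class BreakFinder
{
    public const int Done = -1;

    // Returns the first boundary after 'position', or Done if there is none.
    public static int Following(WhitespaceBreakRules rules, ReadOnlySpan<char> text, int position)
    {
        for (int i = position + 1; i <= text.Length; i++)
            if (rules.IsBoundary(text, i)) return i;
        return Done;
    }

    // Returns the last boundary before 'position', or Done if there is none.
    public static int Preceding(WhitespaceBreakRules rules, ReadOnlySpan<char> text, int position)
    {
        for (int i = position - 1; i >= 0; i--)
            if (rules.IsBoundary(text, i)) return i;
        return Done;
    }
}

// Ref struct variant: the rules object is supplied to the constructor,
// and the text stays a span, so nothing is copied.
public ref struct SpanBreakEnumerator
{
    private readonly WhitespaceBreakRules rules;
    private readonly ReadOnlySpan<char> text;
    private int position;

    public SpanBreakEnumerator(WhitespaceBreakRules rules, ReadOnlySpan<char> text)
    {
        this.rules = rules;
        this.text = text;
        this.position = 0;
    }

    // The most recently found boundary (0 before the first MoveNext()).
    public int Current => position;

    public bool MoveNext()
    {
        int next = BreakFinder.Following(rules, text, position);
        if (next == BreakFinder.Done) return false;
        position = next;
        return true;
    }
}
```

In this sketch the ref struct is only a thin cursor over the static methods, so the two options are not mutually exclusive: the stateless methods could carry the logic, with the ref struct offered as a convenience for `while (e.MoveNext())` loops.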

Food for thought.