NightOwl888 / ICU4N

International Components for Unicode for .NET
Apache License 2.0

Create extension methods for common BreakIterator operations #22

Open NightOwl888 opened 5 years ago

NightOwl888 commented 5 years ago

While BreakIterator provides great low-level functionality for iterating forward and backward through breaks, it would be helpful to have a simple way to do forward-only operations on string, StringBuilder, and char[].

IEnumerable<int> wordBreaks = theString.ToWordBreaks();
foreach (var wordBreak in wordBreaks) // "break" is a reserved keyword in C#, so it cannot be the variable name
{
    // consume
}

Or

IEnumerable<int> sentenceBreaks = theString.ToSentenceBreaks(new CultureInfo("th"));
foreach (var sentenceBreak in sentenceBreaks) // "break" is a reserved keyword in C#, so it cannot be the variable name
{
    // consume
}

Ideally, we would create a separate extension method (with an overload that accepts an optional culture) for each of the 4 modes:

  1. Word
  2. Sentence
  3. Line
  4. Character
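One possible shape for such a method is a lazy iterator block that creates its own iterator instance for each enumeration. The following is only a minimal sketch under stated assumptions: `IBoundaryIterator`, `Boundary.Done`, `WhitespaceBoundaryIterator`, and `ToBreaks` are hypothetical names invented for illustration, standing in for the `SetText`/`Next`/`Done` style of BreakIterator so that the sample runs without ICU4N. The toy iterator splits on whitespace changes only and is not a substitute for real Unicode segmentation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the subset of BreakIterator used here (not ICU4N's actual type).
public interface IBoundaryIterator
{
    void SetText(string text);
    int Next(); // returns the next boundary offset, or Boundary.Done when exhausted
}

public static class Boundary
{
    public const int Done = -1; // assumed sentinel value for this sketch
}

// Toy implementation: reports a boundary wherever whitespace/non-whitespace changes.
public sealed class WhitespaceBoundaryIterator : IBoundaryIterator
{
    private string text = string.Empty;
    private int position;

    public void SetText(string text)
    {
        this.text = text ?? throw new ArgumentNullException(nameof(text));
        position = 0;
    }

    public int Next()
    {
        if (position >= text.Length) return Boundary.Done;
        bool inSpace = char.IsWhiteSpace(text[position]);
        do { position++; }
        while (position < text.Length && char.IsWhiteSpace(text[position]) == inSpace);
        return position;
    }
}

public static class BreakExtensions
{
    // Hypothetical extension method: validates eagerly, then enumerates lazily.
    public static IEnumerable<int> ToBreaks(this string text, Func<IBoundaryIterator> factory)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (factory == null) throw new ArgumentNullException(nameof(factory));
        return Iterate(text, factory);
    }

    private static IEnumerable<int> Iterate(string text, Func<IBoundaryIterator> factory)
    {
        IBoundaryIterator iterator = factory(); // fresh instance per enumeration
        iterator.SetText(text);
        yield return 0; // the start of the text is the first boundary
        for (int b = iterator.Next(); b != Boundary.Done; b = iterator.Next())
        {
            yield return b;
        }
    }
}
```

With this sketch, `"ab  cd".ToBreaks(() => new WhitespaceBoundaryIterator())` yields 0, 2, 4, 6. Splitting argument validation from the iterator block means a null argument throws at the call site rather than on the first `MoveNext()`.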

We could then expand on this to do a higher level operation, such as providing an IEnumerable<string> that would tokenize the text so it can be iterated with a foreach loop.

foreach (var word in theText.ToWords(new CultureInfo("th-TH")))
{
    // consume each word
}
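The tokenizing step itself could be built on top of any sequence of boundary offsets. As a rough sketch, assuming the boundaries arrive in ascending order and include the start and end of the text, a hypothetical `ToSegments` helper (not part of ICU4N) might slice the text like this:

```csharp
using System;
using System.Collections.Generic;

public static class TokenExtensions
{
    // Hypothetical helper: returns the text between each pair of consecutive boundaries.
    public static IEnumerable<string> ToSegments(this string text, IEnumerable<int> boundaries)
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (boundaries == null) throw new ArgumentNullException(nameof(boundaries));
        return Slice(text, boundaries);
    }

    private static IEnumerable<string> Slice(string text, IEnumerable<int> boundaries)
    {
        int start = -1; // no boundary seen yet
        foreach (int end in boundaries)
        {
            // Skip the very first boundary and any empty spans.
            if (start >= 0 && end > start)
            {
                yield return text.Substring(start, end - start);
            }
            start = end;
        }
    }
}
```

For example, `"Hello, world".ToSegments(new[] { 0, 5, 6, 7, 12 })` (hand-written offsets) yields "Hello", ",", " ", and "world". Note that punctuation and whitespace come out as segments too, so a `ToWords` method would still need a rule for which segments to keep.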

Some thought needs to be given to thread safety, since BreakIterator requires a separate clone for each thread.
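One option to consider is keeping a single configured prototype and handing each thread its own clone. The sketch below assumes nothing about ICU4N beyond the iterator being stateful and cloneable: `StatefulIterator` and `PerThreadIterator` are hypothetical names, with `StatefulIterator` standing in for BreakIterator.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a stateful, cloneable iterator such as BreakIterator.
public sealed class StatefulIterator : ICloneable
{
    public string Text { get; private set; } = string.Empty;

    public void SetText(string text) => Text = text;

    public object Clone() => MemberwiseClone();
}

public static class PerThreadIterator
{
    // Configured once; only ever cloned, never mutated or iterated directly.
    private static readonly StatefulIterator Prototype = new StatefulIterator();

    // Each thread lazily receives its own clone of the prototype.
    private static readonly ThreadLocal<StatefulIterator> Local =
        new ThreadLocal<StatefulIterator>(() => (StatefulIterator)Prototype.Clone());

    public static StatefulIterator Current => Local.Value;
}
```

A per-thread instance is not enough on its own for lazy `IEnumerable<int>` results: two enumerations that are interleaved on the same thread would share one iterator and overwrite each other's text. Cloning the prototype once per enumeration avoids that, at the cost of an extra allocation for each call.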

NightOwl888 commented 4 years ago

After an attempt at this, it turned out to be more complicated than first envisioned, because the definition of what qualifies as a "word" can vary. The approach needs to be rethought.