allevato / icu-swift

Swift APIs for ICU
Apache License 2.0

Examples on using LineBreakCursor #10

Closed RLovelett closed 6 years ago

RLovelett commented 6 years ago

I was wondering if there were examples on how to use the LineBreakCursor properly?

Also, I was wondering if you could tell me if my use case is a good use of the LineBreakCursor.

Basically, I'm trying to convert a position (line number and character) into a byte position in the String.

It seems that LineBreakCursor should work for this, though I cannot figure out how to use it correctly.

RLovelett commented 6 years ago

I gave up on the LineBreakCursor idea and have since switched to the RuleBasedBreakCursor. I think I have a working solution, though I would appreciate your feedback if you would be so kind as to indulge me.

Code

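/// A single break rule: "\r\n", "\r", or "\n" ends a line, and `{100}` tags
/// those boundaries with rule status 100 so they can be recognized later.
/// - SeeAlso: https://stackoverflow.com/a/20056634/247730
private let lineBreakRules = """
(\\r\\n|\\r|\\n){100};
"""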

private struct Line {
    let number: Int
    let start: String.Index
    let end: String.Index
}

private struct LineIterator: IteratorProtocol {

    private let cursor = try! RuleBasedBreakCursor(rules: lineBreakRules)

    private let end: String.Index

    private var previousLineNumber: Int

    init(_ text: String) {
        cursor.text = text
        end = text.endIndex
        previousLineNumber = 0
    }

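    mutating func next() -> Line? {
        // Stop once the cursor has reached the end of the text.
        guard let start = cursor.index, start != end else {
            return nil
        }
        // Advance until a boundary tagged with status 100 (a line break rule match)
        // or until the cursor runs out of boundaries.
        var previous: String.Index?
        repeat {
            previous = cursor.next()
        } while previous != nil && cursor.ruleStatus != 100
        // Named `lineEnd` so it does not shadow the `end` property.
        guard let lineEnd = cursor.index else {
            return nil
        }
        defer { previousLineNumber += 1 }
        return Line(number: previousLineNumber, start: start, end: lineEnd)
    }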

}

Questions

  1. I cannot find documentation on the syntax of the RuleBasedBreakCursor rules or on how they work. I copied some of the rules from the test suite and, with some trial and error, arrived at the above, though I'm not sure it is the optimal solution.
  2. I am trying to lazily parse the line breaks in a source document. Given the code above, do you see any obvious failures? Is there something I could do better?
  3. How is moveToIndex(following:) supposed to work? Based on the first line of the documentation, "Returns the first index greater than index at which a boundary occurs.", I assumed that, given my break rules above and the index of the first character after a line break, it would advance to the index of the next line break. This does not seem to work. I assume my understanding of the implementation is flawed, so could you correct my expectations?

allevato commented 6 years ago

LineBreakCursor (and indeed, the ICU cursors/iterators in general) is intended more for linguistic breaking than for problems like the one you described in your initial post. What LineBreakCursor does is let you iterate over a string and find out "does the language allow me to place a break here?" (a soft break) or "does the language require a break here?" (a hard break). The unit tests show hard breaks wherever there is a physical newline character, and soft breaks where other linguistic breaks are possible, such as on a space or a dash.

To answer the other questions:

  1. There's some documentation on the syntax for writing custom rule-based iterators here, but it's... not great and kind of hard to parse. I haven't seen all that many useful examples of it out in the real world either, honestly. I just ported the API because I'm a stickler for thoroughness.

  2. Is there a reason you want to use one of the ICU iterators here? For the problem you're describing, I don't think they would provide a better experience than just scanning the string yourself looking for breaks. LineBreakCursor is intended to support linguistic breaks so it would likely be slower than just looking for newlines directly, and RuleBasedBreakCursor has some odd edges that I haven't quite figured out (when writing my tests, I would write some rules that failed to be detected for reasons that weren't entirely clear to me).

  3. Nothing immediately comes to mind, but if you could post a specific example text snippet and what the method returns that's unexpected, I might have some insight.
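
As a rough illustration of the "just scan the string yourself" approach from point 2, here is a minimal sketch in plain Swift with no ICU involved. The function names lineStartOffsets(in:) and byteOffset(line:character:in:) are hypothetical and not part of icu-swift. It assumes that offsets are measured in UTF-8 bytes, that `character` counts Swift Characters from the start of the line, and that "\r\n", "\r", and "\n" each end a line. It does not check that `character` stays within the requested line.

```swift
/// Returns the UTF-8 byte offset at which each line of `text` starts.
/// "\r\n", "\r", and "\n" each count as one line terminator.
func lineStartOffsets(in text: String) -> [Int] {
    var starts = [0]
    var previousWasCR = false
    for (offset, byte) in text.utf8.enumerated() {
        if byte == 0x0A {                       // "\n"
            if previousWasCR {
                // Second half of "\r\n": move the start recorded for the "\r".
                starts[starts.count - 1] = offset + 1
            } else {
                starts.append(offset + 1)
            }
        } else if byte == 0x0D {                // "\r"
            starts.append(offset + 1)
        }
        previousWasCR = (byte == 0x0D)
    }
    return starts
}

/// Converts a zero-based (line, character) position into a UTF-8 byte offset.
/// Returns nil if the line does not exist or the position runs past the end of the text.
func byteOffset(line: Int, character: Int, in text: String) -> Int? {
    let starts = lineStartOffsets(in: text)
    guard line >= 0, line < starts.count, character >= 0 else { return nil }
    let utf8 = text.utf8
    let lineStart = utf8.index(utf8.startIndex, offsetBy: starts[line])
    // Walk forward by Characters (assumption: `character` is a Character count).
    guard let target = text.index(lineStart, offsetBy: character, limitedBy: text.endIndex) else {
        return nil
    }
    return utf8.distance(from: utf8.startIndex, to: target)
}

let sample = "ab\r\ncd\u{E9}\nf"
print(lineStartOffsets(in: sample))                              // [0, 4, 9]
print(byteOffset(line: 1, character: 2, in: sample) as Any)      // Optional(6)
```

This version recomputes the line starts on every call. A caller doing many lookups would probably compute them once and reuse the array.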

Speaking of performance, another thing to be aware of is that these cursors aren't as optimal as they could be due to limitations in Swift. ICU takes pointers to UTF-16 as input to these methods, but Swift doesn't provide a way to directly access the underlying UTF-16 of a String (because it may not always be represented that way internally). So these classes have to create a copy of the UTF-16 view as a contiguous buffer and keep that around. If all you want to do is find line breaks, you don't need to pay that penalty.
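
To make the copying cost concrete, here is a small sketch of the general pattern of handing a Swift String to a C API that expects UTF-16. It is not the library's actual code, only an illustration of why a copy is involved. The variable names are hypothetical.

```swift
let text = "line one\nline two"

// Copy the UTF-16 code units into contiguous storage. This is an O(n) copy
// that is made regardless of how the String is stored internally.
let utf16Buffer = Array(text.utf16)

utf16Buffer.withUnsafeBufferPointer { buffer in
    // `buffer.baseAddress` is an UnsafePointer<UInt16>?, the kind of pointer
    // a C function taking UTF-16 input would receive. The pointer is only
    // valid inside this closure, so a wrapper that needs it for longer has to
    // keep its own storage alive for as long as the C side uses it.
    print(buffer.count)
}
```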

RLovelett commented 6 years ago

"Is there a reason you want to use one of the ICU iterators here?"

Mostly ICU was a shiny new toy that I recently discovered. So now I'm just playing with ideas of things that would need it.

Another idea I had was for writing a toy HTML parser in pure Swift. Mostly I'm looking for fun projects that can get me working with Swift Strings since I think they'll be interesting ways to learn about the sharp edges of the language.

Thank you for the feedback.