Closed RLovelett closed 6 years ago
I gave up on the `LineBreakCursor` idea and have since switched to the `RuleBasedBreakCursor`. I think I have a working solution, though I am interested in soliciting your feedback if you would be so kind as to indulge me.
```swift
/// - SeeAlso: https://stackoverflow.com/a/20056634/247730
// The `{100}` suffix tags a match of this rule with rule status 100,
// which `next()` below checks via `cursor.ruleStatus`.
private let lineBreakRules = """
(\\r\\n|\\r|\\n){100};
"""

private struct Line {
    let number: Int
    let start: String.Index
    let end: String.Index
}

private struct LineIterator: IteratorProtocol {
    private let cursor = try! RuleBasedBreakCursor(rules: lineBreakRules)
    private let end: String.Index
    private var previousLineNumber: Int

    init(_ text: String) {
        cursor.text = text
        end = text.endIndex
        previousLineNumber = 0
    }

    mutating func next() -> Line? {
        // Stop once the cursor has reached the end of the text.
        guard let start = cursor.index, start != end else {
            return nil
        }
        // Advance until a boundary produced by the line-break rule is found
        // or the cursor runs out of boundaries.
        var previous: String.Index?
        repeat {
            previous = cursor.next()
        } while previous != nil && cursor.ruleStatus != 100
        guard let end = cursor.index else {
            return nil
        }
        defer { previousLineNumber += 1 }
        return Line(number: previousLineNumber, start: start, end: end)
    }
}
```
I don't fully understand the `RuleBasedBreakCursor` rules, or how those work. I copied some of the rules in the test suite and, with some trial and error, arrived at the above, though I'm not sure this is the optimal solution.
How is `moveToIndex(following:)` supposed to work? Based on the first line in the documentation, "Returns the first index greater than `index` at which a boundary occurs," I assumed that, given my break rules above and the index of the first character after a line break, it would read to the next line break index. This does not seem to work. I assume my understanding of the implementation is flawed, so could you correct my expectations?
`LineBreakCursor` (and indeed, the ICU cursors/iterators in general) is intended more for linguistic breaking than for problems like the one you described in your initial post. What `LineBreakCursor` does is let you iterate over a string and find out "does the language allow me to place a break here?" (a soft break) or "does the language require a break here?" (a hard break). If you look at the unit tests, they show hard breaks wherever you have a physical newline character, and soft breaks where other linguistic breaks are possible, such as at a space or a dash.
To answer the other questions:
There's some documentation on the syntax for writing custom rule-based iterators here, but it's... not great and kind of hard to parse. I haven't seen all that many useful examples of it out in the real world either, honestly. I just ported the API because I'm a stickler for thoroughness.
Is there a reason you want to use one of the ICU iterators here? For the problem you're describing, I don't think they would provide a better experience than just scanning the string yourself looking for breaks. `LineBreakCursor` is intended to support linguistic breaks, so it would likely be slower than just looking for newlines directly, and `RuleBasedBreakCursor` has some odd edges that I haven't quite figured out (when writing my tests, I would write some rules that failed to be detected for reasons that weren't entirely clear to me).
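For comparison, scanning the string directly might look something like the sketch below. It is not part of the library: `ScannedLine` and `scanLines(in:)` are hypothetical names used only for illustration, and the sketch assumes that only `\r\n`, `\r`, and `\n` count as line terminators.

```swift
/// A line as a half-open range of indices into the original string.
/// The range includes the trailing line terminator, if there is one.
/// (Hypothetical type, for illustration only.)
struct ScannedLine {
    let number: Int
    let range: Range<String.Index>
}

/// Splits `text` into lines by walking its Unicode scalars and looking
/// for "\r\n", "\r", or "\n". No ICU and no UTF-16 copy is involved.
func scanLines(in text: String) -> [ScannedLine] {
    var lines: [ScannedLine] = []
    let scalars = text.unicodeScalars
    var lineStart = scalars.startIndex
    var index = scalars.startIndex

    while index != scalars.endIndex {
        let scalar = scalars[index]
        var next = scalars.index(after: index)
        if scalar == "\r" || scalar == "\n" {
            // Treat "\r\n" as a single terminator.
            if scalar == "\r", next != scalars.endIndex, scalars[next] == "\n" {
                next = scalars.index(after: next)
            }
            lines.append(ScannedLine(number: lines.count, range: lineStart..<next))
            lineStart = next
        }
        index = next
    }

    // Emit the final line when the text does not end with a terminator.
    if lineStart != scalars.endIndex {
        lines.append(ScannedLine(number: lines.count, range: lineStart..<scalars.endIndex))
    }
    return lines
}
```

The scan works on `unicodeScalars` rather than on `Character`s because Swift groups `\r\n` into a single `Character`; walking scalars keeps the terminator handling explicit.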
Nothing immediately comes to mind, but if you could post a specific example text snippet and what the method returns that's unexpected, I might have some insight.
Speaking of performance, another thing to be aware of is that these cursors aren't as optimal as they could be due to limitations in Swift. ICU takes pointers to UTF-16 as input to these methods, but Swift doesn't provide a way to directly access the underlying UTF-16 of a `String` (because it may not always be represented that way internally). So these classes have to create a copy of the UTF-16 view as a contiguous buffer and keep that around. If all you want to do is find line breaks, you don't need to pay that penalty.
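To make that cost concrete, here is a rough sketch of what copying the UTF-16 view into a contiguous buffer can look like. This is an assumption about the general approach, not the library's actual code, and `withContiguousUTF16` is a hypothetical helper.

```swift
/// Copies the UTF-16 code units of `text` into a contiguous array and hands
/// the resulting buffer to `body`. The buffer's base address has the shape a
/// C API expecting a pointer to UTF-16 would take.
/// (Hypothetical helper, for illustration only.)
func withContiguousUTF16<R>(
    _ text: String,
    _ body: (UnsafeBufferPointer<UInt16>) throws -> R
) rethrows -> R {
    // This allocation and copy is the penalty: it is proportional to the
    // length of the text and happens before any boundary analysis starts.
    let copy = Array(text.utf16)
    return try copy.withUnsafeBufferPointer(body)
}
```

The pointer is only valid inside the closure, which is why a class that wants to keep handing the same text to a C API has to hold on to its own copy for as long as the text is in use.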
> Is there a reason you want to use one of the ICU iterators here?
Mostly ICU was a shiny new toy that I recently discovered. So now I'm just playing with ideas of things that would need it.
Another idea I had was for writing a toy HTML parser in pure Swift. Mostly I'm looking for fun projects that can get me working with Swift Strings since I think they'll be interesting ways to learn about the sharp edges of the language.
Thank you for the feedback.
I was wondering if there were examples on how to use the `LineBreakCursor` properly?
Also, I was wondering if you could tell me if my use case is a good use of the `LineBreakCursor`.
Basically, I'm trying to convert a position (line number and character) into a byte position in the `String`.
It seems that using `LineBreakCursor` should work, though I cannot seem to figure out how to do it right.
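For reference, one way that conversion could be sketched without any ICU cursor is shown here. It rests on several assumptions that the question does not pin down: lines and columns are zero-based, a column counts Swift `Character`s (grapheme clusters), and "byte position" means an offset into the UTF-8 encoding. `utf8Offset(ofLine:column:in:)` is a hypothetical name.

```swift
/// Returns the UTF-8 byte offset of a zero-based (line, column) position,
/// or nil if the position lies outside the text.
/// (Hypothetical function, for illustration only.)
func utf8Offset(ofLine line: Int, column: Int, in text: String) -> Int? {
    guard line >= 0, column >= 0 else { return nil }

    // Walk forward one Character at a time until the requested line starts.
    // Swift treats "\r\n" as a single Character, and `isNewline` is true for
    // "\n", "\r", "\r\n", and a few other Unicode line separators.
    var index = text.startIndex
    var currentLine = 0
    while currentLine < line {
        guard index != text.endIndex else { return nil }
        let isNewline = text[index].isNewline
        index = text.index(after: index)
        if isNewline { currentLine += 1 }
    }

    // Advance `column` Characters within the line, refusing to step over the
    // line terminator or past the end of the text.
    for _ in 0..<column {
        guard index != text.endIndex, !text[index].isNewline else { return nil }
        index = text.index(after: index)
    }

    // String.Index is shared across views, so the UTF-8 view can measure it.
    return text.utf8.distance(from: text.utf8.startIndex, to: index)
}
```

If the position comes from a protocol that counts UTF-16 code units rather than grapheme clusters, the column loop would need to walk `text.utf16` instead.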