Closed: lijunsong closed this issue 5 months ago.
Great question! You've prompted me to write some documentation to try to help cover this. The values in those TSRange objects are in bytes, not characters. They must be translated back to UTF-16 if you want NSRange-compatible values, which is the default. I also think there may be a problem with your InputEdit byte calculations...
Does any of this help?
https://github.com/chimeHQ/SwiftTreeSitter?tab=readme-ov-file#range-translation
This might be the right direction to explain the "bogus" range, but could you help me understand a bit more here?

First, what does this example have to do with UTF-16? All strings in this example are encoded as UTF-8 in Swift.
I printed some values:

```swift
print("let v = 1".lengthOfBytes(using: .utf8))  // 9
print("let v = 1".lengthOfBytes(using: .utf16)) // 18
```
Second, my understanding of the byte calculation is that a UTF-8 string has the same byte ranges as plain ASCII when it contains only ASCII characters (ASCII is a subset of UTF-8, not a different encoding, so the ranges and indexing are exactly the same).
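For contrast, here is a small check (pure Foundation, no tree-sitter involved) showing where the two encodings agree and diverge: UTF-8 matches plain ASCII byte offsets for ASCII-only text, while UTF-16 always uses at least two bytes per code unit.

```swift
import Foundation

// ASCII-only text: UTF-8 byte offsets match "plain byte" offsets exactly,
// but UTF-16 stores every code unit in 2 bytes.
let ascii = "let v = 1"
print(ascii.lengthOfBytes(using: .utf8))   // 9
print(ascii.lengthOfBytes(using: .utf16))  // 18

// Once non-ASCII characters appear, the encodings diverge differently:
// each CJK character here is 3 bytes in UTF-8 but only 2 bytes in UTF-16.
let mixed = "let 变量 = 1"
print(mixed.lengthOfBytes(using: .utf8))   // 14
print(mixed.lengthOfBytes(using: .utf16))  // 20
```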
But in your code:

```swift
public extension NSRange {
    var byteRange: Range<UInt32> {
        let lower = UInt32(location * 2)
        let upper = UInt32(NSMaxRange(self) * 2)
        return lower..<upper
    }
}

public extension Range where Element == UInt32 {
    var range: NSRange {
        let start = lowerBound / 2
        let end = upperBound / 2
        return NSRange(start..<end)
    }
}
```
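To make the convention concrete, here is a minimal standalone sketch of how that pair of conversions round-trips (the property names here are illustrative, and the `Range` extension is constrained on `Bound` so it compiles on its own): an NSRange measured in UTF-16 code units maps to a byte range at exactly twice the offsets, and back.

```swift
import Foundation

// Sketch of the UTF-16 convention: NSRange counts UTF-16 code units, each
// code unit is 2 bytes, so byte offsets are double the NSRange values.
extension NSRange {
    var utf16ByteRange: Range<UInt32> {
        UInt32(location * 2)..<UInt32(NSMaxRange(self) * 2)
    }
}

extension Range where Bound == UInt32 {
    var utf16NSRange: NSRange {
        NSRange(Int(lowerBound / 2)..<Int(upperBound / 2))
    }
}

let nsRange = NSRange(location: 4, length: 2) // e.g. "变量" in UTF-16 units
let bytes = nsRange.utf16ByteRange            // 8..<12
print(bytes, bytes.utf16NSRange)              // round-trips back to {4, 2}
```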
This seems to always assume the UTF-8 to ASCII conversion is a factor of two, which seems wrong to me.
> This seems to always assume the UTF-8 to ASCII conversion is a factor of two, which seems wrong to me.
Yes, very wrong if NSRange were a generalized index API! But it is assumed throughout this system that it is really an API for NSString, and that uses UTF-16 exclusively. If you want to operate on Swift strings natively with UTF-8 encoding, it should be possible, but it will require some manual work from you.
I use this almost entirely with the Apple text system infrastructure, which is built around NSString. Are you doing something different?
Interesting, but I am confused now:

- Should Parser.parse() take an NSString instead of a String, then?
- Does the changedRanges result use UTF-16?

I was trying this in UIKit, and UITextView's `var text: String!` is a String.
This library is not only for UTF-16. It can support UTF-8 if you want, but you'll have to do a lot more work. And it will make all of the NSRange-based APIs incorrect, because I've defined those to use UTF-16 to match Foundation.
You can work in UTF-8 with this call:
I've done all the internal encoding handling and bridging required to make Swift strings work as expected. This is also why the dumped tree seems right. But it will only be correct for strings where the UTF-8 and UTF-16 representations are the same.
And, just to make things more complex, UITextView's text value is an NSString, but it is transparently bridged to String in Swift.
I agree. A few things break here. For example, I expect this code to work:

```swift
let language = Language(language: tree_sitter_swift())
let parser = Parser()
try! parser.setLanguage(language)

let src = "let 变量 = 1"
let tree = parser.parse(tree: nil as Tree?, encoding: TSInputEncodingUTF8, readBlock: { i, p -> Data? in
    let idx = src.index(src.startIndex, offsetBy: i)
    if idx == src.endIndex {
        return nil
    }

    let data = src[idx...idx].data(using: .utf8)
    print("src parse request position \(i): \(data)")
    return data
})

dumpTree(node: tree?.rootNode, depth: 1)
```
the output shows:

```
src parse request position 0: Optional(1 bytes)
src parse request position 1: Optional(1 bytes)
src parse request position 2: Optional(1 bytes)
src parse request position 3: Optional(1 bytes)
src parse request position 4: Optional(3 bytes)
src parse request position 3: Optional(1 bytes)
src parse request position 4: Optional(3 bytes)
src parse request position 7: Optional(1 bytes)
src parse request position 8: Optional(1 bytes)
src parse request position 9: Optional(1 bytes)
src parse request position 8: Optional(1 bytes)
src parse request position 9: Optional(1 bytes)
|- <property_declaration range: {0, 5} childCount: 4>
|- <value_binding_pattern range: {0, 1} childCount: 1>
|- <let range: {0, 1} childCount: 0>
|- <pattern range: {2, 1} childCount: 1>
|- <simple_identifier range: {2, 1} childCount: 0>
|- <= range: {3, 1} childCount: 0>
|- <integer_literal range: {4, 1} childCount: 0>
```
where the ranges in the tree look wrong this time.
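One way to see where those numbers may come from: if the printed NSRange values are derived by halving raw byte offsets under the UTF-16 convention, then halving plausible UTF-8 byte ranges for this source reproduces the dump. This is an illustrative reconstruction under that assumption, not the library's actual code.

```swift
import Foundation

// Halve a raw byte range the way the UTF-16 convention does
// (start = lower/2, end = upper/2, integer division).
func halvedNSRange(_ bytes: Range<UInt32>) -> NSRange {
    let start = Int(bytes.lowerBound / 2)
    let end = Int(bytes.upperBound / 2)
    return NSRange(start..<end)
}

// Hypothetical UTF-8 byte ranges for "let 变量 = 1" as the readBlock fed it:
print(halvedNSRange(0..<10)) // {0, 5}  property_declaration
print(halvedNSRange(4..<7))  // {2, 1}  simple_identifier ("变" is 3 bytes)
print(halvedNSRange(7..<8))  // {3, 1}  "="
print(halvedNSRange(9..<10)) // {4, 1}  integer_literal
```

Note also that the readBlock above passes the requested byte offset `i` to `src.index(src.startIndex, offsetBy:)`, which counts characters, so the two offset spaces drift apart as soon as the multi-byte "变量" appears.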
All I need is a direct binding to ts_parser_parse without any assumptions:

```c
TSTree *ts_parser_parse(
  TSParser *self,
  const TSTree *old_tree,
  TSInput input
);
```
I'll take a look at how I can do it myself.
Feel free to close this issue. Appreciate the discussion!
While we're still on it, I have another question about the byte offsets this library provides: let's assume everything we've discussed works out fine. How do I convert the changedRanges result (which provides byte offsets) back to offsets in a Swift String/NSString? I can't find an efficient way to do it.
> All I need is a direct binding to ts_parser_parse without any assumptions
I haven't done any work with UTF-8 encoding, because it isn't natively supported by Apple's text systems. But I would like to make it possible.
However, the node printing code does assume UTF-16 (it uses NSRange). So dumpTree is giving you incorrect values here. There's no way to change this today, but I think some changes to the Node type could make it possible.
> Let's assume everything we've discussed works out fine. How do I convert the changedRanges result (which provides byte offsets) back to offsets in a Swift String/NSString? I can't find an efficient way to do it.
Converting Range<UInt32> byte ranges back to a UTF-16 NSRange can be done with the .range extension. Converting UTF-8 data is much more complex, which is another reason I didn't add built-in support for it.
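For the UTF-8 direction, one possible approach (the helper name here is illustrative, not a library API) is to walk the string's utf8 view and convert the resulting index back to a String.Index; samePosition(in:) returns nil if a byte offset lands inside a multi-byte scalar.

```swift
import Foundation

// Sketch: map a UTF-8 byte range from tree-sitter back to a Swift String
// range via the utf8 view. Returns nil for out-of-bounds or mid-scalar offsets.
func stringRange(fromUTF8Bytes bytes: Range<UInt32>, in string: String) -> Range<String.Index>? {
    let utf8 = string.utf8
    guard let start = utf8.index(utf8.startIndex, offsetBy: Int(bytes.lowerBound), limitedBy: utf8.endIndex),
          let end = utf8.index(utf8.startIndex, offsetBy: Int(bytes.upperBound), limitedBy: utf8.endIndex),
          let lower = start.samePosition(in: string),
          let upper = end.samePosition(in: string) else {
        return nil
    }
    return lower..<upper
}

let src = "let 变量 = 1"
if let r = stringRange(fromUTF8Bytes: 4..<10, in: src) {
    print(src[r]) // 变量
}
```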
> I haven't done any work with UTF-8 encoding, because it isn't natively supported by Apple's text systems. But I would like to make it possible.
Nah, I think it's OK as long as it's consistent. I made changes so that:

- InputEdit now uses str.lengthOfBytes(using: .utf16), so it calculates bytes in terms of UTF-16.
- I use changeRange.bytes.range to get a consistent view of the underlying string.

Ok great, I'm glad you've got something working!
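The UTF-16 byte bookkeeping described here can be sketched like this (only the arithmetic is shown; the field names startByte/oldEndByte/newEndByte are assumed to mirror tree-sitter's TSInputEdit):

```swift
import Foundation

// Sketch: compute the byte offsets an edit would need, in UTF-16 bytes.
// NSRange values count UTF-16 code units, so bytes are offsets times 2.
let editRange = NSRange(location: 4, length: 1)  // replace "v" in "let v = 1"
let replacement = "value"

let startByte = UInt32(editRange.location * 2)      // 8
let oldEndByte = UInt32(NSMaxRange(editRange) * 2)  // 10
let newEndByte = startByte + UInt32(replacement.lengthOfBytes(using: .utf16)) // 18

print(startByte, oldEndByte, newEndByte)
```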
If you are interfacing directly with the text system via UITextView, you might want to take a look at Neon. Aside from handling all of these details, it also has a lot of high-level features you might like as well.
A quick example to reproduce is to parse this markdown:

where the dumpTree function simply walks through the tree from the root to dump all nodes. It prints:

You see there are many bogus ranges in the changed ranges when the total byte range is only 0...9. Am I missing anything?

I am using the main branch (Swift Package Manager shows the version as 0.20.9).