ChimeHQ / SwiftTreeSitter

Swift API for the tree-sitter incremental parsing system
BSD 3-Clause "New" or "Revised" License

tree.changedRanges returns wrong range #25

Closed lijunsong closed 5 months ago

lijunsong commented 6 months ago

A quick example to reproduce is to parse this Swift snippet:

        let language = Language(language: tree_sitter_swift())
        let parser = Parser()
        try! parser.setLanguage(language)
        let tree = parser.parse("let v = 1")
        tree?.edit(InputEdit(
            startByte: UInt32(0),
            oldEndByte: UInt32(8),
            newEndByte: UInt32(9),
            startPoint: Point(row: 0, column: 0),
            oldEndPoint: Point(row: 0, column: 8),
            newEndPoint: Point(row: 0, column: 9)
        ))
        let newTree = parser.parse(tree: tree, string: "let v1 = 1")
        let changeRanges = tree?.changedRanges(from: newTree!)
        print("changed ranges are \(changeRanges!)")
        dumpTree(node: newTree?.rootNode, depth: 1)

where the dumpTree function simply walks through the tree root to dump all nodes.

It prints

changed ranges are [<TSRange 6..<9, {0, 6}..<{0, 9}>, <TSRange 11..<12, {0, 11}..<{0, 12}>, <TSRange 13..<16, {0, 13}..<{0, 16}>, <TSRange 17..<20, {0, 17}..<{0, 20}>]
  |- <property_declaration range: {0, 10} childCount: 4>
    |- <value_binding_pattern range: {0, 3} childCount: 1>
      |- <let range: {0, 3} childCount: 0>
    |- <pattern range: {4, 2} childCount: 1>
      |- <simple_identifier range: {4, 2} childCount: 0>
    |- <= range: {7, 1} childCount: 0>
    |- <integer_literal range: {9, 1} childCount: 0>

You can see there are many bogus ranges in the changed ranges, even though the whole string is only 9 bytes (0..<9).

Anything I am missing?

I am using the main branch (swift package shows the version is 0.20.9).

mattmassicotte commented 6 months ago

Great question! You've prompted me to write some documentation to try to help cover this. The values in those TSRange objects are in bytes, not characters. They must be translated back to UTF-16 if you want NSRange-compatible values, which is the default. I also think there may be a problem with your InputEdit byte calculations...

Does any of this help?

https://github.com/chimeHQ/SwiftTreeSitter?tab=readme-ov-file#range-translation
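For reference, assuming the UTF-16/NSRange convention described there, the minimal edit for "let v = 1" → "let v1 = 1" (a pure insertion of "1" after "v") might look like the following sketch. The byte values here are unverified and simply apply the 2-bytes-per-UTF-16-code-unit rule:

```swift
// Sketch: the edit expressed in UTF-16 bytes (2 bytes per code unit).
// The insertion point is after "v": character offset 5 -> byte offset 10.
tree?.edit(InputEdit(
    startByte: 10,   // 5 * 2
    oldEndByte: 10,  // nothing replaced, pure insertion
    newEndByte: 12,  // one inserted code unit ("1") = 2 bytes
    startPoint: Point(row: 0, column: 10),
    oldEndPoint: Point(row: 0, column: 10),
    newEndPoint: Point(row: 0, column: 12)
))
```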

lijunsong commented 6 months ago

This might be the right direction to explain the "bogus" range, but could you help me understand a bit more here:

First, what does this example have to do with UTF-16? All strings in this example are encoded as UTF-8 in Swift.

I printed some values out:

print("let v = 1".lengthOfBytes(using: .utf8)) // 9
print("let v = 1".lengthOfBytes(using: .utf16)) // 18

Second, my understanding of the byte calculation is this: a UTF-8 string has the same byte ranges as plain ASCII when it contains only ASCII characters (ASCII is a subset of UTF-8, not a different encoding, so the ranges and indexing are exactly the same).

But in your code where

public extension NSRange {
    var byteRange: Range<UInt32> {
        let lower = UInt32(location * 2)
        let upper = UInt32(NSMaxRange(self) * 2)

        return lower..<upper
    }
}

public extension Range where Bound == UInt32 {
    var range: NSRange {
        let start = lowerBound / 2
        let end = upperBound / 2

        return NSRange(start..<end)
    }
}

These seem to always assume every unit is 2 bytes, which seems wrong to me.

mattmassicotte commented 6 months ago

These seem to always assume every unit is 2 bytes, which seems wrong to me.

Yes, very wrong if NSRange were a generalized index API! But it is assumed throughout this system that it is really an API to NSString, and that uses UTF-16 exclusively. If you want to operate on Swift strings natively with UTF-8 encoding, it should be possible, but it will require some manual work from you.

I use this almost entirely with the Apple text system infrastructure, which is built around NSString. Are you doing something different?
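To illustrate, under that NSString/UTF-16 assumption the two extensions quoted above round-trip an NSRange through the byte representation by multiplying and dividing by two. A self-contained sketch (the extensions restated with a `Bound == UInt32` constraint so the snippet compiles on its own):

```swift
import Foundation

public extension NSRange {
    // UTF-16 code units -> tree-sitter bytes: 2 bytes per code unit.
    var byteRange: Range<UInt32> {
        let lower = UInt32(location * 2)
        let upper = UInt32(NSMaxRange(self) * 2)
        return lower..<upper
    }
}

public extension Range where Bound == UInt32 {
    // tree-sitter bytes -> UTF-16 code units.
    var range: NSRange {
        let start = lowerBound / 2
        let end = upperBound / 2
        return NSRange(start..<end)
    }
}

// "let v = 1" has 9 UTF-16 code units, so 18 bytes in tree-sitter's view.
let whole = NSRange(location: 0, length: 9)
print(whole.byteRange)        // 0..<18
print(whole.byteRange.range)  // {0, 9}
```

This is also why the ranges 11..<12, 13..<16, and 17..<20 in the output above are not "bogus": they are byte offsets into an 18-byte (UTF-16) view of the string, not character offsets.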

lijunsong commented 6 months ago

Interesting but I am confused now:

  1. If this library is only for UTF-16, shouldn't Parser.parse() take an NSString instead of a String?
  2. The dumped tree ranges seem correct; only the changedRanges result uses UTF-16.

I was trying it in UIKit. UITextView's var text: String! is of type String.

mattmassicotte commented 6 months ago

This library is not only for UTF-16. It can support UTF-8 if you want, but you'll have to do a lot more work. And it will make all of the NSRange-based APIs incorrect, because I've defined those to use UTF-16 to match Foundation.

You can work in UTF-8 with this call:

https://github.com/ChimeHQ/SwiftTreeSitter/blob/a1d25f2f3c5a33369a74a8103b4e5b2c39ba481f/Sources/SwiftTreeSitter/Parser.swift#L110

I've done all the internal encoding handling and bridging required to make Swift strings work as expected. This is also why the dumped tree seems right. But it will only be correct for strings where the UTF-8 and UTF-16 representations are the same.

And, just to make things more complex, UITextView's text value is an NSString, but it is transparently bridged to String in Swift.

lijunsong commented 6 months ago

I agree. A few things break here. For example, I expect this code to work

        let language = Language(language: tree_sitter_swift())
        let parser = Parser()
        try! parser.setLanguage(language)
        let src = "let 变量 = 1"
        let tree = parser.parse(tree: nil as Tree?, encoding: TSInputEncodingUTF8, readBlock: { i, p -> Data? in
            let idx = src.index(src.startIndex, offsetBy: i)
            if idx == src.endIndex {
                return nil
            }
            let data = src[idx...idx].data(using: .utf8)
            print("src parse request position \(i): \(data)")
            return data
        })
        dumpTree(node: tree?.rootNode, depth: 1)

the output shows

src parse request position 0: Optional(1 bytes)
src parse request position 1: Optional(1 bytes)
src parse request position 2: Optional(1 bytes)
src parse request position 3: Optional(1 bytes)
src parse request position 4: Optional(3 bytes)
src parse request position 3: Optional(1 bytes)
src parse request position 4: Optional(3 bytes)
src parse request position 7: Optional(1 bytes)
src parse request position 8: Optional(1 bytes)
src parse request position 9: Optional(1 bytes)
src parse request position 8: Optional(1 bytes)
src parse request position 9: Optional(1 bytes)
  |- <property_declaration range: {0, 5} childCount: 4>
    |- <value_binding_pattern range: {0, 1} childCount: 1>
      |- <let range: {0, 1} childCount: 0>
    |- <pattern range: {2, 1} childCount: 1>
      |- <simple_identifier range: {2, 1} childCount: 0>
    |- <= range: {3, 1} childCount: 0>
    |- <integer_literal range: {4, 1} childCount: 0>

where the range looks wrong in the tree this time.

All I need is a direct binding to ts_parser_parse without any assumptions:

TSTree *ts_parser_parse(
  TSParser *self,
  const TSTree *old_tree,
  TSInput input
);
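A direct binding might be sketched roughly as follows. This is uncompiled and assumes the underlying tree-sitter C module (TSInput, TSInputEncodingUTF8, ts_parser_parse) is importable from Swift; SourceBuffer and parseUTF8 are hypothetical names. The key constraint is that the read callback must return pointers that stay valid between calls, so the UTF-8 bytes are copied into stable storage first:

```swift
import Foundation

// Holds a stable UTF-8 copy of the source so the read callback can hand
// tree-sitter pointers that remain valid between invocations.
final class SourceBuffer {
    let bytes: UnsafeMutableBufferPointer<CChar>

    init(_ string: String) {
        let utf8 = Array(string.utf8CString) // NUL-terminated UTF-8
        bytes = .allocate(capacity: utf8.count)
        _ = bytes.initialize(from: utf8)
    }

    deinit { bytes.deallocate() }

    var byteCount: UInt32 { UInt32(bytes.count - 1) } // exclude trailing NUL
}

func parseUTF8(parser: OpaquePointer, source: String, oldTree: OpaquePointer?) -> OpaquePointer? {
    let buffer = SourceBuffer(source)
    let payload = Unmanaged.passRetained(buffer).toOpaque()
    defer { Unmanaged<SourceBuffer>.fromOpaque(payload).release() }

    let input = TSInput(
        payload: payload,
        read: { payload, byteIndex, _, bytesRead in
            let buffer = Unmanaged<SourceBuffer>.fromOpaque(payload!).takeUnretainedValue()
            guard byteIndex < buffer.byteCount else {
                bytesRead?.pointee = 0
                return nil
            }
            // Hand back the remainder of the source in one chunk.
            bytesRead?.pointee = buffer.byteCount - byteIndex
            return UnsafePointer(buffer.bytes.baseAddress!) + Int(byteIndex)
        },
        encoding: TSInputEncodingUTF8
    )

    return ts_parser_parse(parser, oldTree, input)
}
```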

I'll take a look at how I can do it myself.

Feel free to close this issue. Appreciate the discussion!

lijunsong commented 6 months ago

While we're still on it, I have another question on the byte offset this library provides:

Let's assume everything we've discussed works out fine: how do I convert the changedRanges result (which provides byte offsets) back into Swift String/NSString offsets? I can't find an efficient way to do the conversion.

mattmassicotte commented 6 months ago

All I need is a direct binding to ts_parser_parse without any assumptions

I haven't done any work with UTF-8 encoding, because it isn't natively supported by Apple's text systems. But I would like to make it possible.

However, the node printing code does assume UTF-16 (it is using NSRange). So, dumpTree is giving you incorrect values here. There's no way to change this today, but I think some changes to the Node type could make this possible.

mattmassicotte commented 6 months ago

Let's assume everything we've discussed works out fine: how do I convert the changedRanges result (which provides byte offsets) back into Swift String/NSString offsets? I can't find an efficient way to do the conversion.

Converting Range<UInt32> byte ranges back to UTF-16 NSRange can be done with the .range extension. Converting UTF-8 data is much more complex, which is another reason I didn't add built-in support for it.
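For the UTF-8 case, one possible conversion using only standard library APIs is sketched below. The stringRange helper is hypothetical, and note it walks the UTF-8 view from the start, so it is O(n) per lookup; an editor would likely want to cache or batch these conversions:

```swift
/// Hypothetical helper: map a UTF-8 byte range (what tree-sitter reports
/// when parsing with TSInputEncodingUTF8) onto a Range<String.Index>.
/// Returns nil if an offset is out of bounds or not on a boundary.
func stringRange(utf8ByteRange byteRange: Range<Int>, in string: String) -> Range<String.Index>? {
    let utf8 = string.utf8
    guard
        let start = utf8.index(utf8.startIndex, offsetBy: byteRange.lowerBound, limitedBy: utf8.endIndex),
        let end = utf8.index(utf8.startIndex, offsetBy: byteRange.upperBound, limitedBy: utf8.endIndex),
        let lower = start.samePosition(in: string),
        let upper = end.samePosition(in: string)
    else { return nil }
    return lower..<upper
}

let src = "let 变量 = 1"
// "变量" occupies UTF-8 bytes 4..<10 (3 bytes per character).
if let r = stringRange(utf8ByteRange: 4..<10, in: src) {
    print(src[r]) // 变量
}
```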

lijunsong commented 6 months ago

I haven't done any work with UTF-8 encoding, because it isn't natively supported by Apple's text systems. But I would like to make it possible.

Nah, I think it's OK, as long as it's consistent. I made changes such that:

  1. InputEdit now uses str.lengthOfBytes(using: .utf16), so it calculates bytes in terms of UTF-16.
  2. and then the changedRanges result's byte ranges are converted back to NSRange using changeRange.bytes.range, giving a consistent view of the underlying string.
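Those two steps might be packaged roughly like this (a sketch; inputEditBytes is a hypothetical helper applying the 2-bytes-per-UTF-16-code-unit convention):

```swift
import Foundation

// Hypothetical helper: derive the byte values for an InputEdit from an
// NSRange (UTF-16 code units) plus the replacement string, using the
// same 2-bytes-per-code-unit convention as the library's NSRange APIs.
func inputEditBytes(replacing range: NSRange, with replacement: String) -> (start: UInt32, oldEnd: UInt32, newEnd: UInt32) {
    let start = UInt32(range.location * 2)
    let oldEnd = UInt32(NSMaxRange(range) * 2)
    let newEnd = start + UInt32(replacement.lengthOfBytes(using: .utf16))
    return (start, oldEnd, newEnd)
}

// Inserting "1" after "v" in "let v = 1":
let edit = inputEditBytes(replacing: NSRange(location: 5, length: 0), with: "1")
print(edit) // (start: 10, oldEnd: 10, newEnd: 12)
```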

mattmassicotte commented 6 months ago

Ok great, I'm glad you've got something working!

If you are interfacing directly with the text system via UITextView, you might want to take a look at Neon. Aside from handling all of these details, it also has a lot of high-level features you might like as well.