`char_range()` function for indexing into source strings

davidanthoff commented 1 month ago

We have a crash from the VS Code extension that seems to originate from JuliaSyntax. Repo steps are:

pkg> dev Sunny
Run
```
using JuliaSyntax
```

src = read(joinpath(homedir(), ".julia/dev/Sunny/test/test_tensors.jl"), String)

tree = parseall(SyntaxNode, src, filename="foo.jl")

i = JuliaSyntax.last_byte(tree[2][end][end])

src[i]


That crashes with

ERROR: StringIndexError: invalid index [2142], valid nearby indices [2140]=>'′', [2143]=>'\r'



I assume that `last_byte` should always return a valid index, right? So this seems like a bug.

The rest of this issue is some speculation from my end, might all be wrong :)

I started looking a bit around in the JuliaSyntax code, and there seems to be a fair bit of code that assumes 1-byte characters, which strikes me as incorrect? Or maybe this is carefully only done when there is a guarantee that only 1-byte codepoints can appear? An example is https://github.com/JuliaLang/JuliaSyntax.jl/blob/1d950817b8ab62a3be2464f4d5b097aa68c17042/src/parse_stream.jl#L519 Doesn't that (incorrectly) assume that the character at index position `last_byte` is only one code unit? Generally if I search the repo for `- 1` or `+ 1` I see a fair bit of code where I would have assumed that a `prevind` or `nextind` would be needed?

KristofferC commented 1 month ago

Why would indexing into a string with the last_byte make sense? Seems the same conceptually as doing "α"[sizeof("α")].

davidanthoff commented 1 month ago

Yeah, I also realized, I should probably be using range instead of using first_byte and last_byte. At least for range I assume it should return valid indices, right? My PR would fix that, the repo for the problem with range is:

using JuliaSyntax

src = read(joinpath(homedir(), ".julia/dev/Sunny/test/test_tensors.jl"), String)

tree = parseall(SyntaxNode, src, filename="foo.jl")

code_range = range(tree[2][end][end])

src[code_range]

which also errors.

DaniGlez commented 1 month ago

If anyone else is facing this issue, a way to bypass it until there's a permanent fix is to find your lines ending in non-ASCII Unicode characters and make that no longer the case.

MilesCranmer commented 1 month ago

Thanks @DaniGlez!

Here's a regexp (VSCode flavour) for finding non-ASCII characters that end lines:

[^\x00-\x7f]$

c42f commented 1 month ago

there seems to be a fair bit of code that assumes 1-byte characters

JuliaSyntax works on byte indices internally and it's (usually) quite careful to convert to valid character indices in the occasional cases where this is required.

So unicode generally works and there's a pile of tests for this in test/source_files.jl

However there must be a bug in an edge case for non-ascii at line end

c42f commented 1 month ago

Nice properties of byte ranges are that (a) the green tree exactly covers the source buffer with unit ranges, with no gaps (b) the ranges (in principle) require no further interpreting of unicode chars to pull out a string (not yet exploited in practice! as this bug shows!)

davidanthoff commented 1 month ago

I think all I'm looking for is some function that returns a range that has valid string indices and can be used directly with the underlying string. Does that exist? If I understand @c42f's comment in my PR correct, then Base.range (to be renamed to byte_range) is also not supposed to do that, so is there any other way?

As a user of this, I would also say that at the level of SyntaxNodes, users are probably thinking in terms of string indices, not byte indices. I understand that byte stuff makes sense at the green tree level, but isn't that an abstraction leakage if I have to deal with this stuff when dealing with SyntaxNodes? I think it would be nice if there was a (documented) API at the SyntaxNode level that just operates on string index level.

For now I'm going to hack around all of this downstream, we can't have this bug crash in the wild, but it would be much nicer if there was some "official" way to get these indices.

c42f commented 1 month ago

I think all I'm looking for is some function that returns a range that has valid string indices and can be used directly with the underlying string. Does that exist?

Not exactly. Here's what exists so far:

JuliaSyntax uses SyntaxNode ranges in conjunction with SourceFile which does the index conversion (as well as other useful stuff like line and column number computation).
If you just want the source text covered by the node you can use sourcetext(node).

If I understand @c42f's comment in my PR correct, then Base.range (to be renamed to byte_range) is also not supposed to do that

Correct. By design, JuliaSyntax works in terms of byte ranges and only does the conversion to character indices when necessary.

For now I'm going to hack around all of this downstream

Ok. If you can point me at the place you're using this downstream, I'm sure we can work out the right design to unhack things in the future :-)

davidanthoff commented 1 month ago

My hack is https://github.com/julia-vscode/TestItemDetection.jl/blob/e19b96801a10718ca39b2c15cad18d0aae123993/src/packagedef.jl#L2. Essentially I just want that :)

In terms of design, I generally just guess that "end-users" of this package that use SyntaxNode want to deal with string indices, not byte indices. From the consumer's point of view, the whole byte index stuff strikes me as an implementation detail that I shouldn't really be exposed to in terms of the API I deal with :)

I saw the SourceFile type. We have a very similar thing in JuliaWorkspaces, and it is a bit different now than I originally described. Would be nice if we could unify all of that at some point to use the same types.

c42f commented 1 month ago

Essentially I just want that :)

Right - this is pretty much what SourceFile does here:

https://github.com/JuliaLang/JuliaSyntax.jl/blob/a63e8bb652bba267397ed41aa7cf50aa62beaa63/src/source_files.jl#L114

Except that SourceFile also supports a byte offset (for when we are logically only parsing a piece of the text, not the whole thing). By the way, SourceFile has prevind, etc, supported.

I feel the core issue here is the question "what's a good representation of source code?" SourceFile isn't the whole answer, and neither is a string (probably). Essentially we're back to wanting a proper answer to #190 ?

c42f commented 1 month ago

For now I guess a quick fix would be to add a char_range() function which would do what your our_range() does? It's not wonderful (not a solution to #190), but it would be an improvement.

For GreenNode, char_range would need additional file::SourceFile and position::Int arguments. Or some such.

davidanthoff commented 1 month ago

For now I guess a quick fix would be to add a char_range() function

Yes, I think that would be great.

For GreenNode, char_range would need additional file::SourceFile and position::Int arguments.

I think it would also be reasonable to say that the "char"-based interface is only available at the SyntaxNode level, if someone is digging down to the green node level, they just have to deal with this themselves? That is the more low level interface in any case.

JuliaLang / JuliaSyntax.jl

`char_range()` function for indexing into source strings #457