bblfsh / libuast

Apache License 2.0
20 stars 15 forks source link

feature request: provide Unicode positions #102

Open vmarkovtsev opened 5 years ago

vmarkovtsev commented 5 years ago

As we have recently found out, the positions of UAST nodes must be measured in bytes, not in runes. However, we require working with strings, so we had to build the conversion ourselves. This code is not straightforward. Besides, all kinds of crazy things may happen, e.g. the line number can change.

I suggest adding the function to recalculate positions to Babelfish, either as an API extension or to the clients. It is not cool to calculate them by hand, and they are different from IDE (expected) positions anyway.

bzz commented 5 years ago

Not sure how generally applicable is that across all the drivers, but I know @creachadair rised this concern before and @dennwc did some work on it e.g at js driver, in order to make this happen https://github.com/bblfsh/javascript-driver/pull/51

dennwc commented 5 years ago

This was a reverse operation: Babelfish expects bytes positions, but JS was writing Unicode positions.

The feature request makes total sense, but I think we can't add Unicode positions as a field - it will make UAST files even larger.

So I propose to send the whole source file as a part of UAST. Then we can recalculate Unicode positions and even tokens on demand.

vmarkovtsev commented 5 years ago

It is also possible to add a parameter to the parse request that the client wants Unicode positions. Normally you don't need both at the same time.

dennwc commented 5 years ago

This makes sense. We can extend the parsing protocol with one more option for this. Still, the Position node will need some flag to signal that it's stored as Unicode position.

dennwc commented 5 years ago

As per discussion on Slack (non-public link), this sounds like a "wontfix" for the SDK:

@creachadair: The representation of position is tricky: Even languages that use Unicode internally do not necessarily agree on the distinction between codepoint offset and character. For different normalizations you may well get different offsets, and whether or not the normalization happens implicitly varies.

Byte offset is the only reliable universal baseline, and so the real question IMO is how to get codepoint offsets into the API. Personally, I think it makes sense to put the code for transforming a (bytes, byte-offset) pair into a codepoint-offset, line number, column offset, etc., in a common location. But I do not think it makes sense to plumb a bunch of switches through the APIs for this.

@vmarkovtsev: I wonder if it will help to force a specific normalization form in case Unicode positions are requested - that is, to return a different byte representation which does not have gotchas.

@creachadair: That might help in some cases, but not all: In some cases converting to a particular normal form requires lossy transformations. If you only need to go "forward" (e.g., read the file and answer questions about it) that may be OK. But if you want to apply suggestions or fixes to a file, the positions may no longer match after transformation.

Worse: After normalization the byte-for-byte encoding may change length, which means even the byte offsets get moved.

So: Yes, it's possible to force a specific encoding, lossily, but it's only safe in one direction.

And yes: Unicode is a mess.

vmarkovtsev commented 5 years ago

@dennwc Shall we move this request to the Python client then? As I wrote, somebody has to calculate the positions due to the "business logic" requirements, and I am not happy that this is currently the ML team's responsibility.

I mean, we can say that this is wontfix for clever and legitimate technical reasons (which happen in 0.01% of all the cases though), but then something will break on our side, and the product will be broken, too. cc @smola

dennwc commented 5 years ago

@vmarkovtsev I only meant the SDK. This should probably be in clients or libuast.

vmarkovtsev commented 5 years ago

BTW I am perfectly fine if the positions calculator sometimes responds that it cannot infer Unicode positions reliably (because of crazy normalization, etc.) and returns nil. We could simply skip such complex files at roughly no practical loss.

dennwc commented 5 years ago

The problem is only that we can't provide those positions in the UAST, but we can always calculate them separately. This way the user won't be able to change positions directly and we won't hit cases mentioned by @creachadair.

bzz commented 5 years ago

@dennwc @creachadair how do you think, what would be the best place to move this issue? I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?

creachadair commented 5 years ago

@dennwc @creachadair how do you think, what would be the best place to move this issue? I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?

That seems like a reasonable approach. As long as we're careful not to make the API too specific to Python, we should be able to shift it into the shared library later.

dennwc commented 5 years ago

This will be done in libuast for Python, so it makes sense to move the issue there. Doing it now.

dennwc commented 5 years ago

Not sure why, but I don't have permissions to move it to libuast specifically. Can move to other repositories, though.

@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!

creachadair commented 5 years ago

@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!

Done. It turns out libuast had Maintainers set to "Write" instead of "Admin". I have now fixed it, and they all three have "Admin" for that group.