Open vmarkovtsev opened 5 years ago
Not sure how generally applicable is that across all the drivers, but I know @creachadair rised this concern before and @dennwc did some work on it e.g at js driver, in order to make this happen https://github.com/bblfsh/javascript-driver/pull/51
This was a reverse operation: Babelfish expects bytes positions, but JS was writing Unicode positions.
The feature request makes total sense, but I think we can't add Unicode positions as a field - it will make UAST files even larger.
So I propose to send the whole source file as a part of UAST. Then we can recalculate Unicode positions and even tokens on demand.
It is also possible to add a parameter to the parse request that the client wants Unicode positions. Normally you don't need both at the same time.
This makes sense. We can extend the parsing protocol with one more option for this. Still, the Position
node will need some flag to signal that it's stored as Unicode position.
As per discussion on Slack (non-public link), this sounds like a "wontfix" for the SDK:
@creachadair: The representation of position is tricky: Even languages that use Unicode internally do not necessarily agree on the distinction between codepoint offset and character. For different normalizations you may well get different offsets, and whether or not the normalization happens implicitly varies.
Byte offset is the only reliable universal baseline, and so the real question IMO is how to get codepoint offsets into the API. Personally, I think it makes sense to put the code for transforming a (bytes, byte-offset) pair into a codepoint-offset, line number, column offset, etc., in a common location. But I do not think it makes sense to plumb a bunch of switches through the APIs for this.
@vmarkovtsev: I wonder if it will help to force a specific normalization form in case Unicode positions are requested - that is, to return a different byte representation which does not have gotchas.
@creachadair: That might help in some cases, but not all: In some cases converting to a particular normal form requires lossy transformations. If you only need to go "forward" (e.g., read the file and answer questions about it) that may be OK. But if you want to apply suggestions or fixes to a file, the positions may no longer match after transformation.
Worse: After normalization the byte-for-byte encoding may change length, which means even the byte offsets get moved.
So: Yes, it's possible to force a specific encoding, lossily, but it's only safe in one direction.
And yes: Unicode is a mess.
@dennwc Shall we move this request to the Python client then? As I wrote, somebody has to calculate the positions due to the "business logic" requirements, and I am not happy that this is currently the ML team's responsibility.
I mean, we can say that this is wontfix for clever and legitimate technical reasons (which happen in 0.01% of all the cases though), but then something will break on our side, and the product will be broken, too. cc @smola
@vmarkovtsev I only meant the SDK. This should probably be in clients or libuast
.
BTW I am perfectly fine if the positions calculator sometimes responds that it cannot infer Unicode positions reliably (because of crazy normalization, etc.) and returns nil
. We could simply skip such complex files at roughly no practical loss.
The problem is only that we can't provide those positions in the UAST, but we can always calculate them separately. This way the user won't be able to change positions directly and we won't hit cases mentioned by @creachadair.
@dennwc @creachadair how do you think, what would be the best place to move this issue? I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?
@dennwc @creachadair how do you think, what would be the best place to move this issue? I would suggest python-client first and then, if it proves to be required, implement the decided logic in libuast for other clients to benefit. WDYT?
That seems like a reasonable approach. As long as we're careful not to make the API too specific to Python, we should be able to shift it into the shared library later.
This will be done in libuast
for Python, so it makes sense to move the issue there. Doing it now.
Not sure why, but I don't have permissions to move it to libuast
specifically. Can move to other repositories, though.
@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!
@creachadair Can you please check that permissions on SDK, libuast, bblfshd, clients and drivers are the same? Thanks!
Done. It turns out libuast
had Maintainers set to "Write" instead of "Admin". I have now fixed it, and they all three have "Admin" for that group.
As we have recently found out, the positions of UAST nodes must be measured in bytes, not in runes. However, we require working with strings, so we had to build the conversion ourselves. This code is not straightforward. Besides, all kinds of crazy things may happen, e.g. the line number can change.
I suggest adding the function to recalculate positions to Babelfish, either as an API extension or to the clients. It is not cool to calculate them by hand, and they are different from IDE (expected) positions anyway.