Incorrect handling of multibyte UTF-16 encodings

Vtec234 commented 4 years ago

Describe the bug It seems the handling of multibyte UTF-16 encodings is incorrect in lsp-mode.

To Reproduce Create a file with just these contents: "🍋" - a single lemon emoji. Place the cursor after the lemon and type a character ("l", say, for lemon). Most emojis including this one are represented by two UTF-16 bytes, so since LSP specifies offsets as in a UTF-16 string representation, this is at column 2.

But lsp-mode sends column 1:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":1},"contentChanges":[{"range":{"start":{"line":0,"character":1},"end":{"line":0,"character":1}},"rangeLength":0,"text":"l"}]}}

Expected behavior Compare with e.g. the VSCode sample client, which sends 2 as it should:

{"jsonrpc":"2.0","method":"textDocument/didChange","params":{"textDocument":{"uri":"file:///home/w/utf16.lean","version":56},"contentChanges":[{"range":{"start":{"line":0,"character":2},"end":{"line":0,"character":2}},"rangeLength":0,"text":"l"}]}}

Which Language Server did you use Custom one added via the tutorial. lsp-mode version 7.0.1.

OS Linux

matklad commented 3 years ago

We are hitting this in rust-analyzer: https://github.com/rust-analyzer/rust-analyzer/issues/4263#issuecomment-762800167.

Out of curiosity, how hard would it be for Emacs to report offsets in ut8 coordinate space (I believe Emacs uses that internally)? I'd be willing to implement a protocol extension to use utf-8 offsets throughout.

I could, but I would be reluctant to implement unicode codepoints coordinates: utt8 skips coordinates transcoding altogether, while unicode would just replace "map utf8 to utf16" with "map utf8 to unicode", which imo isn't meaningfully different.

yyoncho commented 3 years ago

I don't think it will be hard to implement. We havennt fixed that issue due to the fact that it has a relatively low impact and it will affect performance.

bkchr commented 3 years ago

Currently the problem for me is, that this makes rust-analyzer panic :D

yyoncho commented 3 years ago

So, the priority of the issue now is bigger.

bkchr commented 3 years ago

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

yyoncho commented 3 years ago

@bkchr you have to fix both functions lsp--position-to-point and lsp--point-to-position. Also, AFAIK eglot has that function implemented.

nbfalcon commented 3 years ago

As a side note, clangd supports UTF-8 directly, if a special option is specified.

matklad commented 3 years ago

@nbfalcon thanks, I didn't realize that the extension already exists. I actually would prefer to fix this on the server-side than, to create social pressure to actually officially adopt that into the protocol.

EDIT: https://github.com/rust-analyzer/rust-analyzer/issues/7453

nbfalcon commented 3 years ago

Here is the relevant source. @matklad I agree that this would be great, since many servers/editors/libraries assume UTF-8 (well, everything aside from VSCode and JavaScript). There is an upstream bug about this already: https://github.com/microsoft/language-server-protocol/issues/376.

yyoncho commented 3 years ago

@bkchr

@yyoncho could you maybe give me some pointer into the code? So, I could try to fix this issue?

Did you start working on that?

bkchr commented 3 years ago

No

yyoncho commented 3 years ago

Ok, I will take a look.

yyoncho commented 3 years ago

@bkchr - pushed a proposed fix at - https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210

bkchr commented 3 years ago

Nice ty

petr-tik commented 3 years ago

As a side note, clangd supports UTF-8 directly, if a special option is specified.

Now that we have integration tests for clangd-9 on linux, this could be added as a regression test that adds offsetEncoding: ["utf-8"] in clientCapabilities during initialization

itamarst commented 3 years ago

Hi, what's the status of this? Just had rust-analyzer crash due to adding emoji in my code (on lsp-mode from a couple weeks ago).

matklad commented 3 years ago

On rust-analyzer's side, we fully implemented clangd's extension for UTF-8 offsets, so, on Emacs side you an either:

do utf-16 translation, which needs some code and CPU time
expose underlying byte offsets at zero cost (I believe Emacs internally uses something sufficiently close to utf-8 for this to work)

yyoncho commented 3 years ago

@itamarst the fix proposed at https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210 should work(after rebasing).

acowley commented 3 years ago

The patch @yyoncho links there seems to work for me with both clangd and rust-analyzer.

yyoncho commented 3 years ago

I have to benchmark it and if it is slow at least allow using utf8 when server supports it

wagk commented 2 years ago

Hi, I'm running into this issue right now, and I'm wondering what is currently blocking the patch from being merged into master?

scohen commented 1 year ago

@yyoncho Sorry for the ping, I've been working on the elixir language server, and just found this issue.

If the reason it hasn't been merged is solely due to performance, then I have a suggestion. You don't need to encode every character that you visit, you merely need to encode non-ascii characters (those that are >= 128) as you find them. Given the fact that the vast, vast majority of source code is ascii text, this shouldn't be a problem perf-wise.

If you haven't merged the fix for some other reason, please ping me, this is holding me up and causes the elixir-ls to produce incorrect results under lsp-mode. Utf8 on the server isn't really a fix, since older clients, windows and the lsp spec require utf-16.

Also, thanks for lsp-mode! It's great.

matklad commented 1 year ago

the lsp spec require utf-16.

These days, LSP spec allows utf8 support.

It also formally requires utf16 support, but that seems like a bad design decision in the original protocol, so I personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so =P

scohen commented 1 year ago

Allowing utf8 and requiring utf16 aren't mutually exclusive 😉 Agreed about utf16 being a bad decision, but that's what the spec says compliant clients need to support, and you should. I don't see a reason to hold off other than performance, and I'll be glad to work with anyone to make the perf impact tiny. Indeed, if you don't reallocate the entire buffer, and only reallocate single multibyte characters when you encounter them, there will likely be no performance impact at all for most documents.
the fix is almost there, we can get it over the line

scohen commented 1 year ago

personally would be quite happy with server/clients supporting only utf8 and pushing everyone else to do so

To be quite clear, such servers wouldn't be compliant with the lsp spec. Maybe a future version would remove utf16 support, but for now, servers are stuck handling both.

"This is the default and must always be supported by servers". It's wildly annoying, but we (server contributors) have to support it. Can you please make our job a little easier?

scohen commented 1 year ago

@yyoncho can we have a resolution for this? This bug has been open for two years, and you have a fix for it that needs a tiny optimization. I’m willing to help get you there, but I don’t know elisp very well. I’d be willing to help via zoom chat, pop in to irc, submit pseudocode, anything.

Failing that, you should mark this as wontfix, and throw an exception when multibyte characters are encountered in a file when the server is in utf-16 mode, as well as clearly indicating that lsp-mode doesn’t support utf-16 ranges.

The current implementation is broken and causes emacs to produce invalid code. This is not a tenable situation.

teor2345 commented 1 year ago

I just ran into this issue today, it still crashes rust-analyzer.

Can we get at least one of these fixes applied?

correct offsets: https://github.com/yyoncho/lsp-mode/commit/aac9b6a10611ce9fcb4c9fe272f2835f0b6bb210
declaring UTF-8 lsp mode: https://github.com/emacs-lsp/lsp-mode/issues/3344#issuecomment-1320677622

flying-sheep commented 1 year ago

A workaround exists now with https://github.com/emacs-lsp/lsp-mode/pull/3958

For spec compliance, utf-16 offset support still has to be implemented (this issue is not fixed)
If emacs internally uses a utf-8 buffer, utf-8 offset support would have the best performance with language servers that also use utf-8 buffers

emacs-lsp / lsp-mode

Incorrect handling of multibyte UTF-16 encodings #2080