EWSoftware / VSSpellChecker

A Visual Studio spell checker editor extension that checks the spelling of comments, strings, and plain text as you type. Supports configuration and various languages.

Pasting emoji into string crashes Visual Studio 2019 #218

Closed watfordjc closed 3 years ago

watfordjc commented 4 years ago

Yesterday I was trying to paste an emoji into a text string (to overwrite the img alt Twitter uses when copying text from a Tweet), and every time I pasted the fire engine emoji Visual Studio crashed. After the file recovery process, a banner at the top of the window suggested disabling this extension.

Extension enabled: Visual Studio crashes if I perform the action
Extension disabled: Visual Studio doesn't crash if I perform the action

Code Sample

This is the relevant line of my code, although I'm fairly sure the only thing of relevance is the string value:

Interop.UnsafeNativeMethods.DrawTextFromString(canvas, "Blue watch are getting ready to handover to green watch for the night. We will be back at 20:00 tomorrow night. Stay safe 🚒😀🚒", centredTopLeft.X, (int)Math.Round(startY), canvasDimensions.Width - (centredTopLeft.X * 2), canvasDimensions.Height - (int)Math.Round(startY), false, "Segoe UI Emoji", 90.0f, 700, "en-GB", brushes["textBrush"]);

This is the string copied from Twitter [Link to Tweet to comply with TOS] surrounded with double quotes:

"Blue watch are getting ready to handover to green watch for the night. We will be back at 20:00 tomorrow night. Stay safe Fire engineGrinning faceFire engine"

Thoughts

I'm not sure if just pasting 🚒 into a string causes the issue, whether pasting 🚒 over the selected text "Fire engine" causes the issue, or if pasting any emoji causes the issue - I haven't extensively tested it.

I'm also not sure at what point in the pasting/spellchecking process a change in text encoding is involved, but I'm fairly sure the emoji I copied was in UTF-8 (it was from a Web page) and string is UTF-16.
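For reference, here is a minimal sketch of how the fire engine emoji looks in the two encodings mentioned above. It only illustrates the encodings and says nothing about where the extension or the clipboard converts between them; the class and method names (EmojiEncodingDemo, Utf16Units, Utf8Bytes, CodePoints) are made up for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only (not part of VSSpellChecker)
public static class EmojiEncodingDemo
{
    // U+1F692 FIRE ENGINE written as its UTF-16 surrogate pair
    public const string FireEngine = "\uD83D\uDE92";

    // Number of UTF-16 code units, which is what string.Length reports
    public static int Utf16Units(string s) => s.Length;

    // Number of bytes when the same text is encoded as UTF-8
    public static int Utf8Bytes(string s) => Encoding.UTF8.GetByteCount(s);

    // Number of Unicode code points, counting a valid surrogate pair once
    public static int CodePoints(string s)
    {
        int count = 0;

        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // step over the low half of the pair

            count++;
        }

        return count;
    }
}
```

The emoji is a single code point that takes four bytes in UTF-8 and two code units (a surrogate pair) in a UTF-16 string, so any code that walks the string one char at a time sees two "characters" where the user sees one.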

I know with the extension disabled the monochrome emoji is the same colour as string text when pasted (red*) and turns purple* after some delay, so there is also syntax highlighting involved.

* Visual Experience Color Theme: Blue; Editor Color Scheme for C#: Visual Studio 2017

Software

EWSoftware commented 4 years ago

Someone reported the same issue pasting a trash can Unicode character into a string (#216). As with that case, I'm unable to duplicate this in either VS 2017 or VS 2019. The version of Visual Studio and theme don't make a difference as far as I can tell. What language dictionary are you using? It's possible configuration settings could make a difference. Do you recall if you changed any of the configuration settings?

One thing you can try to work around the problem is to go into the configuration settings in the General category and set the Ignored Character Class option to either ignore non-Latin or ignore non-ASCII characters. That should cause it to ignore the emoji characters.

watfordjc commented 4 years ago

Ignore non-Latin stops the crashes.


Configuration Settings

Global Dictionary: en-GB
General Settings: everything is Yes except Treat underscores as separators.
Ignored character class: include all words.
C# Options: everything is No
Excluded expression: \[JsonPropertyName\(".*?"\)]
Code Analysis Dictionaries: All Yes except casing exceptions.
Recognised word handling: Treat all as ignored words.

Research

Potentially Relevant (also analysing what I type, as I type)

Other extensions

Project dependencies

Project Settings

Probably Irrelevant

Windows Language

Looking into how the clipboard works, my Windows settings may or may not be relevant.

When you close the clipboard, if it contains CF_TEXT data but no CF_LOCALE data, the system automatically sets the CF_LOCALE format to the current input language.

The system uses the code page associated with CF_LOCALE to implicitly convert from CF_TEXT to CF_UNICODETEXT.

A Note on Probability

I'm not usually pasting text from external sources into strings in my source code that isn't 7-bit ASCII. My own code is usually in 7-bit ASCII as a hangover from Web dev—character encoding bugs (smart quotes, ISO-8859-1 versus CP 1252, etc.) and keyboard layouts mean I'm more likely to type &pound; despite having a £ key on my keyboard and declaring my Web pages as utf-8 for years. This paragraph itself contains &mdash; as my keyboard doesn't have that key.

When the regex for parsing some IRC log lines wasn't working properly yesterday, I immediately replaced all the IRC control codes in the regex with the Unicode-escaped equivalent, and I think the first time I actually pasted a multi-byte character into Visual Studio was when I was moving all my hard-coded strings into resource editor and found ➡️ {0}… worked whereas \u2701\ufe0f {0}\u2026 didn't.

That is to say that the likelihood of me pasting something to cause this bug was already low. Other than a zero-width space for making URLs non-clickable after auto-parsing (Tweets, YouTube comments, etc.), I don't usually have non-ASCII characters in my clipboard. As someone that mostly does console stuff, I usually write the code to process data before the code for displaying data (the reason my code currently writes so many lines to debug output).

In this specific case I was doing things in reverse order to usual: writing the code for transforming data into something visual before writing the code for processing data. Normally the string wouldn't have been anything more than a variable name or property/return value so couldn't be pasted into – i.e. the current value in uncommitted code is match.Groups[10].Value.

NoelAbrahams commented 3 years ago

Ran into this today. The act of pasting did not crash VS, but when the adjoining text was edited, attempting to save the file caused the crash. The text is in a .json file and the emoji was 😄 (that's the emoji - not me smiling. This is me smiling 😄 )

EWSoftware commented 3 years ago

I was able to do some more testing on this and was finally able to duplicate the issue by using the English UK dictionary rather than the English US dictionary. Part of the problem is that the emojis are being included as part of the preceding or following word. The main failure occurs because NHunspell crashes when getting suggestions for a word containing an emoji character when it's using the UK dictionary.

The solution will be something along the lines of updating the word splitter to check for surrogate pairs in the string being split and if the pair is within the known ranges of emoji characters, split the word there rather than including the emoji character as part of the word.
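As a rough sketch of the approach described above (not the code that shipped in the extension), a splitter could check each position for a surrogate pair and break the word when the pair decodes to an emoji code point. Everything here is an assumption made for illustration: the EmojiAwareSplitter and SplitWords names, the simplified rule that any non-whitespace, non-punctuation character is part of a word, and especially the single U+1F300 to U+1FAFF range, which stands in for whatever "known ranges" the real fix uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical word splitter, for illustration only (not VSSpellChecker's)
public static class EmojiAwareSplitter
{
    // Assumption: one illustrative block range, not a complete emoji list
    private static bool IsEmojiCodePoint(int codePoint) =>
        codePoint >= 0x1F300 && codePoint <= 0x1FAFF;

    public static List<string> SplitWords(string text)
    {
        var words = new List<string>();
        int start = -1; // index where the current word began, or -1 if none

        for (int i = 0; i < text.Length; i++)
        {
            bool isPair = char.IsSurrogatePair(text, i);
            bool isEmoji = isPair && IsEmojiCodePoint(char.ConvertToUtf32(text, i));

            // Simplified rule: anything that is not an emoji, whitespace,
            // or punctuation counts as part of a word
            bool isWordChar = !isEmoji && !char.IsWhiteSpace(text[i]) &&
                !char.IsPunctuation(text[i]);

            if (isWordChar)
            {
                if (start < 0)
                    start = i;
            }
            else if (start >= 0)
            {
                // The emoji (or other separator) ends the word here
                words.Add(text.Substring(start, i - start));
                start = -1;
            }

            if (isPair)
                i++; // step over the low surrogate so the pair is never cut in half
        }

        if (start >= 0)
            words.Add(text.Substring(start));

        return words;
    }
}
```

With this sketch, the tail of the string from the original report splits into "Stay" and "safe" with the emojis dropped, so a suggestion lookup would never be handed a word containing a surrogate pair from that range. Supplementary characters outside the assumed range are still kept inside the word as a whole pair.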

A workaround for now is to set the Ignored Character Class option to ignore non-Latin or non-ASCII characters. A side effect is that it may miss misspelled words that abut emojis until the fix above is implemented.

EWSoftware commented 3 years ago

This issue has been fixed and a new release is available (v2021.1.23.0).