Debugger Does Not Handle Unicode/UTF-8 Characters Properly

mdowst commented 6 years ago

System Details

Operating system name and version: Windows Server 2012 R2
VS Code version: 1.24.1
PowerShell extension version: 1.7.0
Output from $PSVersionTable: PSVersion : 5.0.10586.117 PSCompatibleVersions : 1.0 2.0 3.0 4.0 5.0.10586.117 BuildVersion : 10.0.10586.117 CLRVersion : 4.0.30319.42000 WSManStackVersion : 3.0 PSRemotingProtocolVersion : 2.3 SerializationVersion : 1.1.0.1

1.24.1
24f62626b222e9a8313213fb64b10d741a326288
x64

Major  Minor  Build  Revision
-----  -----  -----  --------
1      7      0      0
bierner.markdown-preview-github-styles@0.1.2
formulahendry.code-runner@0.9.3
ms-python.python@2018.6.0
ms-vscode.PowerShell@1.7.0
robertohuertasm.vscode-icons@7.24.0
Tyriar.shell-launcher@0.2.0

Key   : PSVersion
Value : 5.0.10586.117
Name  : PSVersion

Key   : PSCompatibleVersions
Value : {1.0, 2.0, 3.0, 4.0...}
Name  : PSCompatibleVersions

Key   : BuildVersion
Value : 10.0.10586.117
Name  : BuildVersion

Key   : CLRVersion
Value : 4.0.30319.42000
Name  : CLRVersion

Key   : WSManStackVersion
Value : 3.0
Name  : WSManStackVersion

Key   : PSRemotingProtocolVersion
Value : 2.3
Name  : PSRemotingProtocolVersion

Key   : SerializationVersion
Value : 1.1.0.1
Name  : SerializationVersion

Issue Description

I've experienced an issue with the way the debugger handles non-ascii characters. If I create a script with a Unicode/UTF-8 character in, for example the sigma symbol "∑", when I press F5 to run the script through the debugger it translates the symbol like this, "âˆ‘". If I highlight the text and run it using F8, it displays the characters correctly.

I've tested this on 3 different machines, one Windows Server 2012 R2, which I included the system details for here. I also tested it on a Windows Server 2016 with the same versions of VS Code and PowerShell extensions and I saw the same results. However, I also tested on it Windows 10 1709, again with the same versions, and it did not have this issue. The only difference between the systems is the Windows 10 system listed the architecture as ia32 and the two servers are x64. Also the Windows 10 system is on PowerShell version 5.1.15063.1088 and the 2016 is on version 5.1.14393.2248.

Here is an example of the code I am running

$string = "Greek sigma symbol: ∑"
$string

Attached Logs

logs.zip

rjmholt commented 6 years ago

Hi @mdowst, thanks for opening an issue!

We've been aware of an ongoing encoding issue, but it's been tricky to work out where the problem lies. Hopefully this issue will shed some more light.

I have a couple of questions if that's ok:

How did you write/create the script and the ∑ in it originally?
Do you know if the ∑ you used was encoded in UTF-8, UTF-16LE or UTF-16BE? Last time there was an encoding issue, I was able to work out the original encoding from the symbols that came out, but assuming âˆ‘ is a CP1252 rendering of some other codepoints, I'm not sure how that appears from ∑.
What does $OutputEncoding return when entered in the integrated console? (@mklement0 has a really interesting StackOverflow answer about PowerShell's encoding handling, which may have more ideas for us)
Have you set VSCode's "files.encoding" option? If so, what to?
If you save the script on the filesystem and look at the encoding, what is that encoding?

mklement0 commented 6 years ago

To add to @rjmholt's helpful tips:

The likeliest explanation is that your *.ps1 file is saved as UTF-8 without a BOM, which means that Windows PowerShell (unlike PS Core) will interpret it as "ANSI"-encoded and therefore misinterpret it.

The solution is to always use UTF-8 with BOM as the character encoding, as both Windows PowerShell and PowerShell Core interpret that correctly.

The tricky part is that modern editors such as Visual Studio Code create BOM-less UTF-8 files by default, so you have to remember to explicitly change the encoding to UTF-8 with BOM.

mklement0 commented 6 years ago

@rjmholt:

If what I suspect is the true source of the problem, the behavior is not a bug, but by design.

In Windows PowerShell you've always had to either use "ANSI"-encoded source code files (for characters in the system-locale extended-ASCII range only) or one of the standard Unicode encodings with BOM in order for non-ASCII characters in string literals to be recognized correctly.

You can reproduce the problem as follows:

# Note: [IO.File]::WriteAllText() writes UTF-8 files *without BOM* by default.
WinPS> [IO.File]::WriteAllText("$PWD/t.ps1", '"Greek sigma symbol: ∑"'); ./t.ps1
Greek sigma symbol: âˆ‘

The 3 bytes that make up the UTF-8 encoded ∑ char. are misinterpreted as individual "ANSI"-encoded chars.

mdowst commented 6 years ago

@rjmholt, thanks for your reply. I have provided the answers to your questions in-line below.

How did you write/create the script and the ∑ in it originally? Honestly there is no paricular reason I am using the ∑ symbol. I discovered this bug when I was writting another script that downloaded a file from Azure blob, parsed through some of the lines, then outputted to another file. I noticed that some characters in the file where giving these strange results. So I wrote a quick function to convert each character to its ascii value. I noticed on the characters it was giving the strange translation did not have an ascii value. So to test my theory I just grabbed the ∑ symbol because I knew for a fact it was not an ascii character. The orginal files contain company sensitive data, but if you need to see it, I may be able to extract the lines or parts of them with the non-ascii characters.
Do you know if the ∑ you used was encoded in UTF-8, UTF-16LE or UTF-16BE? Last time there was an encoding issue, I was able to work out the original encoding from the symbols that came out, but assuming âˆ‘ is a CP1252 rendering of some other codepoints, I'm not sure how that appears from ∑. I'm sorry I do not know the encoding. I literally copied it from the Wikipedia article https://en.wikipedia.org/wiki/Summation, because I know it was a non-ascii character.

What does $OutputEncoding return when entered in the integrated console? (@mklement0 has a really interesting StackOverflow answer about PowerShell's encoding handling, which may have more ideas for us) It returns 'us-ascii' when running with F5 through the debugger where I see the issue, and when I run using F8 where this does not happen. It also returns the same values on the x64 and the x86. I included the full output below.

IsSingleByte      : True
BodyName          : us-ascii
EncodingName      : US-ASCII
HeaderName        : us-ascii
WebName           : us-ascii
WindowsCodePage   : 1252
IsBrowserDisplay  : False
IsBrowserSave     : False
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.EncoderReplacementFallback
DecoderFallback   : System.Text.DecoderReplacementFallback
IsReadOnly        : True
CodePage          : 20127

Have you set VSCode's "files.encoding" option? If so, what to? I have not set the VCCode's file.encoding option.
If you save the script on the filesystem and look at the encoding, what is that encoding? UTF-8 on all systems.

Please let me know if I can provide any additional information or testing.

rjmholt commented 6 years ago

Ah, I've worked it out. It's the summation character, not capital sigma (I know you said that, but my mind oversimplified 😄).

In UTF-8 that's encoded as 0xE2 0x88 0x91, which corresponds to CP1252 characters âˆ‘.

So as in other scenarios I've seen, the copied glyph is saved as UTF-8 and then PowerShell itself is seeing the bytes as CP1252, causing this problem.

In your scenario, it might be worth trying to set the integrated console's default encoding to UTF-8. I think this should work:

[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$PSDefaultParameterValues['*:Encoding'] = 'utf8'

But @mklement0 might have better advice there.

@tylerl0706 I'm thinking we should look into how to make our hosted PowerShell environment default to UTF-8 encoding... I think that might be the issue plaguing EditorServices here

rjmholt commented 6 years ago

Naturally I realise now that @mklement0 was way ahead of me here. But anyway...

Yeah, despite the rest of the work going for BOM-less UTF-8, I guess our default for Windows PowerShell should be UTF-8-with-BOM, and for PS Core we should make an informed decision...

mklement0 commented 6 years ago

@mdowst

If I highlight the text and run it using F8, it displays the characters correctly.

Presumably, that is because it is an in-memory operation based on strings rather than script files using a specific encoding.

when I press F5 to run the script through the debugger it translates the symbol like this, "âˆ‘"

As explained, this happens when the file is UTF-8-encoded but lacks a BOM: You can tell by the status bar in VSCode stating just UTF-8; by contrast, a file with a BOM would state UTF-8 with BOM. If you click on UTF-8, choose Save with Encoding, then select UTF-8 with BOM and re-execute your script, the problem should go away.

@rjmholt:

This is purely a PowerShell engine issue: It is about what character encoding it assumes when it reads a *.ps1 file that lacks a BOM: Windows PowerShell assumes "ANSI", PowerShell Core assumes UTF-8.

Output settings such as $OutputEncoding and [console]::outputEncoding do not matter here.

I guess our default for Windows PowerShell should be UTF-8-with-BOM

Yes, the extension should be configured to default to encoding utf8bom for *.ps1 files, but note that this will only help for new files - existing UTF-8-encoded files without a BOM will continue to be misinterpreted. https://github.com/Microsoft/vscode/issues/19890#issuecomment-329054924 seems to show how to configure this for a given VSCode extension.

PS Core we should make an informed decision...

Defaulting to utf8bom for PS Core as well would make scripts more portable (and bypasses the need to vary the default encoding dynamically), because PS Core reads such files correctly too.

That said, the presence of a BOM on Unix platforms can cause problems when external tools process such files - not sure if that's a real-world concern.

mklement0 commented 6 years ago

And just to clarify: [console]::OutputEncoding / $OutputEncoding are not relevant only with respect to printing to the console.

They do matter when piping data to and from external programs.

To recap from https://github.com/PowerShell/PowerShell/issues/3819#issuecomment-302943793, here's the command needed to make a console fully UTF-8 aware (on Windows):

$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

As an aside: In PowerShell Core, this shouldn't be necessary, but currently (v6.1.0-preview.3) still is on Windows: see https://github.com/PowerShell/PowerShell/issues/7233

rjmholt commented 6 years ago

Just for reference, here are other issues that I think have the same root cause as this one:

@mklement0 as you say, this is an issue with Windows PowerShell and there's no simple way to get around it.

But, as an extension, we should do our best to handle this, or at least pad around it where we can. Spitballing some things we could try doing to improve the situation:

Set the various encoding variables in PowerShell when the powershell process is started or the EditorServrices module is loaded
- Rather than break users, have a prompt explaining the problem that can be "OK"'d to set this
Set VSCode to encode files as UTF-8-BOM when using the PowerShell extension
- Could use a prompt here too
Introduce an encode/decode layer, so that EditorServices handles everything in a single canonical encoding

@tylerl0706, @rkeithhill, @SeeminglyScience, @mklement0 any other ideas here?

Halkcyon commented 6 years ago

@rjmholt

Set VSCode to encode files as UTF-8-BOM when using the PowerShell extension

If that's the direction, I'd prefer a preference key. If Core is identifying encoding properly, it could cause unnecessary overhead there.

rjmholt commented 6 years ago

To be honest, it's not really my true preference. But managing this issue is tricky, since it's clearly behaving pretty badly (not so much in this issue as in others). Hopefully we can open up the discussion a bit, as well as work out where exactly we need to deal with this problem.

mklement0 commented 6 years ago

@TheIncorrigible1:

My guess is that the performance impact (of looking for a BOM and selecting the encoding based on that, assuming that's what you meant) is negligible, but I've since noticed that even just using UTF8-with-BOM files with other editors on Unix platforms is problematic:

gedit and some versions of emacs treat the BOM as data and include it (invisibly) as part of the edit buffer. Only vim seems to be UTF8 BOM-aware.

For that reason alone I now think we should not use a BOM by default when we create new files for PowerShell Core.

For Windows PowerShell, however, we should.

mklement0 commented 6 years ago

@rjmholt: Here are my thoughts on what the extension should and shouldn't do:

As for the default encoding when creating a new PowerShell file:
- As stated, for Windows PowerShell it should be UTF-8 with BOM, for PowerShell Core it should be BOM-less UTF-8.
- Challenges:
  - Technical: Not sure if the mechanism via the settings mentioned above is flexible enough to allow determining the encoding dynamically, based on another setting, namely what PS edition is currently being targeted (powershell.powerShellExePath)
  - UX: Users need to be aware of the differing defaults based on what edition is being targeted, and also need to be aware of what edition is currently being targeted.
  - Development: Someone wanting to write a universal (cross-platform and cross-edition) code, must either use UTF-8 with BOM or use only characters in the 7-bit ASCII range. Note that use of the u{<hex-code-point>} escapes to avoid non-ASCII chars. is not an option, because Windows PowerShell doesn't understand them.
As for the integrated-console experience:
- I think we should replicate the experience that you get by default in the console (terminal) for the currently selected edition with respect to $OutputEncoding and [console]::InputEncoding / [console]::OutputEncoding
  - Users should be instructed to modify the (extension-specific) $PROFILE if they want to change the defaults persistently.
  - As stated, the console defaults for PS Core on Windows need fixing (https://github.com/PowerShell/PowerShell/issues/7233) and the extension should replicate the fixed experience.

Let me know if that makes sense and/or if I missed something.

arundeoy commented 1 year ago

here you can follow the following command to avoid special character doesn't require encoding. The below command is tested for the powershell and bash it solves my issue with special character.

$ git tag -f −a <tag_name> <commit_hash> −m <tag_message>

I even tried encoding but at the end it just work for terminal but not at the core level. @mdowst I hope it may helps

PowerShell / vscode-powershell