PowerShell / vscode-powershell

Provides PowerShell language and debugging support for Visual Studio Code
https://marketplace.visualstudio.com/items/ms-vscode.PowerShell

Debugger Does Not Handle Unicode/UTF-8 Characters Properly #1392

Open mdowst opened 6 years ago

mdowst commented 6 years ago

System Details

VS Code: 1.24.1 (commit 24f62626b222e9a8313213fb64b10d741a326288, x64)

Major  Minor  Build  Revision
-----  -----  -----  --------
1      7      0      0
bierner.markdown-preview-github-styles@0.1.2
formulahendry.code-runner@0.9.3
ms-python.python@2018.6.0
ms-vscode.PowerShell@1.7.0
robertohuertasm.vscode-icons@7.24.0
Tyriar.shell-launcher@0.2.0

Name                      Value
----                      -----
PSVersion                 5.0.10586.117
PSCompatibleVersions      {1.0, 2.0, 3.0, 4.0...}
BuildVersion              10.0.10586.117
CLRVersion                4.0.30319.42000
WSManStackVersion         3.0
PSRemotingProtocolVersion 2.3
SerializationVersion      1.1.0.1

Issue Description

I've experienced an issue with the way the debugger handles non-ASCII characters. If I create a script containing a Unicode/UTF-8 character, for example the sigma symbol "∑", then when I press F5 to run the script through the debugger, it translates the symbol like this: "âˆ'". If I highlight the text and run it using F8, it displays the characters correctly.

I've tested this on three different machines. One was Windows Server 2012 R2, whose system details I included above. I also tested on Windows Server 2016 with the same versions of VS Code and the PowerShell extension and saw the same results. However, when I tested it on Windows 10 1709, again with the same versions, the issue did not occur. The only differences between the systems are that the Windows 10 machine lists its architecture as ia32 while the two servers are x64, and that the Windows 10 system runs PowerShell 5.1.15063.1088 while the 2016 server runs 5.1.14393.2248.

Here is an example of the code I am running:

$string = "Greek sigma symbol: ∑"
$string

Attached Logs

logs.zip

rjmholt commented 6 years ago

Hi @mdowst, thanks for opening an issue!

We've been aware of an ongoing encoding issue, but it's been tricky to work out where the problem lies. Hopefully this issue will shed some more light.

I have a couple of questions if that's ok:

mklement0 commented 6 years ago

To add to @rjmholt's helpful tips:

The likeliest explanation is that your *.ps1 file is saved as UTF-8 without a BOM, which means that Windows PowerShell (unlike PS Core) will interpret it as "ANSI"-encoded and therefore misinterpret it.

The solution is to always use UTF-8 with BOM as the character encoding, as both Windows PowerShell and PowerShell Core interpret that correctly.

The tricky part is that modern editors such as Visual Studio Code create BOM-less UTF-8 files by default, so you have to remember to explicitly change the encoding to UTF-8 with BOM.
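
As a quick way to check what you currently have, here's a minimal sketch (the t.ps1 path is just an illustration) that tests whether a file starts with the UTF-8 BOM:

# Check whether a script file starts with the UTF-8 BOM bytes 0xEF 0xBB 0xBF (path is illustrative)
$bytes = [IO.File]::ReadAllBytes("$PWD/t.ps1")
$bytes.Count -ge 3 -and $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF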

mklement0 commented 6 years ago

@rjmholt:

If what I suspect is the true source of the problem, the behavior is not a bug, but by design.

In Windows PowerShell you've always had to either use "ANSI"-encoded source code files (for characters in the system-locale extended-ASCII range only) or one of the standard Unicode encodings with BOM in order for non-ASCII characters in string literals to be recognized correctly.

You can reproduce the problem as follows:

# Note: [IO.File]::WriteAllText() writes UTF-8 files *without BOM* by default.
WinPS> [IO.File]::WriteAllText("$PWD/t.ps1", '"Greek sigma symbol: ∑"'); ./t.ps1
Greek sigma symbol: âˆ'

The 3 bytes that make up the UTF-8-encoded character are misinterpreted as individual "ANSI"-encoded characters.
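
For contrast, a minimal sketch: Windows PowerShell's own Set-Content -Encoding UTF8 writes UTF-8 *with* a BOM, so the same script round-trips correctly:

# Windows PowerShell's -Encoding UTF8 emits a BOM, so the literal is read back as intended
WinPS> Set-Content -Path ./t.ps1 -Value '"Greek sigma symbol: ∑"' -Encoding UTF8; ./t.ps1
Greek sigma symbol: ∑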

mdowst commented 6 years ago

@rjmholt, thanks for your reply. I have provided the answers to your questions in-line below.

Please let me know if I can provide any additional information or testing.

rjmholt commented 6 years ago

Ah, I've worked it out. It's the summation character, not capital sigma (I know you said that, but my mind oversimplified 😄).

In UTF-8 that's encoded as 0xE2 0x88 0x91, which corresponds to the CP1252 characters âˆ'.

So as in other scenarios I've seen, the copied glyph is saved as UTF-8 and then PowerShell itself is seeing the bytes as CP1252, causing this problem.
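
That mapping is easy to verify directly; a one-line sketch (Windows PowerShell, where the Windows-1252 code page is available):

# Decoding the summation sign's UTF-8 bytes as Windows-1252 reproduces the mojibake
[Text.Encoding]::GetEncoding(1252).GetString([byte[]](0xE2, 0x88, 0x91))   # -> âˆ'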

In your scenario, it might be worth trying to set the integrated console's default encoding to UTF-8. I think this should work:

# Make the console use UTF-8 for output, and default every cmdlet's -Encoding parameter to utf8
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$PSDefaultParameterValues['*:Encoding'] = 'utf8'

But @mklement0 might have better advice there.

@tylerl0706 I'm thinking we should look into how to make our hosted PowerShell environment default to UTF-8 encoding... I think that might be the issue plaguing EditorServices here

rjmholt commented 6 years ago

Naturally I realise now that @mklement0 was way ahead of me here. But anyway...

Yeah, despite the rest of the work going for BOM-less UTF-8, I guess our default for Windows PowerShell should be UTF-8-with-BOM, and for PS Core we should make an informed decision...

mklement0 commented 6 years ago

@mdowst

If I highlight the text and run it using F8, it displays the characters correctly.

Presumably, that is because it is an in-memory operation based on strings rather than script files using a specific encoding.

when I press F5 to run the script through the debugger, it translates the symbol like this: "âˆ'"

As explained, this happens when the file is UTF-8-encoded but lacks a BOM: You can tell by the status bar in VSCode stating just UTF-8; by contrast, a file with a BOM would state UTF-8 with BOM. If you click on UTF-8, choose Save with Encoding, then select UTF-8 with BOM and re-execute your script, the problem should go away.
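
If you'd rather fix existing files from PowerShell instead of through the UI, a rough sketch (path is illustrative; [IO.File]::ReadAllText() decodes BOM-less files as UTF-8 by default, so the text reads back correctly before being re-saved):

# Re-save an existing BOM-less UTF-8 script as UTF-8 with BOM
$path = "$PWD/t.ps1"
$text = [IO.File]::ReadAllText($path)
[IO.File]::WriteAllText($path, $text, [Text.UTF8Encoding]::new($true))   # $true => emit a BOM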

@rjmholt:

This is purely a PowerShell engine issue: It is about what character encoding it assumes when it reads a *.ps1 file that lacks a BOM: Windows PowerShell assumes "ANSI", PowerShell Core assumes UTF-8.

Output settings such as $OutputEncoding and [console]::outputEncoding do not matter here.

I guess our default for Windows PowerShell should be UTF-8-with-BOM

Yes, the extension should be configured to default to encoding utf8bom for *.ps1 files, but note that this will only help for new files - existing UTF-8-encoded files without a BOM will continue to be misinterpreted. https://github.com/Microsoft/vscode/issues/19890#issuecomment-329054924 seems to show how to configure this for a given VSCode extension.
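
For reference, a sketch of what that language-scoped setting might look like in settings.json (assuming VSCode applies files.encoding per language, as the linked comment indicates):

// settings.json
"[powershell]": {
    "files.encoding": "utf8bom"
}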

PS Core we should make an informed decision...

Defaulting to utf8bom for PS Core as well would make scripts more portable (and bypasses the need to vary the default encoding dynamically), because PS Core reads such files correctly too.

That said, the presence of a BOM on Unix platforms can cause problems when external tools process such files - not sure if that's a real-world concern.

mklement0 commented 6 years ago

And just to clarify: [console]::OutputEncoding / $OutputEncoding are not only relevant to printing to the console.

They do matter when piping data to and from external programs.

To recap from https://github.com/PowerShell/PowerShell/issues/3819#issuecomment-302943793, here's the command needed to make a console fully UTF-8 aware (on Windows):

# $OutputEncoding governs data piped *to* external programs; [console]::InputEncoding and
# [console]::OutputEncoding govern how the console's own input and output are interpreted
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

As an aside: In PowerShell Core, this shouldn't be necessary, but currently (v6.1.0-preview.3) still is on Windows: see https://github.com/PowerShell/PowerShell/issues/7233

rjmholt commented 6 years ago

Just for reference, here are other issues that I think have the same root cause as this one:

@mklement0 as you say, this is an issue with Windows PowerShell and there's no simple way to get around it.

But, as an extension, we should do our best to handle this, or at least pad around it where we can. Spitballing some things we could try doing to improve the situation:

@tylerl0706, @rkeithhill, @SeeminglyScience, @mklement0 any other ideas here?

Halkcyon commented 6 years ago

@rjmholt

Set VSCode to encode files as UTF-8-BOM when using the PowerShell extension

If that's the direction, I'd prefer a preference key. If Core is identifying encoding properly, it could cause unnecessary overhead there.

rjmholt commented 6 years ago

To be honest, it's not really my true preference. But managing this issue is tricky, since it's clearly behaving pretty badly (not so much in this issue as in others). Hopefully we can open up the discussion a bit, as well as work out where exactly we need to deal with this problem.

mklement0 commented 6 years ago

@TheIncorrigible1:

My guess is that the performance impact (of looking for a BOM and selecting the encoding based on that, assuming that's what you meant) is negligible, but I've since noticed that even just using UTF8-with-BOM files with other editors on Unix platforms is problematic:

gedit and some versions of emacs treat the BOM as data and include it (invisibly) as part of the edit buffer. Only vim seems to be UTF8 BOM-aware.

For that reason alone I now think we should not use a BOM by default when we create new files for PowerShell Core.

For Windows PowerShell, however, we should.

mklement0 commented 6 years ago

@rjmholt: Here are my thoughts on what the extension should and shouldn't do:

Let me know if that makes sense and/or if I missed something.

arundeoy commented 1 year ago

You can use the following command to avoid the special-character issue without requiring any encoding changes. The command below was tested in both PowerShell and bash, and it solved my issue with special characters.

$ git tag -f -a <tag_name> <commit_hash> -m <tag_message>

I even tried changing the encoding, but in the end it only worked for the terminal, not at the core level. @mdowst, I hope this helps.