mdowst opened this issue 6 years ago
Hi @mdowst, thanks for opening an issue!
We've been aware of an ongoing encoding issue, but it's been tricky to work out where the problem lies. Hopefully this issue will shed some more light.
I have a couple of questions if that's ok:
- How did you write/create the script and the ∑ in it originally?
- Do you know if the ∑ you used was encoded in UTF-8, UTF-16LE or UTF-16BE? Last time there was an encoding issue, I was able to work out the original encoding from the symbols that came out, but assuming ∑ is a CP1252 rendering of some other codepoints, I'm not sure how that appears from ∑.
- What does $OutputEncoding return when entered in the integrated console? (@mklement0 has a really interesting StackOverflow answer about PowerShell's encoding handling, which may have more ideas for us)
- Have you set VSCode's "files.encoding" option? If so, what to?
- If you save the script on the filesystem and look at the encoding, what is that encoding?
To add to @rjmholt's helpful tips:
The likeliest explanation is that your *.ps1 file is saved as UTF-8 without a BOM, which means that Windows PowerShell (unlike PS Core) will interpret it as "ANSI"-encoded and therefore misinterpret it.
The solution is to always use UTF-8 with BOM as the character encoding, as both Windows PowerShell and PowerShell Core interpret that correctly.
The tricky part is that modern editors such as Visual Studio Code create BOM-less UTF-8 files by default, so you have to remember to explicitly change the encoding to UTF-8 with BOM.
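If you'd rather fix an affected file programmatically than through the editor UI, here is a minimal sketch (the path is illustrative, and it assumes the file's current content is valid UTF-8):
# Re-save an existing script as UTF-8 *with* BOM so Windows PowerShell reads it correctly.
$utf8WithBom = New-Object System.Text.UTF8Encoding $true    # $true = emit a BOM
$text = [IO.File]::ReadAllText("$PWD/t.ps1")                # BOM-less input is read as UTF-8 here
[IO.File]::WriteAllText("$PWD/t.ps1", $text, $utf8WithBom)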
@rjmholt:
If what I suspect is the true source of the problem, the behavior is not a bug, but by design.
In Windows PowerShell you've always had to either use "ANSI"-encoded source code files (for characters in the system-locale extended-ASCII range only) or one of the standard Unicode encodings with BOM in order for non-ASCII characters in string literals to be recognized correctly.
You can reproduce the problem as follows:
# Note: [IO.File]::WriteAllText() writes UTF-8 files *without BOM* by default.
WinPS> [IO.File]::WriteAllText("$PWD/t.ps1", '"Greek sigma symbol: ∑"'); ./t.ps1
Greek sigma symbol: ∑
The 3 bytes that make up the UTF-8-encoded ∑ character are misinterpreted as three individual "ANSI"-encoded characters.
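You can confirm this at the byte level (a sketch, reusing the t.ps1 file from the repro above):
# Dump the file's bytes as hex: there is no EF BB BF BOM at the start,
# and the summation sign shows up as the three bytes E2 88 91.
[IO.File]::ReadAllBytes("$PWD/t.ps1") | ForEach-Object { '{0:X2}' -f $_ }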
@rjmholt, thanks for your reply. I have provided the answers to your questions in-line below.
How did you write/create the script and the ∑ in it originally? Honestly, there is no particular reason I am using the ∑ symbol. I discovered this bug when I was writing another script that downloaded a file from Azure blob storage, parsed through some of the lines, then output them to another file. I noticed that some characters in the file were giving these strange results. So I wrote a quick function to convert each character to its ASCII value, and I noticed that the characters giving the strange translation did not have an ASCII value. To test my theory I grabbed the ∑ symbol, because I knew for a fact it was not an ASCII character. The original files contain company-sensitive data, but if you need to see it, I may be able to extract the lines, or parts of them, with the non-ASCII characters.
Do you know if the ∑ you used was encoded in UTF-8, UTF-16LE or UTF-16BE? Last time there was an encoding issue, I was able to work out the original encoding from the symbols that came out, but assuming ∑ is a CP1252 rendering of some other codepoints, I'm not sure how that appears from ∑. I'm sorry, I do not know the encoding. I literally copied it from the Wikipedia article https://en.wikipedia.org/wiki/Summation, because I knew it was a non-ASCII character.
What does $OutputEncoding return when entered in the integrated console? (@mklement0 has a really interesting StackOverflow answer about PowerShell's encoding handling, which may have more ideas for us) It returns 'us-ascii' both when running with F5 through the debugger, where I see the issue, and when I run using F8, where the issue does not occur. It also returns the same values on x64 and x86. I included the full output below.
IsSingleByte : True
BodyName : us-ascii
EncodingName : US-ASCII
HeaderName : us-ascii
WebName : us-ascii
WindowsCodePage : 1252
IsBrowserDisplay : False
IsBrowserSave : False
IsMailNewsDisplay : True
IsMailNewsSave : True
EncoderFallback : System.Text.EncoderReplacementFallback
DecoderFallback : System.Text.DecoderReplacementFallback
IsReadOnly : True
CodePage : 20127
Have you set VSCode's "files.encoding" option? If so, what to? I have not set VSCode's files.encoding option.
If you save the script on the filesystem and look at the encoding, what is that encoding? UTF-8 on all systems.
Please let me know if I can provide any additional information or testing.
Ah, I've worked it out. It's the summation character, not capital sigma (I know you said that, but my mind oversimplified 😄).
In UTF-8 that's encoded as 0xE2 0x88 0x91, which corresponds to the CP1252 characters ∑.
So as in other scenarios I've seen, the copied glyph is saved as UTF-8 and then PowerShell itself is seeing the bytes as CP1252, causing this problem.
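You can verify that mapping directly (a quick sketch):
# Encode U+2211 as UTF-8, then decode those same bytes as Windows-1252 (CP1252).
$bytes = [Text.Encoding]::UTF8.GetBytes([string][char]0x2211)   # 0xE2 0x88 0x91
[Text.Encoding]::GetEncoding(1252).GetString($bytes)            # -> ∑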
In your scenario, it might be worth trying to set the integrated console's default encoding to UTF-8. I think this should work:
[Console]::OutputEncoding = [Text.UTF8Encoding]::UTF8
$PSDefaultParameterValues['*:Encoding'] = 'utf8'
But @mklement0 might have better advice there.
@tylerl0706 I'm thinking we should look into how to make our hosted PowerShell environment default to UTF-8 encoding... I think that might be the issue plaguing EditorServices here
Naturally I realise now that @mklement0 was way ahead of me here. But anyway...
Yeah, despite the rest of the work going for BOM-less UTF-8, I guess our default for Windows PowerShell should be UTF-8-with-BOM, and for PS Core we should make an informed decision...
@mdowst
If I highlight the text and run it using F8, it displays the characters correctly.
Presumably, that is because it is an in-memory operation based on strings rather than script files using a specific encoding.
when I press F5 to run the script through the debugger it translates the symbol like this, "∑"
As explained, this happens when the file is UTF-8-encoded but lacks a BOM: you can tell by the status bar in VSCode stating just UTF-8; by contrast, a file with a BOM would state UTF-8 with BOM. If you click on UTF-8, choose Save with Encoding, then select UTF-8 with BOM and re-execute your script, the problem should go away.
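If you want to check for the BOM programmatically rather than via the status bar, something like this works (a sketch; the path is illustrative):
# A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
$firstBytes = [IO.File]::ReadAllBytes("$PWD/t.ps1")[0..2]
($firstBytes -join ',') -eq ([byte[]](0xEF, 0xBB, 0xBF) -join ',')   # True if the file has a BOM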
@rjmholt:
This is purely a PowerShell engine issue: it is about what character encoding the engine assumes when it reads a *.ps1 file that lacks a BOM: Windows PowerShell assumes "ANSI", PowerShell Core assumes UTF-8.
Output settings such as $OutputEncoding and [console]::OutputEncoding do not matter here.
I guess our default for Windows PowerShell should be UTF-8-with-BOM
Yes, the extension should be configured to default to encoding utf8bom for *.ps1 files, but note that this will only help for new files - existing UTF-8-encoded files without a BOM will continue to be misinterpreted.
https://github.com/Microsoft/vscode/issues/19890#issuecomment-329054924 seems to show how to configure this for a given VSCode extension.
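Based on that mechanism, the language-specific setting might look something like this in VSCode's settings.json (a hedged sketch - I haven't verified how the extension would register it):
"[powershell]": {
    "files.encoding": "utf8bom"
}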
for PS Core we should make an informed decision...
Defaulting to utf8bom for PS Core as well would make scripts more portable (and bypass the need to vary the default encoding dynamically), because PS Core reads such files correctly too.
That said, the presence of a BOM on Unix platforms can cause problems when external tools process such files - not sure if that's a real-world concern.
And just to clarify: [console]::OutputEncoding / $OutputEncoding are relevant not only to printing to the console.
They also matter when piping data to and from external programs.
To recap from https://github.com/PowerShell/PowerShell/issues/3819#issuecomment-302943793, here's the command needed to make a console fully UTF-8 aware (on Windows):
$OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding
As an aside: In PowerShell Core, this shouldn't be necessary, but currently (v6.1.0-preview.3) still is on Windows: see https://github.com/PowerShell/PowerShell/issues/7233
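To illustrate the piping aspect (a sketch; findstr.exe is used only as a convenient external program that echoes matching stdin lines back):
# $OutputEncoding controls the bytes PowerShell sends *to* external programs;
# [console]::OutputEncoding controls how it decodes what they print back.
$str = [string][char]0x2211        # the summation sign
($str | findstr "^") -eq $str      # True only once both directions use UTF-8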
Just for reference, here are other issues that I think have the same root cause as this one:
@mklement0 as you say, this is an issue with Windows PowerShell and there's no simple way to get around it.
But, as an extension, we should do our best to handle this, or at least pad around it where we can. Spitballing some things we could try doing to improve the situation:
@tylerl0706, @rkeithhill, @SeeminglyScience, @mklement0 any other ideas here?
@rjmholt
Set VSCode to encode files as UTF-8-BOM when using the PowerShell extension
If that's the direction, I'd prefer a preference key. If Core is identifying encoding properly, it could cause unnecessary overhead there.
To be honest, it's not really my true preference. But managing this issue is tricky, since it's clearly behaving pretty badly (not so much in this issue as in others). Hopefully we can open up the discussion a bit, as well as work out where exactly we need to deal with this problem.
@TheIncorrigible1:
My guess is that the performance impact (of looking for a BOM and selecting the encoding based on that, assuming that's what you meant) is negligible, but I've since noticed that even just using UTF8-with-BOM files with other editors on Unix platforms is problematic:
gedit and some versions of emacs treat the BOM as data and include it (invisibly) as part of the edit buffer. Only vim seems to be UTF-8-BOM-aware.
For that reason alone I now think we should not use a BOM by default when we create new files for PowerShell Core.
For Windows PowerShell, however, we should.
@rjmholt: Here are my thoughts on what the extension should and shouldn't do:
As for the default encoding when creating a new PowerShell file:
As stated, for Windows PowerShell it should be UTF-8 with BOM, for PowerShell Core it should be BOM-less UTF-8.
Challenges:
Technical: Not sure if the mechanism via the settings mentioned above is flexible enough to allow determining the encoding dynamically, based on another setting, namely what PS edition is currently being targeted (powershell.powerShellExePath).
UX: Users need to be aware of the differing defaults based on what edition is being targeted, and also need to be aware of what edition is currently being targeted.
Development: Someone wanting to write universal (cross-platform and cross-edition) code must either use UTF-8 with BOM or use only characters in the 7-bit ASCII range. Note that using `u{<hex-code-point>} escapes to avoid non-ASCII characters is not an option, because Windows PowerShell doesn't understand them (see the sketch below).
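For reference, the escape works like this in PowerShell Core (a sketch):
# Parses in PowerShell Core only; Windows PowerShell reports a syntax error.
"`u{2211}"   # -> the summation sign, U+2211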
As for the integrated-console experience: the extension could make the integrated console UTF-8-aware by default, i.e., set $OutputEncoding and [console]::InputEncoding / [console]::OutputEncoding to UTF-8 on startup; users can then use their $PROFILE if they want to change the defaults persistently.
Let me know if that makes sense and/or if I missed something.
You can use the following command to avoid the special-character encoding issue. The command below is tested in both PowerShell and Bash, and it solved my issue with special characters.
$ git tag -f -a <tag_name> <commit_hash> -m <tag_message>
I also tried changing the encoding, but in the end it only worked at the terminal level, not at the core level. @mdowst I hope this helps.
System Details
$PSVersionTable:
PSVersion                  : 5.0.10586.117
PSCompatibleVersions       : 1.0, 2.0, 3.0, 4.0, 5.0.10586.117
BuildVersion               : 10.0.10586.117
CLRVersion                 : 4.0.30319.42000
WSManStackVersion          : 3.0
PSRemotingProtocolVersion  : 2.3
SerializationVersion       : 1.1.0.1
Issue Description
I've experienced an issue with the way the debugger handles non-ASCII characters. If I create a script with a Unicode/UTF-8 character in it, for example the sigma symbol "∑", then when I press F5 to run the script through the debugger, it translates the symbol like this: "∑". If I highlight the text and run it using F8, it displays the characters correctly.
I've tested this on 3 different machines: one Windows Server 2012 R2, which I included the system details for here. I also tested it on a Windows Server 2016 with the same versions of VS Code and the PowerShell extension, and I saw the same results. However, I also tested it on Windows 10 1709, again with the same versions, and it did not have this issue. The only difference between the systems is that the Windows 10 system listed the architecture as ia32 and the two servers are x64. Also, the Windows 10 system is on PowerShell version 5.1.15063.1088 and the 2016 is on version 5.1.14393.2248.
Here is an example of the code I am running
Attached Logs
logs.zip