PoshCode / ModuleBuilder

A PowerShell Module to help scripters write, version, sign, package, and publish.

Make Build-Module psm1 encoding to UTF8NoBOM on PS5.1 #83

Closed: gaelcolas closed this issue 4 years ago

gaelcolas commented 4 years ago

Hi @Jaykul, the module creates the PSM1 as UTF8 no BOM in PS6+ but as UTF8 (with BOM) in WinPS 5.1. Could we update the code here: https://github.com/PoshCode/ModuleBuilder/blob/master/Source/Private/SetModuleContent.ps1#L38-L40

so that we can set the file content without the BOM in WinPS 5.1? I guess you have some specific requirements here for perf, so what approach do you think would work best?
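
(For context: Windows PowerShell's cmdlets have no UTF8NoBOM option, but the raw .NET file APIs can write it on any edition. A minimal sketch, assuming the module content has already been assembled into a single string variable $moduleContent:)

```powershell
# UTF8Encoding with $false writes UTF-8 *without* a BOM; this behaves
# identically on WinPS 5.1 and PS 6+, sidestepping the cmdlet encoding tables.
$noBom = [System.Text.UTF8Encoding]::new($false)
[System.IO.File]::WriteAllText("C:\temp\MyModule.psm1", $moduleContent, $noBom)
```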

jborean93 commented 4 years ago

Sounds like the BOM is being set on Windows PS because, without a BOM, WinPS reads a ps1 file in ASCII encoding, which breaks non-ASCII chars.

jborean93 commented 4 years ago

Just to make sure I'm clear about what I mean: this is what happens when you deal with non-ASCII chars in a script that has been written with UTF-8 encoding.

Run the following to create two PowerShell scripts that are encoded with UTF-8, one with and one without the UTF-8 BOM. The scripts themselves just contain Write-Host é, where é in UTF-8 is the bytes 0xC3 0xA9.

```powershell
# Convert from the UTF-32 codepoint to ensure the console cp does not mess up the input for testing.
$script = "Write-Host {0}" -f ([Char]::ConvertFromUtf32(0x000000e9))
$utf8_bytes = [System.Text.Encoding]::UTF8.GetBytes($script)

# Create a file with and without a BOM; use raw bytes due to differences in Set-Content between ps versions
$utf8_bytes | Set-Content -Path C:\temp\utf8-no-bom.ps1 -Encoding Byte
[System.Text.Encoding]::UTF8.GetPreamble() + $utf8_bytes | Set-Content -Path C:\temp\utf8-with-bom.ps1 -Encoding Byte

# Output the raw bytes to prove a BOM exists (the with-bom file will have 239, 187, 191 at the start)
Get-Content -Path C:\temp\utf8-no-bom.ps1 -Encoding Byte
Get-Content -Path C:\temp\utf8-with-bom.ps1 -Encoding Byte
```
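
(Note: -Encoding Byte exists only in Windows PowerShell; to run the same repro on PS 6+, the raw-byte switch is -AsByteStream instead, e.g.:)

```powershell
# PS 6+ equivalent of '-Encoding Byte', which was removed in PowerShell Core
$utf8_bytes | Set-Content -Path C:\temp\utf8-no-bom.ps1 -AsByteStream
```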

Here is what happens when you run the scripts in WinPS:

```
PS C:\Users\vagrant> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      5.1.17763.771
PSEdition                      Desktop
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0...}
BuildVersion                   10.0.17763.771
CLRVersion                     4.0.30319.42000
WSManStackVersion              3.0
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1

PS C:\Users\vagrant> C:\temp\utf8-no-bom.ps1
é
PS C:\Users\vagrant> C:\temp\utf8-with-bom.ps1
é
```

We can see that without a BOM, the file is read with the encoding of the system's ANSI codepage, which in my case is windows-1252. In that encoding the bytes 0xC3 0xA9 become é. WinPS only reads the file with UTF-8 encoding if the BOM is present; otherwise it uses the default encoding of the host.
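
(The BOM sniffing is easy to check by hand; a small sketch using the files created above:)

```powershell
# Compare the first three bytes against the UTF-8 BOM (0xEF 0xBB 0xBF)
$bytes = Get-Content -Path C:\temp\utf8-with-bom.ps1 -Encoding Byte -TotalCount 3
($bytes[0] -eq 0xEF) -and ($bytes[1] -eq 0xBB) -and ($bytes[2] -eq 0xBF)  # True here, False for the no-bom file
```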

Now, when you run it in PSCore:

```
PS C:\Users\vagrant> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.0.0-preview.6
PSEdition                      Core
GitCommitId                    7.0.0-preview.6
OS                             Microsoft Windows 10.0.17763
Platform                       Win32NT
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

PS C:\Users\vagrant> C:\temp\utf8-no-bom.ps1
é
PS C:\Users\vagrant> C:\temp\utf8-with-bom.ps1
é
```

We can see that in both cases the file is read with UTF-8 encoding, even without the BOM.

What this means in the context of ModuleBuilder I'm not sure. To me it sounds like we should always write the psm1 with the BOM, so that if it is built on PSCore with non-ASCII chars it can still be read correctly on WinPS, but maybe I am missing something here.

Jaykul commented 4 years ago

I basically agree with that. It's clearly safest to always write the UTF-8 BOM ... and I've tested: this is a problem when setting the content of a module with UTF8NoBom in PS6 and importing it in Windows PowerShell.

On top of that, there is no "UTF8NoBOM" encoding for Windows PowerShell, so there's no way to avoid the BOM with the cmdlets in Windows PowerShell -- we would have to resort to using raw .NET APIs ...

Probably the best option is to change this to always default to UTF8 for consistency: https://github.com/PoshCode/ModuleBuilder/blob/92b7253f0c00b1cb5ef914016946ce76d2edab49/Source/Private/SetModuleContent.ps1#L31

Jaykul commented 4 years ago

So, after doing a little testing, it's clear to me that this problem in Windows PowerShell (using ascii by default, instead of UTF8) combined with the decision in PowerShell Core that "UTF8" now means "UTF8NoBOM" by default ... means that you really have no choice but to build in Windows PowerShell @gaelcolas 😉

That is, currently, if you use special characters in your code and build the module on Windows PowerShell, it will work everywhere; but if you build it on PowerShell 6+, it will only work on PowerShell 6+, because PS 6+ leaves off the BOM when we call Set-Content, producing a module that doesn't import correctly in Windows PowerShell.
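
(To illustrate the asymmetry, here is the same Set-Content call on each edition, with a hypothetical test path:)

```powershell
# WinPS 5.1: 'UTF8' is the only UTF-8 choice, and it always writes a BOM
Set-Content -Path .\with-bom.ps1 -Value 'Write-Host é' -Encoding UTF8

# PS 6+: 'UTF8' now means *no* BOM; the BOM has to be requested explicitly
Set-Content -Path .\with-bom.ps1 -Value 'Write-Host é' -Encoding UTF8BOM
```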

These kinds of changes are just so frustrating to me. I understand and almost agree with the decision to add pseudo encodings with "BOM" and "NoBom" -- but I don't understand how people decide to change the meaning of UTF8 so we get different (incompatible) results.

I think this means that we need to do the opposite of what I suggested above!

We need to change the outer, public function:

https://github.com/PoshCode/ModuleBuilder/blob/92b7253f0c00b1cb5ef914016946ce76d2edab49/Source/Public/Build-Module.ps1#L111-L114

First, we should remove (or change) the ValidateSet, and then we should set the default such that modules built with ModuleBuilder work everywhere by default, meaning we explicitly add the BOM on PowerShell 6+:

```powershell
[string]$Encoding = $(if($IsCoreCLR) { "UTF8Bom" } else { "UTF8" })
```
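
(Note: $IsCoreCLR is an automatic variable that only exists on PowerShell 6+; on Windows PowerShell it is undefined and so evaluates as $false, which conveniently selects the else branch there.)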

If we want to pursue UTF8NoBom on Windows PowerShell, we would need to rewrite SetModuleContent based on the implementation of SetContentCommand in PS7. Right now, SetModuleContent is just a proxy function for Set-Content, so we can't pass those pseudo encodings that aren't supported. Our function pretends to pipe the content of each of the files passed in SourceFile (it does a Get-Content on them without specifying a source encoding) and then calls .Process on the Set-Content proxy.

In order to support the pseudo encodings, we would need to manually open and write to a TextStream the way SetContentCommand does. I don't think that would negatively impact performance (it might even be faster than the proxy function), so I don't object to the idea, but I think we should put in a quick fix with the default encoding.
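
(A rough sketch of what that rewrite could look like, under a hypothetical helper name: resolve the pseudo-encoding names to real System.Text encodings yourself, then stream every source file through a writer, roughly the way SetContentCommand does:)

```powershell
# Hypothetical sketch: map pseudo-encoding names to real Encoding objects,
# then stream each source file into the target psm1.
function WriteModuleContent {
    param([string]$Path, [string[]]$SourceFile, [string]$Encoding = "UTF8Bom")

    $enc = switch ($Encoding) {
        "UTF8Bom"   { [System.Text.UTF8Encoding]::new($true) }   # with BOM
        "UTF8NoBom" { [System.Text.UTF8Encoding]::new($false) }  # without BOM
        default     { [System.Text.Encoding]::GetEncoding($Encoding) }
    }

    $writer = [System.IO.StreamWriter]::new($Path, $false, $enc)
    try {
        foreach ($file in $SourceFile) {
            # Read the sources as UTF-8 explicitly so both editions agree
            foreach ($line in [System.IO.File]::ReadLines($file, [System.Text.Encoding]::UTF8)) {
                $writer.WriteLine($line)
            }
        }
    } finally {
        $writer.Dispose()
    }
}
```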

@gaelcolas does that make sense?

jborean93 commented 4 years ago

> using ascii by default, instead of UTF8

It's arguably even worse than that: strict ASCII has only 128 code points, as it is a 7-bit character encoding. What WinPS actually does, in the case where a BOM is not set, is use the encoding set by the system locale, i.e. the "Language for non-Unicode programs" setting in Windows. This is an encoding that falls under the Extended-ASCII umbrella, where the first 128 chars are the same as ASCII but the next 128 chars (8th bit) are specific to the encoding in question. Using my example of Write-Host é, the character é is now read as é instead of é on an English US setting. So not only is the character different from what the author intended, but the actual output depends on the OS configuration. As an FYI, Get-WinSystemLocale gets the value for this.
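
(The locale dependence is easy to demonstrate by decoding the same two bytes under a few different codepages; a sketch, run here under WinPS where the legacy codepages are always available:)

```powershell
$bytes = [byte[]](0xC3, 0xA9)                                 # UTF-8 for 'é'
[System.Text.Encoding]::GetEncoding(1252).GetString($bytes)   # é  (Western European)
[System.Text.Encoding]::GetEncoding(1251).GetString($bytes)   # Г©  (Cyrillic)
[System.Text.Encoding]::UTF8.GetString($bytes)                # é
```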

> but I don't understand how people decide to change the meaning of UTF8 so we get different (incompatible) results.

This is because the recommendation in the Unicode spec is SHOULD NOT (not MUST NOT) regarding use of a BOM with UTF-8:

> ... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

Unicode 5.0 - 2.6 - Encoding Schemes

This has been the prevalent thinking on Unix for years, and having a BOM in UTF-8 may cause trouble for some applications that don't expect it (although I haven't seen this myself). For PSCore it shouldn't matter, as it will read a file with the BOM just fine. On Windows it is another matter: historically a BOM is used for UTF-8 encoded files because Microsoft applications that were not explicitly Unicode-aware read files in the system's locale encoding, and only the BOM changes that.

Personally I agree that we should just create the file with a UTF-8 BOM as it will work on all platforms and PS versions for now. Once WinPS support is dropped we should go back to just using UTF-8 without the BOM.

gaelcolas commented 4 years ago

Thanks for all the analysis and explanation. It does make sense.

The actual impact for the DSC Community is minimal, as we've enforced UTF8 no BOM for a long time, along with localization files, which are not affected. (We seem to remember something breaking when there's a BOM, but can't remember what off the top of our heads.)

The change in behaviour in #86 is a breaking change, so please remember to increment the Major field.

If I get this right, it's a problem if our code has Unicode chars (e.g. Write-Host ééééé)?

Do you see any other potential impact?

Jaykul commented 4 years ago

Why do you think it's a breaking change to switch to consistently adding the BOM?

Currently, if you have non-ASCII characters in your module and you build on PowerShell Core, your module won't load correctly on Windows PowerShell. To be clear: it might not break (e.g. weirdly named variables, internal functions, comments, and documentation might not affect functionality), but it will be read incorrectly by the shell, and could produce incorrect output or even fail to import, depending on what the characters are, where they are, and how they're misinterpreted.

The bottom line is that currently, we get different output from builds on PowerShell Core than on Windows PowerShell. But the modules built by Windows PowerShell aren't "broken", so I don't see how making them all be consistent would be considered a breaking change.

NOTE: The BOM itself is not a problem as far as I know -- except in .ps1 scripts with shebangs, where it interferes with parsing of the shebang. UTF-16 encoding is a problem in many places (e.g. in git), but we're not switching encodings, just adding a BOM (and you're not expected to have the output in source control anyway).
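
(The shebang case: the kernel matches the literal first two bytes "#!" of the file, so three BOM bytes in front of them break script execution. A sketch of how such a broken file would be produced, with a hypothetical path:)

```powershell
# The file now starts EF BB BF 23 21 ... rather than 23 21 ("#!"),
# so a Unix kernel no longer recognizes the shebang line.
$text = "#!/usr/bin/env pwsh`nWrite-Host 'hello'`n"
$bytes = [System.Text.Encoding]::UTF8.GetPreamble() + [System.Text.Encoding]::UTF8.GetBytes($text)
[System.IO.File]::WriteAllBytes("/tmp/broken-shebang.ps1", $bytes)
```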

P.S. @jborean93 No matter how many times Unix fans repeat it, the Unicode UTF spec does not say "should not" -- what it says is:

> Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature.

In the context of PowerShell modules, the data (code) is being converted from UTF-16, because that's how strings are encoded in .NET:

> .NET uses the UTF-16 encoding (represented by the UnicodeEncoding class) to represent characters and strings. Applications that target the common language runtime use encoders to map Unicode character representations supported by the common language runtime to other encoding schemes. They use decoders to map characters from non-Unicode encodings to Unicode.

jborean93 commented 4 years ago

Eh, the "not recommended" part of that quote to me says SHOULD NOT, and I wouldn't say I'm a Unix fan; I use whatever is convenient/good for the job at the time.

As for how .NET stores strings internally, I don't think that matters at all. It could be using UTF-32 or some other esoteric encoding for all we care. That is all internal to the CLR and effectively a black box from a user's perspective.

Jaykul commented 4 years ago

From the perspective of specifications, there's a big, big difference between "neither required nor recommended" and "should not":

This means don't do it (e.g. don't send a BOM over HTTP when you already sent a content header):

> If other signaling methods are used, signatures should not be employed.

But this means you don't need one to conform to the spec, and if you need one for some other reason, the spec doesn't care:

> Use of a BOM is neither required nor recommended for UTF-8 ...

Which is why the sentence ends with:

> ... or where the BOM is used as a UTF-8 signature.

Anyway. The reason that .NET's native encoding matters is because it became PowerShell's native formatting. If you run, for instance, New-ModuleManifest in PowerShell 5.1 you get a UTF-16 LE file, and PowerShell ISE always saves files as UTF-16. So, whenever we convert to another encoding form, you "may encounter" the use of a BOM 😉
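
(The manifest claim is easy to verify on 5.1; the file starts with the UTF-16 LE BOM, 0xFF 0xFE:)

```powershell
New-ModuleManifest -Path C:\temp\test.psd1
Get-Content -Path C:\temp\test.psd1 -Encoding Byte -TotalCount 2   # 255 254 = FF FE
```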

jborean93 commented 4 years ago

> The reason that .NET's native encoding matters is because it became PowerShell's native formatting. If you run, for instance, New-ModuleManifest in PowerShell 5.1 you get a UTF-16 LE file

That to me is an implementation detail; it was PowerShell that decided to make New-ModuleManifest output a file in UTF-16, and they decided to change that implementation for PSCore. By this logic you would also have to say File.WriteAllText or even Set-Content outputs UTF-16, because that's how .NET represents strings internally. That is not the case, and the encoding the CLR uses should not affect or matter at all to the end user. It's the method implementation that controls how the strings are represented in bytes when bytes are needed.

> If you run, for instance, New-ModuleManifest in PowerShell 5.1 you get a UTF-16 LE file, and PowerShell ISE always saves files as UTF-16

Are you sure? When I test it on Windows 10 with PS v5.1, ISE saves the script as UTF-8 with a BOM.

In any case this doesn't really matter anymore; it sounds like the solution is to continue creating files as UTF-8 with a BOM.

gaelcolas commented 4 years ago

No worries about the breaking change; we've enforced the encoding to UTF8 for now to keep it working the same way, and will update to UTF8NoBOM when it's supported on Windows PS.

bravo-kernel commented 4 years ago

Hi folks, somehow after upgrading to 1.5.2 my build phase is broken; it seems related.

Any pointers on what could be going wrong here?

gaelcolas commented 4 years ago

@bravo-kernel I think that's a different issue. I think we've seen that too, but we currently pin to 1.0.0 because of a semver issue (which is not ModuleBuilder's fault). If you open a new request I might tackle that, as I think we should change how the module manifest lookup is done. The current way also seems to fail on Linux for some reason...

bravo-kernel commented 4 years ago

@gaelcolas thanks for the quick confirm/reassurance, I will open a ticket in a moment 👍. Had to pin 1.0.0 as well to keep going.

Jaykul commented 4 years ago

The build log linked is gone, but yes, this is a totally different issue -- a regression related to #49.

It's not picking up "Path" as an alias for "SourcePath", and in fact it only works if you use "ModuleManifest" ...

We haven't caught it in our tests because the default value for Path is the name of the folder. Yours is named differently, but in all of mine, the default value is actually right.

bravo-kernel commented 4 years ago

Confirmed for my case, thanks!