PowerShell / PSScriptAnalyzer

Download ScriptAnalyzer from PowerShellGallery
https://www.powershellgallery.com/packages/PSScriptAnalyzer/
MIT License
1.87k stars 378 forks source link

-Fix adding UTF-8 BOM to files #1743

Open jborean93 opened 3 years ago

jborean93 commented 3 years ago

Steps to reproduce

Create a file with a fixable violation

Function Test-Function {
    param (
        [String]
        $Password
    )

    $Password
}

Run Invoke-ScriptAnalyzer -Path ./file.ps1 -Fix to fix the violation

Expected behavior

The violation is fixed but the BOM isn't added to the file. It should preserve whatever is there already.

Actual behavior

The violation is fixed but there is a UTF-8 BOM character added to the file

# Before
$ hexdump -C file.ps1

00000000  46 75 6e 63 74 69 6f 6e  20 54 65 73 74 2d 46 75  |Function Test-Fu|
00000010  6e 63 74 69 6f 6e 20 7b  0a 20 20 20 20 70 61 72  |nction {.    par|
00000020  61 6d 20 28 0a 20 20 20  20 20 20 20 20 5b 53 74  |am (.        [St|
00000030  72 69 6e 67 5d 0a 20 20  20 20 20 20 20 20 24 50  |ring].        $P|
00000040  61 73 73 77 6f 72 64 0a  20 20 20 20 29 0a 0a 20  |assword.    ).. |
00000050  20 20 20 24 50 61 73 73  77 6f 72 64 0a 7d 0a     |   $Password.}.|
0000005f

# After running -Fix
# hexdump -C file.ps1

00000000  ef bb bf 46 75 6e 63 74  69 6f 6e 20 54 65 73 74  |...Function Test|
00000010  2d 46 75 6e 63 74 69 6f  6e 20 7b 0a 20 20 20 20  |-Function {.    |
00000020  70 61 72 61 6d 20 28 0a  20 20 20 20 20 20 20 20  |param (.        |
00000030  5b 53 65 63 75 72 65 53  74 72 69 6e 67 5d 0a 20  |[SecureString]. |
00000040  20 20 20 20 20 20 20 24  50 61 73 73 77 6f 72 64  |       $Password|
00000050  0a 20 20 20 20 29 0a 0a  20 20 20 20 24 50 61 73  |.    )..    $Pas|
00000060  73 77 6f 72 64 0a 7d 0a                           |sword.}.|
00000068

Note the ef bb bf (UTF-8 BOM) present. I tried explicitly ignoring the rule UseBOMForUnicodeEncodedFile as well but the BOM is still added.

Environment data

> $PSVersionTable

Name                           Value
----                           -----
PSVersion                      7.2.0
PSEdition                      Core
GitCommitId                    7.2.0
OS                             Linux 5.14.16-201.fc34.x86_64 #1 SMP Wed Nov 3 13:57:29 UTC 2021
Platform                       Unix
PSCompatibleVersions           {1.0, 2.0, 3.0, 4.0…}
PSRemotingProtocolVersion      2.3
SerializationVersion           1.1.0.1
WSManStackVersion              3.0

> (Get-Module -ListAvailable PSScriptAnalyzer).Version | ForEach-Object { $_.ToString() }

1.20.0
ninmonkey commented 3 years ago

This likely won't do anything, but, what if you set the environment's encoding to non-BOM UTF8?

$global:OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = [System.Text.UTF8Encoding]::new()
StevenBucher98 commented 2 years ago

Thanks @jborean93, we were able to successful reproduce this issue and identify the bug. We appreciate you taking the time to reproduce this bug, we are unsure when we will be able to fix this issue.

bergmeister commented 2 years ago

Because -Fix reads the content of the file into a string, modifies it and then writes back to the string, the issue could either be in

This is a very tricky and time consuming one. Bear in mind that there is no standard for encoding and both .NET and editors apply heuristics to figure out the encoding of a file and therefore can get it wrong. My recommendation is to check the encoding in the diff editor before commiting files after using -Fix as this parameter is usually just used for one off fixes of a whole codebase.

jborean93 commented 2 years ago

I personally don't think the onus should be on the user to detect the BOM that was added. If the file doesn't have a BOM then the formatter shouldn't be adding one. I'm unsure what the level of support PSSA has for WinPS but considering the default of PowerShell since v6 is to read BOM-less files as UTF-8 then I personally believe PSSA should output as UTF-8 if the input didn't have a BOM.

Putting aside the issues with detecting the encoding, scripts that only contain ASCII characters won't be affected by reading the script as 1 encoding and writing it as UTF-8 as the bytes will be all the same. Having a mismatch set of encoding values only affects non-ASCII characters. Even if WinPS was still a concern I honestly think having a default of UTF-8 as the output when no BOM was detected on the input will help encourage users to have a UTF-8 + BOM file in the first place. Relying on the default locale encoding is very brittle, especially when it comes to sharing scripts as the default could differ across hosts.