MicrosoftDocs / PowerShell-Docs

Clarify UTF8 and UTF8BOM/UTF8NoBOM #4021

Closed. WilliamXieMSFT closed this issue 5 years ago

WilliamXieMSFT commented 5 years ago

  • UTF8: Encodes in UTF-8 format.
  • UTF8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
  • UTF8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)

For UTF8, I assume this is UTF8NoBOM? Can there be some clarifying text around this?

Would it be possible to add a helpful note around the change for defaults? PS5.1 (ASCII) to PS6 (UTF8NoBOM)?

sdwheeler commented 5 years ago

The change is covered in the release notes and there is a detailed article about encoding.

WilliamXieMSFT commented 5 years ago

Thanks for the links, Sean! My gripe was that there's both UTF8 and UTF8NoBOM, which feels like UTF8NoBOM is redundant.

sdwheeler commented 5 years ago

UTF8 and UTF8NoBOM are different. UTF8 has a byte-order-mark (BOM) at the beginning of the file. UTF8NoBOM does not. The BOM is not always compatible across applications and platforms.

arfmach commented 4 years ago

In fact, UTF8BOM is not recognized by the Out-File cmdlet. I'm using PowerShell version 5.1.18362.145 and the output is:

Out-File : Cannot validate argument on parameter 'Encoding'. The argument "UTF8BOM" does not belong to the set "unknown;string;unicode;bigendianunicode;utf8;utf7;utf32;ascii;default;oem" specified by the ValidateSet attribute.
Supply an argument that is in the set and then try the command again.
At line:1 char:20
+ Out-File -Encoding UTF8BOM Teste.txt
+                    ~~~~~~~
    + CategoryInfo          : InvalidData: (:) [Out-File], ParameterBindingValidationException
    + FullyQualifiedErrorId : ParameterArgumentValidationError,Microsoft.PowerShell.Commands.OutFileCommand

hilari0n commented 3 years ago

I don't understand why this question was dismissed/closed. I don't see any comments actually explaining/addressing the issue. The documentation was mentioning 3 distinct options (it was for PowerShell 6, I believe), and it still does for PowerShell versions 7.0, 7.1 and 7.2 (emphasis mine):

  • utf8: Encodes in UTF-8 format.
  • utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
  • utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)

None of those document versions say whether the first option ("utf8") encodes with or without a Byte Order Mark, or whether this behavior depends on the platform. The documents linked by @sdwheeler describe that the default encoding has changed (to "UTF8NoBOM") and how encoding works in PowerShell in general. Neither of them says whether "utf8" encoding in PowerShell uses a BOM. The second document mentions "utf8" as an encoding without BOM, but in the VSCode context, not PowerShell.

This document:

https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_character_encoding?view=powershell-7.1#character-encoding-in-windows-powershell

mentions "UTF8" as an encoding with BOM, but it states that this applies to PowerShell 5.1, so it does not list "utf8BOM" or "utf8NoBOM". In that light it's hard to assess whether it applies to PowerShell 7+ in any way.

All in all, there seems to be no consistent and clear document for PowerShell 7+ addressing the issue raised by @WilliamXieMSFT. If I'm mistaken, please share the links or appropriate quotes from already linked documents.

Edit: Just after posting, I found a relevant mention in the document I linked, in the section for PowerShell Core below the one I linked before. It does explain how the BOM works for all 3 options:

  • utf8: Encodes in UTF-8 format (no BOM).
  • utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
  • utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)

I fail to understand why the documentation for Out-File (and others, e.g. Export-Csv) can't also be clear on that, i.e. that the option utf8 does not use a BOM in PowerShell Core (and it did use a BOM in Windows PowerShell), especially since this option apparently underwent an important change.

me-suzy commented 3 years ago

Hello. I'm trying to convert a file from UTF-8 to UTF-8 BOM, but the PowerShell code is not working. I get an error that says:

"Unable to match the identifier name utf8NoBOM to a valid enumerator name. Specify one of the following enumerator names and try again: Unknown, String, Unicode, Byte, BigEndianUnicode, UTF8, UTF7, UTF32, Ascii, Default, Oem, BigEndianUTF32"

This is the PowerShell code:

$a = "C:/Folder1/TEST_ro.txt"
$b = "C:/Folder1/TEST_ro-2.txt"
(Get-Content -path $a) | Set-Content -Encoding UTF8BOM -Path $b

sdwheeler commented 3 years ago

@me-suzy PowerShell 6+ supports the following encodings:

Windows PowerShell 5.1 (and earlier) supports:

Note that this does not include utf8NoBOM.
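
Since the accepted `-Encoding` names differ between versions, one way to sidestep the cmdlet parameter entirely is to call .NET directly. The following is only a minimal sketch, not an official recommendation: `Convert-ToUtf8Bom` is a hypothetical helper name, and it assumes the source file really is UTF-8 and that both paths are absolute (the .NET file APIs resolve relative paths against the process working directory, which may differ from the PowerShell location).

```powershell
# Hypothetical helper (not part of PowerShell): rewrites a UTF-8 file as UTF-8 with BOM.
function Convert-ToUtf8Bom {
    param(
        [Parameter(Mandatory)][string]$Source,
        [Parameter(Mandatory)][string]$Destination
    )

    # Read explicitly as UTF-8 so the result does not depend on the
    # version-specific default encoding of Get-Content.
    $text = [System.IO.File]::ReadAllText($Source, [System.Text.Encoding]::UTF8)

    # UTF8Encoding($true) emits the byte order mark (EF BB BF) when writing.
    $utf8WithBom = New-Object System.Text.UTF8Encoding($true)
    [System.IO.File]::WriteAllText($Destination, $text, $utf8WithBom)
}
```

With the variables from the snippet above, the call would be `Convert-ToUtf8Bom -Source $a -Destination $b`. Because this uses only .NET classes available to both editions, it should behave the same in powershell.exe and pwsh.exe.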

me-suzy commented 3 years ago

OK, I searched the internet and found 2 solutions, both using Notepad++. Both are very easy: one uses a regex, the other a Python script.

https://community.notepad-plus-plus.org/topic/21200/change-save-encoding-how-to-convert-800-txt-files-utf-8-to-utf-8-bom

MikeRosoft commented 2 years ago

To fully answer the question (by directly testing the following commands in powershell.exe and pwsh.exe on my system):

$data=@{}
$data.Foo='Foo'
$data.Bar='Bar'
$data | ConvertTo-Json | Out-File 'c:\temp\data.json' -Encoding utf8

In Windows PowerShell (version 5.1) this writes the file with a BOM. In PowerShell Core (version 7.2.1) this writes the file without a BOM.
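
To repeat this test on another system, the presence of the BOM can be checked by looking at the first three bytes of the file. This is a small sketch under stated assumptions: `Test-Utf8Bom` is a hypothetical helper name, and the sample file is written to the temp directory rather than `c:\temp` so the snippet does not depend on that folder existing.

```powershell
# Hypothetical helper: returns $true if the file starts with the UTF-8 BOM (EF BB BF).
function Test-Utf8Bom {
    param([Parameter(Mandatory)][string]$Path)

    # Resolve to a full file system path before handing it to .NET.
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $bytes = [System.IO.File]::ReadAllBytes($fullPath)

    return ($bytes.Length -ge 3 -and
            $bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF)
}

# Write the same kind of sample as above, then inspect the result.
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'data.json'
@{ Foo = 'Foo'; Bar = 'Bar' } | ConvertTo-Json | Out-File $path -Encoding utf8
Test-Utf8Bom $path
```

Going by the results reported above, this should print True in powershell.exe (5.1) and False in pwsh.exe (7.x).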

the-working-rene commented 4 weeks ago

In the current documentation for the Set-Content and Add-Content commands, it is still not clear what exactly "utf8" means (with or without BOM).

  • utf8: Encodes in UTF-8 format.
  • utf8BOM: Encodes in UTF-8 format with Byte Order Mark (BOM)
  • utf8NoBOM: Encodes in UTF-8 format without Byte Order Mark (BOM)

Based on the comments here and my own testing, "utf8" is equivalent to "utf8NoBOM". A short hint about that behavior would be really helpful in that documentation, especially because the behavior changed between PowerShell 5.1 and PowerShell 6.
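
For scripts that have to run on both editions and need a BOM, one possible workaround is to pick the encoding name by version. This is a sketch based only on the behavior described in this thread (in 5.1, "utf8" writes a BOM and "utf8BOM" is rejected; in 6+, "utf8BOM" is accepted); the variable names and the temp file are illustrative.

```powershell
# Assumption from this thread: 'utf8BOM' is only accepted by PowerShell 6+,
# while plain 'utf8' already writes a BOM in Windows PowerShell 5.1.
$bomEncoding = if ($PSVersionTable.PSVersion.Major -ge 6) { 'utf8BOM' } else { 'utf8' }

$path = Join-Path ([System.IO.Path]::GetTempPath()) 'utf8-bom-sample.txt'
'Foo' | Set-Content -Path $path -Encoding $bomEncoding

# Show the first three bytes of the file as hex.
$bytes = [System.IO.File]::ReadAllBytes($path)
($bytes[0..2] | ForEach-Object { $_.ToString('X2') }) -join ' '
```

If the assumption holds, the last line should print EF BB BF on either edition.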