Character encoding problems with nuspec files and PowerShell scripts

TomOne commented 11 years ago

Update from 2013-10-29: Note that some information in this thread is wrong:

• There’s actually no character encoding named ANSI. It’s a misnomer for the Windows-1252 code page or other Windows-specific character encodings.
• The cpack command does output nuspec files in UTF-8.

Original issue text: I discovered that the cpack command generates a nupkg containing a nuspec file in the ANSI character encoding, even if the original nuspec file was UTF-8. ANSI is considered outdated and deprecated. Here is a nice explanation of this topic: http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/

The use of ANSI instead of UTF-8 corrupts every character in the Chocolatey gallery that lies outside the ASCII range, for example the μ in μTorrent, the © and many other special characters. It seems that NuGet.exe is responsible for this. Because of this, I’m currently using the NuGet Package Explorer application to build my packages, which uses UTF-8. That works perfectly. I didn’t find any command line switch for NuGet.exe to force UTF-8.

Perhaps some of you think this is an irrelevant issue. But I think it is important to switch to UTF-8 in order to eliminate character encoding problems once and for all. Almost all modern Linux distributions have already done so. Thumbs up for them. :)
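The character damage described in the issue text can be reproduced in a few lines. The following is only an illustrative sketch, not the code path NuGet or the gallery actually takes: it assumes the reader wrongly decodes UTF-8 bytes as Windows-1252 (the code page named in the update), and it assumes the μ is GREEK SMALL LETTER MU (U+03BC). The sample string is made up for the example.

```python
# Illustration only: UTF-8 bytes misread as Windows-1252.
# The sample text is hypothetical; \u03bc is Greek mu, \u00a9 is the copyright sign.
original = "\u03bcTorrent \u00a9 2013"

# Correctly saved file content.
utf8_bytes = original.encode("utf-8")

# A reader that wrongly assumes Windows-1252 turns every multi-byte
# UTF-8 sequence into two unrelated characters.
misread = utf8_bytes.decode("cp1252")

# ASCII characters survive, because both encodings agree on them.
print(misread)

# The damage is reversible only as long as nobody re-saves the misread text.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == original)
```

Only the non-ASCII characters are affected, which matches the symptom reported for the gallery.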

ghost commented 9 years ago

@gep13 that quote is taken out of context. The full quote is

Use the UTF-8 character encoding for the .nuspec and .ps1 files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on Chocolatey.org, because the Gallery assumes UTF-8.

This does not explicitly say what goes wrong if you save a .nuspec with a BOM.

gep13 commented 9 years ago

Perhaps the wiki needs to be refined, but my understanding is that the problem is exactly as stated in the latter part of the quote. If you don't do what is described here, then characters are rendered incorrectly when pushed to the Chocolatey Gallery. @TomOne will be able to provide more information on this, as my understanding on this issue is slim, at best.

ferventcoder commented 9 years ago

There is only one way to really find out what happens if you use a BOM with the nuspec. :)

Redsandro commented 9 years ago

I think you will be okay, as long as you don't use a BOM with a Dutch nuspec.

Kind regards,

~Sander

http://www.Redsandro.com/

jberezanski commented 9 years ago

I don't understand all the anti-BOM advice.

A BOM is a very big clue for the software that reads the file that the file is UTF-8. Without the BOM, the software needs to guess by examining the file contents - an inexact and error-prone process - or assume some default encoding. True, there are programs out there which have incomplete/broken UTF-8 implementation and choke on a BOM, but they are not relevant to Chocolatey. On the other hand, .NET I/O classes by default take advantage of the BOM to unambiguously (or, at least, with a very high degree of probability) detect the encoding of opened files, so all .NET programs (in particular, PowerShell and NuGet) are BOM-aware.

As a BOM is required for ps1 files (because PowerShell does not play the guessing game) and is not required but sometimes helpful to text editors for nuspec files, advising package creators to always use a BOM would be simpler to follow and less error-prone.
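The difference between "detect by BOM" and "assume a default" can be sketched as follows. This is a simplified model, not the .NET or PowerShell implementation: `has_utf8_bom` and `decode_text` are hypothetical helpers, and the Windows-1252 fallback is an assumption standing in for "some default encoding".

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 when a BOM is present, else assume the fallback.

    The fallback models a reader that does not guess; it is an assumption
    made for this sketch, not a documented default of any specific tool.
    """
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(fallback)


# Hypothetical script line containing a copyright sign (U+00A9).
script = "Write-Host '\u00a9 2014'"
with_bom = codecs.BOM_UTF8 + script.encode("utf-8")
without_bom = script.encode("utf-8")

print(decode_text(with_bom) == script)     # BOM present: decoded as UTF-8
print(decode_text(without_bom) == script)  # no BOM: fallback garbles U+00A9
```

With the BOM the text round-trips; without it, the same bytes are interpreted under the assumed default and the non-ASCII character is garbled.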

Redsandro commented 9 years ago

All I care about is that the Unicode characters show properly in packages and on the website/gallery.

They don't by default (when using BOM).

@TomOne figured out a set of rules to follow that makes the packages end up nicely with all Unicode characters in the Chocolatey gallery. Personally, I don't care or understand too much about the theory behind this. I do however know that it doesn't work properly if you don't follow this advice.

But if you are saying that this is also possible with a simple fix in cpack/cpush, I'm all for that.

https://github.com/chocolatey/chocolatey/blob/master/src/functions/Chocolatey-Pack.ps1
https://github.com/chocolatey/chocolatey/blob/master/src/functions/Chocolatey-Push.ps1

jberezanski commented 9 years ago

No fix is necessary, it works right now (0.9.8.27).

https://chocolatey.org/packages/utf8-test
https://github.com/jberezanski/ChocolateyPackages/blob/master/utf8-test/utf8-test.nuspec

(Note: the GitHub web view will not show the BOM, even in Raw mode; you need to clone the repo if you wish to verify.) The package was built using cpack and pushed with cpush.

ghost commented 9 years ago

@jberezanski why does that package have no version history, and no moderation status?

jberezanski commented 9 years ago

It does when I'm logged in, but not when visiting the page anonymously. Probably because it has not been approved yet (not that I expect it ever to be).

Redsandro commented 9 years ago

Sweet! Are you sure you ignored all the 'rules'?

Did @ferventcoder update things? Can you verify, @TomOne? If anything, the instructions are sexier if we don't need the encoding hackery.

ferventcoder commented 9 years ago

When the advice went out we were on NuGet.exe 2.1 and we are using 2.8 now. That could be the difference.

Redsandro commented 9 years ago

@jberezanski and @svnpenn were right to raise the issue. I guess the character guidelines are obsolete.

gep13 commented 9 years ago

This is great news if it is fixed, as it will ease some of the stricter requirements. It would be great if @TomOne, who originally raised the issue, could confirm that everything is now working as expected.

jberezanski commented 9 years ago

@Redsandro:

Are you sure you ignored all the 'rules'?

I ignored one rule: "Do not save your *.nuspec files with a Byte Order Mark (BOM)". I followed the other ones.

@ferventcoder:

When the advice went out we were on NuGet.exe 2.1 and we are using 2.8 now. That could be the difference.

Possible. I found this one reference from 2011 to a NuGet bug with respect to BOM handling, but the bug had apparently been fixed immediately after being reported. I'm not sure which NuGet version that was.

My test proves that current versions of NuGet and the Chocolatey gallery correctly handle nuspec files saved as UTF-8 with BOM. To sum it up, the technical requirements currently are:

1) nuspec files: should be UTF-8, BOM is optional
2) ps1 files: should be UTF-8, BOM is mandatory

I therefore suggest the following change to the character encoding guidance. Instead of those three points:

• Do not save your *.nuspec files with a Byte Order Mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.
• PowerShell scripts need to be saved in UTF-8 with BOM. PowerShell is ignoring the standards and needs a BOM in order to recognize scripts as UTF-8. Otherwise it processes non ASCII characters incorrectly.
• Don’t use the default Windows Editor. In addition to its lack of features, it can’t even save UTF-8 files without BOM (...)

this:

• DO save your *.nuspec files and PowerShell scripts (*.ps1) with a Byte Order Mark (BOM). It helps the tools parse non-ASCII characters correctly and is required by PowerShell.
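The two technical requirements summarized in this comment (nuspec: UTF-8, BOM optional; ps1: UTF-8, BOM mandatory) could be checked before packing with something like the sketch below. `encoding_problems` is a hypothetical helper written for this illustration; it is not part of cpack, cpush or NuGet, and the file names are made up.

```python
import codecs
from pathlib import PurePath
from typing import List


def encoding_problems(filename: str, data: bytes) -> List[str]:
    """Return a list of encoding problems for one package file.

    Rules taken from the summary in this thread:
      - every file must be valid UTF-8 (a BOM is tolerated),
      - .ps1 files must additionally start with a UTF-8 BOM.
    """
    problems = []
    has_bom = data.startswith(codecs.BOM_UTF8)

    try:
        # "utf-8-sig" accepts input with or without a leading BOM.
        data.decode("utf-8-sig")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")

    if PurePath(filename).suffix.lower() == ".ps1" and not has_bom:
        problems.append("ps1 file lacks a UTF-8 BOM")

    return problems


# Hypothetical file contents containing a Greek mu (U+03BC).
sample = "\u03bcTorrent".encode("utf-8")
print(encoding_problems("example.nuspec", sample))
print(encoding_problems("chocolateyInstall.ps1", sample))
print(encoding_problems("chocolateyInstall.ps1", codecs.BOM_UTF8 + sample))
```

Under the simplified guidance proposed above (always save with a BOM), the same check could be tightened to require the BOM for both file types.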

chocolatey-archive / chocolatey

Character encoding problems with nuspec files and PowerShell scripts #294
