Open TomOne opened 11 years ago
@gep13 that quote is taken out of context. The full quote is:
Use the UTF-8 character encoding for the .nuspec and .ps1 files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on Chocolatey.org, because the Gallery assumes UTF-8.
This does not explicitly say what can go wrong if you use a .nuspec with a BOM.
Perhaps the wiki needs to be refined, but my understanding is that the problem is exactly as stated in the latter part of the quote. If you don't do what is described here, then characters are rendered incorrectly when pushed to the Chocolatey Gallery. @TomOne will be able to provide more information on this, as my understanding on this issue is slim, at best.
There is only one way to really find out what happens if you use a BOM with the nuspec. :)
I think you will be okay, as long as you don't use a BOM with a Dutch nuspec.
Kind regards,
~Sander
I don't understand all the anti-BOM advice.
A BOM is a very big clue for the software that reads the file that the file is UTF-8. Without the BOM, the software needs to guess by examining the file contents (an inexact and error-prone process) or assume some default encoding. True, there are programs out there which have incomplete or broken UTF-8 implementations and choke on a BOM, but they are not relevant to Chocolatey. On the other hand, .NET I/O classes by default take advantage of the BOM to unambiguously (or, at least, with a very high degree of probability) detect the encoding of opened files, so all .NET programs (in particular, PowerShell and NuGet) are BOM-aware.
As a BOM is required for ps1 files (due to PowerShell not playing the guessing game) and not required but sometimes helpful for text editors in the case of nuspec files, advising package creators to always use a BOM would be simpler to follow and less error-prone.
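To illustrate the point about guessing versus reading a BOM, here is a minimal Python sketch. It is not how .NET, PowerShell, or NuGet implement detection; the function names and the Windows-1252 fallback are assumptions chosen only for illustration.

```python
import codecs


def sniff_utf8_bom(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode bytes, trusting a BOM if present, else using a default.

    The fallback encoding is a hypothetical default for this sketch;
    real tools choose their own.
    """
    if sniff_utf8_bom(data):
        # "utf-8-sig" decodes UTF-8 and strips the leading BOM.
        return data.decode("utf-8-sig")
    return data.decode(fallback)


# Greek small letter mu and the copyright sign, written as escapes.
text = "\u03bcTorrent \u00a9"
with_bom = codecs.BOM_UTF8 + text.encode("utf-8")
without_bom = text.encode("utf-8")

print(decode_text(with_bom) == text)     # BOM present: read as UTF-8
print(decode_text(without_bom) == text)  # no BOM: the fallback garbles it
```

With the BOM the text round-trips; without it, this reader falls back to its default and the non-ASCII characters come out wrong.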
All I care about is that the Unicode characters show properly in packages and on the website/gallery. They don't by default (when using BOM).
@TomOne figured out a set of rules to follow that makes the packages end up nicely with all Unicode characters in the Chocolatey gallery. Personally, I don't care or understand too much about the theory behind this. I do however know that it doesn't work properly if you don't follow this advice.
If you are saying that this is also possible with a simple fix in cpack/cpush, I'm all for that.
https://github.com/chocolatey/chocolatey/blob/master/src/functions/Chocolatey-Pack.ps1
https://github.com/chocolatey/chocolatey/blob/master/src/functions/Chocolatey-Push.ps1
No fix is necessary, it works right now (0.9.8.27).
https://chocolatey.org/packages/utf8-test
https://github.com/jberezanski/ChocolateyPackages/blob/master/utf8-test/utf8-test.nuspec (note: the GitHub web view will not show the BOM, even in Raw mode - you need to clone the repo if you wish to verify)
The package was built using cpack and pushed with cpush.
@jberezanski why does that package have no version history, and no moderation status?
It does when I'm logged in, but not when visiting the page anonymously. Probably because it has not been approved yet (not that I expect it to ever be).
Sweet! Are you sure you ignored all the 'rules'?
Did @ferventcoder update things? Can you verify, @TomOne? If anything, the instructions are sexier if we don't need the encoding hackery.
When the advice went out we were on Nuget.exe 2.1 and we are using 2.8 now. That could be the difference.
@jberezanski and @svnpenn were right to raise the issue. I guess the character guidelines are obsolete.
This is great news if it is fixed, as it will ease some of the stricter requirements. Would be great to get @TomOne who originally raised the issue to confirm that everything is now working as expected.
@Redsandro:
Are you sure you ignored all the 'rules'?
I ignored one rule: "Do not save your *.nuspec files with a Byte Order Mark (BOM)". I followed the other ones.
@ferventcoder:
When the advice went out we were on Nuget.exe 2.1 and we are using 2.8 now. That could be the difference.
Possible. I found this one reference from 2011 to a NuGet bug with respect to BOM handling, but the bug had apparently been fixed immediately after being reported. Not sure which NuGet version that was.
My test proves that current versions of NuGet and the Chocolatey gallery correctly handle nuspec files saved as UTF-8 with BOM. To sum it up, the technical requirements currently are: 1) nuspec files: should be UTF-8, BOM is optional 2) ps1 files: should be UTF-8, BOM is mandatory
I therefore suggest the following change to the character encoding guidance: Instead of those three points
• Do not save your *.nuspec files with a Byte Order Mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.
• PowerShell scripts need to be saved in UTF-8 with BOM. PowerShell is ignoring the standards and needs a BOM in order to recognize scripts as UTF-8. Otherwise it processes non ASCII characters incorrectly.
• Don’t use the default Windows Editor. In addition to its lack of features, it can’t even save UTF-8 files without BOM (...)
this:
• DO save your *.nuspec files and PowerShell scripts (*.ps1) with a Byte Order Mark (BOM). It helps the tools parse non-ASCII characters correctly and is required by PowerShell.
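For anyone who wants to apply the proposed rule to an existing package folder, the following is a rough Python sketch, not part of Chocolatey or NuGet. The helper names `ensure_utf8_bom` and `fix_package_dir` are hypothetical, and the sketch assumes the files are already valid UTF-8.

```python
import codecs
from pathlib import Path
from typing import List


def ensure_utf8_bom(path: Path) -> bool:
    """Prepend a UTF-8 BOM if it is missing; return True if the file changed.

    Assumes the file is already UTF-8. If it is not, decoding raises
    UnicodeDecodeError and the file is left untouched.
    """
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False
    data.decode("utf-8")  # validate before rewriting anything
    path.write_bytes(codecs.BOM_UTF8 + data)
    return True


def fix_package_dir(root: Path) -> List[Path]:
    """Add a BOM to every *.nuspec and *.ps1 file under root that lacks one."""
    changed = []
    for pattern in ("*.nuspec", "*.ps1"):
        for path in sorted(root.rglob(pattern)):
            if ensure_utf8_bom(path):
                changed.append(path)
    return changed
```

Calling `fix_package_dir(Path("path/to/package"))` (the path is a placeholder) returns the list of files that were rewritten. Running it a second time should return an empty list.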
Update from 2013-10-29: Note that some information in this thread is wrong:
Original issue text: I discovered that the cpack command generates a nupkg which contains a nuspec file with the ANSI character encoding, even if the original nuspec file was UTF-8. ANSI is considered outdated and deprecated. Here is a nice explanation of this topic: http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/ The use of ANSI instead of UTF-8 corrupts all non-ASCII characters in the Chocolatey gallery, for example the μ in μTorrent, the ©, and many other special characters. It seems that NuGet.exe is responsible for this. Because of this, I'm currently using the NuGet Package Explorer application to build my packages, which uses UTF-8. That works perfectly. I didn’t find any command line switches for NuGet.exe to force UTF-8.
Perhaps some of you think this is an irrelevant issue. But I think it is important to switch to UTF-8 in order to eliminate character encoding problems once and for all. Almost all modern Linux distributions have already done so. Thumbs up for them. :)
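The corruption described in the original issue text can be reproduced in a few lines of Python. This is only a sketch: it assumes "ANSI" means the Windows-1252 code page (the real code page depends on the system locale) and that the reader decodes as UTF-8, as the gallery is said to. It does not claim to reproduce what NuGet.exe did internally, and `ansi_then_utf8` is a hypothetical helper name.

```python
def ansi_then_utf8(text: str, ansi: str = "cp1252") -> str:
    """Encode text in an ANSI code page, then decode those bytes as UTF-8."""
    # Characters the code page cannot represent are replaced with "?".
    raw = text.encode(ansi, errors="replace")
    # Bytes that are not valid UTF-8 are replaced with U+FFFD.
    return raw.decode("utf-8", errors="replace")


print(ansi_then_utf8("Torrent"))          # pure ASCII survives unchanged
print(ansi_then_utf8("\u03bcTorrent"))    # Greek mu is lost when encoding
print(ansi_then_utf8("\u00a9 2013"))      # copyright sign is lost when decoding
```

Plain ASCII survives because both encodings agree on it. The Greek mu has no Windows-1252 representation at all, and the copyright sign becomes a single byte (0xA9) that is not valid UTF-8 on its own.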