Character encoding problems with nuspec files and PowerShell scripts

TomOne commented 11 years ago

Update from 2013-10-29: Note that some information in this thread is wrong:

There’s actually no character encoding named ANSI. It’s a wrong term for the Windows-1252 codepage or other Windows-specific character encodings.
The cpack command does output nuspec files in UTF-8.

Original issue text: I discovered that the cpack command generates a nupkg which contains a nuspec file with the ANSI character encoding, even if the original nuspec file had UTF-8. ANSI is considered as outdated and deprecated. Here is a nice explanation of this topic: http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/ The use of ANSI instead of UTF-8 destroys all characters in the Chocolatey gallery which are not saved in the first byte of the character encoding, for example the μ in μTorrent, the © and many other special characters. It seems that the NuGet.exe is responsible for this. Because of this currently I’m using the NuGet Package Explorer application to build my packages, which uses UTF-8. That works perfectly. I didn’t find any command line switches for NuGet.exe to force UTF-8.

Perhaps some of you think this is an irrelevant issue. But I think it is important to switch to UTF-8 in order to eliminate character encoding problems once and for all. Almost all modern Linux distributions have already done so. Thumbs up for them. :)

TomOne commented 10 years ago

OK, but I still don’t agree with putting

Either:

Do not use non-ASCII characters.

into the wiki. I thought the idea is to write fewer (but sensible) guidelines, not more. I’ll try to explain why that “non-ASCII characters” guideline does not make sense, using a little story (inspired by https://github.com/chocolatey/chocolatey/wiki/ChocolateyStory):

There are two guys, Bob and Richard. Both are Windows users and discovered Chocolatey a few months ago. Now they also want to maintain some packages of their favourite software.

Richard doesn’t care much about guidelines and rules, he starts immediately with creating packages and doesn’t read the Chocolatey wiki. He uses the Windows Editor to edit his nuspec files and saves them with the default character encoding (Micromisleadingsoft® ANSI™). But in some of his packages there are characters like ©, ® and en dashes. They get displayed with ? signs in the Chocolatey Gallery, but he doesn’t even notice it. After a few weeks he coincidentally notices this problem, but he does not care.

Unlike Richard, Bob is assiduous. He reads the Create Packages page and learns that nuspec files must be saved as UTF-8 without BOM. He didn’t know much about character encodings before, but he configures his Notepad++ editor to save files as UTF-8 without BOM by default. Now he won’t have any problems with character encodings in nuspec files and enjoys creating packages.

Unfortunately there are many Richard-like package maintainers on Chocolatey.org. For them, a suggestion in the wiki like “Do not use non-ASCII characters.” wouldn’t have any effect. And for people like Bob, such a suggestion would be totally superfluous, because the solution to set the correct character encoding in the editor once is certainly less complicated than having to remember each time when editing a nuspec to not use non-ASCII characters.

Don’t be a Richard. :smile:

ferventcoder commented 10 years ago

Tom you crack me up. :D


Rob "Be passionate in all you do"

http://devlicio.us/blogs/rob_reynolds http://ferventcoder.com http://twitter.com/ferventcoder

gep13 commented 10 years ago

An out there suggestion, but...

Can we create a choco package that configures this as a default in the common editors? Assuming that these settings are exposed somewhere, it should be trivial to check whether an editor is installed, and make the update.

Job done!

TomOne commented 10 years ago

@Redsandro: I changed the character encoding section as you suggested, but left your recommendation about ASCII out. I hope that my story about Bob and Richard could convince you. :smile:

@gep13, a package that configures text editors with proper character encodings would be interesting. I’ve done some research on text editors on Windows and their character encodings. We could use it to determine which editors would need a “fix”:

Notepad++ uses UTF-8 without BOM as default encoding since a few releases. But if you upgrade from older versions, it continues using Windows-1252 (incorrectly called ANSI) as default encoding.
Notepad2 uses that stupid “ANSI” by default. :-1:
PSPad too
Sublime Text 2 and 3 use UTF-8 without BOM as default encoding.
Geany: UTF-8 without BOM by default. :+1:
Gedit for Windows too.
Brackets too.
Vim/GVim isn’t even able to show most Unicode characters. But I think that’s a Windows related issue. In addition, Vim/GVim on Windows doesn’t use UTF-8 as default character encoding. But Vim on Linux shows all Unicode characters provided by the font and uses UTF-8 without BOM by default.

As you can see, most cross-platform editors (except Vim) use UTF-8 without BOM by default. Unfortunately this means that most of the Windows-only community adopted Microsoft’s retarded standards policy. It’s always annoying when a stubborn company insists on obsolete standards and mostly ignores user feedback and bug reports. If someone claims Microsoft is innovative, I will point the finger at him and laugh at him. Innovation can never coexist together with such an ultraconservative and ignorant behaviour.

I’m sorry that I criticize Microsoft with these words. But of course my main interest here is to improve Chocolatey. :)

Redsandro commented 10 years ago

@TomOne: Thanks! :)

Although I disagree with your story. Meet Hank. Always in a hurry. Dyslexic. Hates reading. Hates editors. Hates Windows. Doesn't care about other people's mess, but cares about end user quality.

Will he be scared away by half a page of encoding guidelines, or will he be saved by a simple rule to keep in mind?

The fact is, there are many Richard-like package maintainers on Chocolatey.org. Without them, Chocolatey would be worth less. Hank-like packagers are better than Richard-like packagers.

Don't chase away net quality improving Hanks. :runner:

Let's agree to disagree.

TomOne commented 10 years ago

You’ve probably never looked at Debian’s guidelines for package maintainers: http://www.debian.org/doc/manuals/maint-guide/

The fact is: being a package maintainer requires great responsibility. Therefore one must spend some time to read rules and guidelines of the community. If one does not like to do so, he should never ever join a community.

Meet Hank. Always in a hurry. Dyslexic. Hates reading. Hates editors. Hates Windows. Doesn't care about other people's mess, but cares about end user quality.

That’s a paradoxical behaviour. A guy who cares about end user quality, but has all the negative character properties you described? Very unlikely if you ask me. I’m glad that I’ve never met such a person. :smile:

TomOne commented 10 years ago

By the way, if Debian’s documentation would scare away potential package maintainers, then why is it so popular? Why are there so many Debian package maintainers?

Don’t tell me that almost all Windows users are like Richard or Hank. If that would be true, I would immediately remove Windows from all computers I have access to and throw my Windows license out the window. :smile:

Redsandro commented 10 years ago

Yes I know the docs, I run Debian myself. :bowtie:

But I would never be a Debian packager. Their process is why a lot of popular software isn't packaged for Debian. That's why there has been an ITP (intent to package) of Ubuntu One for 5 years now and it still isn't there. FreeFileSync has been ITP for 3 years and has never seen the light.

I’m glad that I’ve never met such a person. :smile:

Well I might have exaggerated a bit, but here I am.

Like Hank, I care about the quality of the user experience. Not about rules for the sake of rules, when it is proven that 9 out of 10 packages work fine when ignoring them. My packages are awesome, and apart from one*, they are all just default ANSI and no one will ever notice or care or have a problem with them.

*) Hence my rule, for this one package your rules applied, and now it works fine too. User experience quality: 100%.

(TBH I made one mistake in one package about guidelines but that one is unlisted now.)

Don’t tell me that almost all Windows users are like Richard or Hank.

I didn't say that. I said that there are many Richard-like package maintainers on Chocolatey.org which is an exact quote from you. To prove my point. That if everyone would share my opinion, Chocolatey's supply would be higher quality than it is today, because Hanks care more than Richards.

And to make it more fun, I actually think most Windows users are Richards. Certainly not all. There are a lot of Bobs. Only the above-average Windows user will even know/use/package (for) Chocolatey, and you said it yourself: there are many Richard-like package maintainers. Imagine how many Richards there are in the entire pool of Windows users. :scream:

(Just to clarify, I'm contradicting you because I don't agree, but in no way do I want to imply that all Windows users are Richards. All devs are awesome. Windows devs are even more awesome because they have to dig through those weird Microsoft syntaxes. :stuck_out_tongue: )

-edit-

Same thing happened with the multipackage installation code I proposed. ferventcoder totally rewrote it in clever ways inside the vbs instead of outside. But for the user experience the end result is the same. It's a great addition either way.

Let's just say we all care about something and we should applaud that. Am I wrong? :wink:

TomOne commented 10 years ago

OK, then another try: Take a look at Arch Linux’ guidelines: https://wiki.archlinux.org/index.php/Creating_Packages Don’t tell me that there are too few packages for Arch Linux. :smile:

Yes, unlike the Linux package world, there are many Richard-like Chocolatey package maintainers. The reasons for that are quite obvious:

Until now, Chocolatey does not have a documentation that covers all important topics. That’s understandable, because it’s a relatively young project.
There is no moderation or inspection of new packages by experienced maintainers. Imagine you live in a world without police. It’s obvious that in such a world laws wouldn’t be followed by most people, which would end in total chaos.

And remember, we’re talking about guidelines for package maintainers here, not for average Windows users. I still don’t understand why you insist on recommending to use obsolete character encodings, just to support the laziness. It’s so easy to change that setting in the editor. I don’t think that Chocolatey package maintainers are idiots.

Maintainers also need to know about PowerShell if they want to create packages that do more than just invoking the Install-ChocolateyPackage helper function. Isn’t that a lot more work than changing the character encoding setting in the editor to UTF-8?

Redsandro commented 10 years ago

Microsoft chose ANSI and by ANSI they meant Windows-1252. I don't know if that's part of an "embrace, extend and extinguish" strategy, but when I am on a Windows machine, it's the law.

They made that a default, and when I use that, everything works. Except for characters that you don't find on MSX-Basic either. So I don't use them. Windows doesn't care because it doesn't need to decode characters. As long as you don't use characters that need decoding.

It is extremely simple.

You are seriously comparing this to total chaos?

I'm gonna write my next package in Notepad.exe and no one will notice. :smiling_imp:

TomOne commented 10 years ago

Hm, then why did the NuGet developers decide to use UTF-8? Aren’t some of them Microsoft employees? :smile:

You are seriously comparing this to total chaos?

You’re misinterpreting this. I didn’t talk just about the character encoding problem, it’s just a scenario if no one would respect rules.

Another example: Internet Explorer is the default browser in Windows. Why do most people make the effort and install an alternative browser? Internet Explorer will just work fine for them, if they can live without proper HTML5/CSS3 support. :laughing:

If you like simplicity, then go forth and use Internet Explorer. We must accept Microsoft’s laws, otherwise we will hurt us. :smile:

I’m sorry for that irony, but your arguments make little sense to me.

Redsandro commented 10 years ago

Your Internet Explorer analogy is flawed. If there's one thing I keep repeating I care about most, it's user experience. Stay away from IE! :smile:

I don't see how you don't get it. Fact: Most packages work perfectly. Fact: Most packages don't have or need character encoding because they don't need character decoding. Fact: If you do need decoded characters, you have to take care of character encoding.

These are the facts, so I am factually right, although you can disagree with my opinion about the thoughts that surround it.

So again, agree to disagree, or be wrong. Your choice. :P For the rest, I'm not interested in discussing the matter any further.

TomOne commented 10 years ago

Stay away from IE! :smile:

Yes. Friends don’t let friends use Internet Explorer. But IE isn’t the only Microsoft stuff I don’t like. I’m quite happy that their server software with their locked in ecosystem (Windows Server, ASP/ASP.NET, IIS) is losing market share to Linux/OSS solutions. :smile: Let’s build a better Internet that belongs to everyone, using openness and innovation. :+1:

You’re right, let’s stop that discussion here. So here’s a short summary of the results:

Recommendation for maintainers who read the wiki: UTF-8.
If maintainers don’t use UTF-8, no problem as long as they don’t use non-ASCII characters.
Incorrectly displayed characters on Chocolatey.org are not a serious problem as long as the affected packages are functional.

Redsandro commented 10 years ago

Agreed, but when maintainers notice they 'suddenly' have a character problem, they should want to read the guidelines in more detail and discover what is happening and fix the problem. If not, we give them a yellow card. :P

TomOne commented 10 years ago

Of course.

ferventcoder commented 10 years ago

You both crack me up. I get the humor on both sides, but from this conversation I'm not yet sure you get each other's humor. :)


TomOne commented 10 years ago

:smile:

I have an idea how we can solve this problem once and for all, even for Richards, Hanks and for people who have not a single clue about character encodings. As you know, there’s much work on the computer that can be automated. Why not automate a character encoding conversion to the correct encodings before a package gets built? That conversion could be for example executed as first step of the cpack command.

Now to the technical part: I already mentioned that character encoding detection is not a simple task. It requires a lot more than just a few lines of code. This document describes the approaches: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

For Python, there exists a nice tool to detect the character encoding: https://pypi.python.org/pypi/chardet/2.1.1

I also discovered a PowerShell script that does the same: http://danspowershellstuff.blogspot.it/2012/02/get-file-encoding-even-if-no-byte-order.html.

If we were on Linux, these things would be a lot easier: using file -i to detect the character encoding and iconv to convert it.
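For readers without file and iconv at hand, the same detect-then-convert idea can be sketched in Python using only the standard library. This is a rough sketch and not what cpack or NuGet actually does: the helper names `decode_nuspec` and `convert_to_utf8` are made up for illustration, and the fallback to Windows-1252 is an assumption (real detection, as in chardet, is statistical and more involved).

```python
from pathlib import Path


def decode_nuspec(raw, fallback="cp1252"):
    """Return (text, encoding_used) for the raw bytes of a nuspec file.

    Strict UTF-8 is tried first ("utf-8-sig" also drops a leading BOM).
    Only if that fails is the legacy fallback encoding assumed.
    """
    try:
        return raw.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is Windows-1252.
        # A few bytes (e.g. 0x81) are undefined in cp1252 and still raise here.
        return raw.decode(fallback), fallback


def convert_to_utf8(path):
    """Rewrite the file at `path` as UTF-8 without BOM; return the encoding found."""
    file = Path(path)
    text, found = decode_nuspec(file.read_bytes())
    file.write_bytes(text.encode("utf-8"))
    return found
```

A file that contains only ASCII is already valid UTF-8, so it passes through unchanged. The fallback guess is the weak point: a file saved in some other 8-bit codepage would be converted without an error but with the wrong characters.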

rismoney commented 10 years ago

Why not automate a character encoding conversion to the correct encodings before a package gets built?

I like it. -Rich :)

ferventcoder commented 10 years ago

That's actually interesting...


Redsandro commented 10 years ago

I don't think that will work. Once you save your file in standard (according to Windows) encoding, the character becomes unreadable when you reopen the file. But when I change encoding to utf-8, the char remains unreadable. I have to enter it again.

But if it will work, it's clever. What are computers for? To make our lives easier. :)

ferventcoder commented 10 years ago

Well it was a cool theory anyway. ;)


TomOne commented 10 years ago

I don't think that will work. Once you save your file in standard (according to Windows) encoding, the character becomes unreadable when you reopen the file. But when I change encoding to utf-8, the char remains unreadable. I have to enter it again.

I don’t understand what you mean by that. Why would the characters become unreadable? Of course the character encoding conversion works. I tested it myself using Cygwin with the file -i and iconv method.

Redsandro commented 10 years ago

Once certain characters are □, like □Torrent □2013 bitTorrent□, you can't get them back from a conversion.

But if you say there's no problem, I'm not gonna argue. I'd love to have cpack take care of this!

I □ Unicode.

-edit-

I just tried. If you save it directly and load it directly, there is no problem. But if you save your package to a git repository, load it on a different machine and then push it, it's already too late.

-edit-

Or something. I have had this, but am not exactly sure what step causes this.

gep13 commented 10 years ago

Agreed, if we could wrap this up in cpack to the point that this was a non-issue, that would be best for everyone. Would the act of cpack result in changes to the physical file? I.e. if someone is using source control (and obviously they should be), would they be left with the "correct" encoding on the file? If so, should cpack throw a message back to indicate what was done, and why?

ferventcoder commented 10 years ago

It sounds like cpack would be too late in the process to be helpful.

gep13 commented 10 years ago

Would it be worth creating another command that could be used ahead of cpack? A dedicated function for doing just this?

The only problem with this is that it would be down to the user to run the command, and to know to run the command, which takes us back to the same problem.

The only other suggestion would be to add an “automatic moderation” step, post upload to the chocolatey servers. Almost synonymous with a compilation, where we do all the necessary work, in this case the encoding, before setting the package live.

ferventcoder commented 10 years ago

I guess I'm missing something now. You can't get back items that have been converted to ASCII format where those items are now boxes. So converting it back to UTF8 doesn't win you anything.

TomOne commented 10 years ago

The nuspec files are never converted to ASCII, so that’s not the problem here. The problem we need to solve is that some package maintainers save them in Windows-1252 instead of UTF-8.

TomOne commented 10 years ago

It sounds like cpack would be too late in the process to be helpful.

Why too late? When you for example execute the cpack command, the character encodings (Windows-1252, UCS-2 or other weird encodings) could be converted into UTF-8 first and then the cpack command could do the same as it already does now.

ferventcoder commented 10 years ago

It could be I missed this https://github.com/chocolatey/chocolatey/issues/294#issuecomment-27632769

TomOne commented 10 years ago

@Redsandro, you wrote this:

Once certain characters are □, like □Torrent □2013 bitTorrent□, you can't get them back from a conversion.

I just tried. If you save it directly and load it directly, there is no problem. But if you save your package to a git repository, load it on a different machine and then push it, it's already too late.

I tried to reproduce this with Git, but the characters are all fine. This is the workflow I used:

1. Save uTorrent.nuspec with Windows-1252 encoding, including an μ character.
2. Commit & push
3. Git clone to another directory on the computer
4. Convert the nuspec to UTF-8 without BOM
5. Commit & push
6. Git clone to another folder on the computer
7. Open nuspec, the editor correctly recognises the file as UTF-8 and the μ character gets displayed correctly

So here absolutely nothing gets lost. I used https://github.com/tomone/suggestions as my testing repository.

But if you try to download an already built package like μTorrent, unpack it and open the nuspec, your editor will probably show �Torrent. Setting the encoding to Windows-1252 does not help either. Building a package probably causes non-UTF-8 encodings to be destroyed. I don’t know what happens there in the background, because I have only basic knowledge about character encodings.
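Why switching the editor to Windows-1252 no longer helps can be illustrated with a few lines of Python. This is only a sketch under one assumption, namely that the packaging step decodes the file as UTF-8 and substitutes U+FFFD for invalid bytes; the variable names are hypothetical.

```python
# The micro sign and the copyright sign, written as escapes to stay encoding-neutral.
original = "\u00b5Torrent \u00a9 2013"

# What a Windows-1252 editor writes to disk: one byte per character.
cp1252_bytes = original.encode("cp1252")

# Assumed packaging behaviour: decode as UTF-8, replace every invalid byte with U+FFFD.
damaged_text = cp1252_bytes.decode("utf-8", errors="replace")

# The packaged nuspec is now valid UTF-8, but each original byte became EF BF BD.
damaged_bytes = damaged_text.encode("utf-8")

# Reopening the damaged file as Windows-1252 shows three junk characters per lost one.
reopened_as_cp1252 = damaged_bytes.decode("cp1252")

# Different source characters collapse into the same replacement character,
# so nothing in the damaged file says which character was there originally.
collapsed = {ch.encode("cp1252").decode("utf-8", errors="replace")
             for ch in ("\u00b5", "\u00a9", "\u00ae")}
```

Here `damaged_text` holds U+FFFD where the micro and copyright signs were, and `collapsed` contains a single element. That is why the lost characters have to be retyped by hand, while a nuspec that still holds its original Windows-1252 bytes can be converted without loss.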

So the only workflow where the character encoding would be destroyed is the following:

1. Create a package, using some non-ASCII characters like ©, ®, μ or –.
2. Save nuspec as Windows-1252
3. Build package
4. On every package update, take a previously built package (the nupkg), extract it, insert the new version and download links and then build it again.

Well, this is a weird workflow, but of course we also must take this into account. What about letting the cpack command scan the nuspec file for weird characters caused by a broken encoding and stop in that case? A meaningful error message could be displayed and then the package maintainer must investigate that character encoding problem.

-- edit -- corrected step 4 of the second workflow

TomOne commented 10 years ago

I’ve investigated this problem and now I know a possible solution and what happens in the background:

Windows-1252 is an 8-bit character encoding, while ASCII is 7-bit.
In UTF-8, a single byte between (hex) 80 and FF is never a valid character on its own, but in Windows-1252 these byte values represent characters like µ, Ä, ® and ©.
NuGet assumes UTF-8 and therefore replaces the invalid single bytes between 80 and FF with the Unicode replacement character � (U+FFFD), which is one possible way to handle them.
Therefore this replacement character shows up in the gallery for example in the μTorrent package
However, RFC 3629 states that implementations of the decoding algorithm must protect against decoding invalid sequences. The Unicode standard requires that decoders treat any invalid character sequence as an error condition.
If NuGet would follow RFC 3629 and throw an error when a maintainer attempts to package a Windows-1252 encoded nuspec with 80 to FF characters, we wouldn’t have any problem.
Therefore, the best way to fix this issue is to implement such an error message on invalid UTF-8 characters directly in NuGet. Has anybody tested newer NuGet versions to see if this has been implemented already? If not, I would open an issue for this on CodePlex.
In the case the NuGet developers ignore this issue, I would really like if it gets implemented into Chocolatey’s cpack, together with a character encoding detection and conversion to UTF-8.

-- edit -- Sorry, it’s (hex) 80 and not 7F.
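The byte-level claims above are easy to check in Python. The snippet is just an illustration using the standard codecs; `is_valid_utf8` is a hypothetical helper name, and strict decoding stands in for the "throw an error" behaviour wished for from NuGet.

```python
def is_valid_utf8(raw):
    """True if the bytes decode as UTF-8 under strict error handling."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The four example characters: micro sign, A with diaeresis, registered sign, copyright sign.
examples = ["\u00b5", "\u00c4", "\u00ae", "\u00a9"]

# Map each code point to (cp1252 bytes, utf-8 bytes) as hex strings.
byte_table = {
    f"U+{ord(ch):04X}": (ch.encode("cp1252").hex(), ch.encode("utf-8").hex())
    for ch in examples
}

# No byte in the range 0x80..0xFF is valid UTF-8 when it stands alone.
lone_high_bytes_valid = [b for b in range(0x80, 0x100) if is_valid_utf8(bytes([b]))]
```

`byte_table["U+00B5"]` comes out as `("b5", "c2b5")`: one byte in Windows-1252, two bytes in UTF-8. `lone_high_bytes_valid` stays empty, which is exactly what makes this kind of mistake detectable.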

TomOne commented 10 years ago

Sorry, I forgot the solution for nuspecs whose Windows-1252 80 to FF characters have already been destroyed and replaced with �: cpack could detect if the � character is present in the nuspec, abort in that case and throw an error message. This message should mention that UTF-8 should be used and that � characters must be replaced with the correct characters, otherwise it won’t proceed.

Let’s improve Chocolatey’s package quality and ban those annoying � characters. :+1:

I :heart: Unicode.

Redsandro commented 10 years ago

This I like.

Although I always thought cpack was just a reference plus some alternate configuration to the nuget pack binary. Customizing and compiling binaries seems like a difficult thing. But it's too far from my bed so what do I know. :)

TomOne commented 10 years ago

Although I always thought cpack was just a reference plus some alternate configuration to the nuget pack binary.

Yes, it’s a Windows Batch file in C:\Chocolatey\bin\cpack.bat.

But if we really want to fix this character encoding problem, we need to add a control structure for it to the cpack command, since that is the only contact point between the package source files and Chocolatey before the package gets pushed. That’s the only possibility, because there will always be people who don’t read and respect rules and guidelines. And even if there would be some kind of moderation for Chocolatey packages, it’s always better when something gets automated when it’s possible, so it would be less work for the moderators.

A good program doesn’t let the user make mistakes if they can be detected, but also doesn’t treat users like idiots. :smiley: I think this is one of the key rules of good usability.

TomOne commented 10 years ago

I wrote a small Bash-script which automatically converts all *.nuspec files from other encodings to UTF-8. This could be very useful for package maintainers that have many packages and want to make sure that every character in their nuspecs gets displayed correctly in the Gallery.

In addition, the script removes BOMs and replaces any XML declaration with <?xml version="1.0" encoding="utf-8"?>, so that package maintainers can see more easily that the file is encoded as UTF-8.

This script requires a Bash interpreter (on Linux, Mac OS X and Windows with Cygwin) and the program “recode” to perform the character encoding conversion. Recode is not installed by default in Cygwin and in most Linux distributions, but it’s quite easy to install it.
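The BOM and XML declaration part can be sketched without Bash or recode. The following Python is not the script described above, only an approximation of the same clean-up on already decoded text; the name `normalize_nuspec` is hypothetical, and adding a declaration when none exists is an assumption on top of what the script is described as doing.

```python
import re

# Matches an XML declaration at the very start of the text, plus trailing whitespace.
XML_DECLARATION = re.compile(r"^\s*<\?xml[^>]*\?>\s*")
CANONICAL_DECLARATION = '<?xml version="1.0" encoding="utf-8"?>\n'


def normalize_nuspec(text):
    """Drop a leading BOM and put a UTF-8 XML declaration on the first line."""
    # A UTF-8 BOM shows up as U+FEFF once the file has been decoded.
    text = text.lstrip("\ufeff")
    # Remove whatever declaration is there (assumption: add one if none exists).
    text = XML_DECLARATION.sub("", text, count=1)
    return CANONICAL_DECLARATION + text
```

Writing the result back with `text.encode("utf-8")` produces a file without BOM, since Python's plain "utf-8" codec never emits one.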

Redsandro commented 10 years ago

This is clever. But also pretty far-fetched in my opinion. And by that I mean that no Windows packager with a right mind will have a bash interpreter running.

I don't mean that as an insult because according to those rules, I don't have a right mind either. :P

TomOne commented 10 years ago

Of course not every Windows user has Cygwin installed. I wrote this script in Bash because I like cross-platform solutions. Personally I don’t like programming/scripting languages which work only on one platform, especially when that platform is a non-free proprietary operating system.

Nobody is forced to use this script and nothing prevents anybody from writing a similar script in PowerShell. :smiley:

TomOne commented 10 years ago

Woah, there’s a PowerShell implementation for Linux: http://pash.sourceforge.net/ :smile:

Redsandro commented 10 years ago

Disregard my earlier message. I was thinking in terms of the cpack pipeline. I thought you were working towards a solution for all packagers from which there is no escape, preferably without them even knowing. :)

TomOne commented 10 years ago

I thought you were working towards a solution for all packagers from which there is no escape, preferably without them even knowing.

Yes, that would be a better solution. First I looked how such a script could be written in PowerShell, but I discovered that it’s a lot more complicated (because you have to write the non-trivial character encoding recognition from scratch) and my PowerShell skills aren’t good enough.

TomOne commented 10 years ago

I have a new idea. Let me make a summary of a possible approach that would fix this issue: The cpack command should perform the following checks before calling NuGet’s pack command:

Check if the affected *.nuspec file is not encoded as UTF-8.
Check if a UTF-8 BOM is present.
Check if one or more Unicode replacement characters (�) are present in the *.nuspec file.

If one of the checks above is true, cpack should throw an exception and inform the package maintainer that something is wrong, i.e. that it’s needed to convert the file to UTF-8. Then the maintainer could fix it and execute cpack again.
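The three checks could look roughly like this. It is a Python sketch of the idea, not a proposal for the actual cpack implementation (cpack is a batch file wrapping NuGet); the function name `check_nuspec` and the message texts are invented for illustration.

```python
UTF8_BOM = b"\xef\xbb\xbf"
REPLACEMENT_CHARACTER = "\ufffd"


def check_nuspec(raw):
    """Return a list of problems found in the raw bytes of a nuspec file."""
    problems = []

    # Check 2 from the list: a UTF-8 BOM at the start of the file.
    if raw.startswith(UTF8_BOM):
        problems.append("UTF-8 BOM present")

    # Check 1: the file must decode as UTF-8 under strict error handling.
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        bad_byte = raw[err.start]
        problems.append(
            f"not valid UTF-8 (byte 0x{bad_byte:02X} at offset {err.start})"
        )
        return problems  # the remaining check needs decoded text

    # Check 3: replacement characters left over from an earlier broken build.
    if REPLACEMENT_CHARACTER in text:
        problems.append("contains U+FFFD replacement character")

    return problems
```

A caller would abort the build and print the messages whenever the list is non-empty. Note that a Windows-1252 file containing only ASCII passes check 1, which matches the summary earlier in the thread: such files cause no harm.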

I think this approach is much better than additional automatic charset conversion because of the following reasons:

It has a better learning effect for maintainers. If an automatic conversion would be performed, some maintainers would continue to use legacy character encodings (such as Windows-1252), without ever knowing that UTF-8 would be the correct encoding.
Package maintainers would learn that a UTF-8 BOM is a bad thing.
To replace � with the correct characters, it would be necessary anyway to manually edit the *.nuspec file.
As I already mentioned, a character encoding recognition is never 100 % accurate in all cases. In some rare cases, an automatic character encoding conversion to UTF-8 could result in an output file with wrong characters.

Please let me know if you have doubts and if you would be interested in this solution.

Redsandro commented 10 years ago

@TomOne UTFCast Express might be relevant to your interests.

(Did not use, just stumbled across)

TomOne commented 10 years ago

@Redsandro, thanks for the hint. Unfortunately, UTFCast Express only supports ASCII, UTF-8, UTF-16 Little Endian and UTF-16 Big Endian. It does not support 8-bit encodings (e.g. Windows-1252), which cause most of our encoding issues at Chocolatey.

I couldn’t find any free program that batch-converts text files and supports all relevant encodings. Developing one would be a nice FOSS project though, especially useful for Windows users/developers :smiley:

Some time ago I created a small node.js script which performs this task for a folder with Chocolatey/NuGet package source files: https://github.com/TomOne/cpkg-utf8-conv

It is still very naive, but it works. Of course it is not suitable for automagically preventing the encoding issues we have, but it’s a nice tool for cleaning up existing package repositories. At least I like it more than my previous Bash script for this task. :smile:

TomOne commented 10 years ago

BTW, some time ago I came across EditorConfig. It’s a plugin for many popular editors which automatically applies specific settings to specific files in a repository. For example, you can specify in the config file that all .nuspec files should be saved as UTF-8 and all .ps1 files as UTF-8 with BOM.

I already use it for my packages repository: https://github.com/TomOne/chocolatey-packages/blob/master/.editorconfig
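A minimal .editorconfig for the two rules mentioned above could look roughly like this. It is a sketch for illustration, not a copy of the linked file; `utf-8` and `utf-8-bom` are values of EditorConfig's `charset` property, and whether an editor plugin honours them depends on the plugin.

```ini
# Sketch only; place in the root directory of the packages repository
root = true

# nuspec files: UTF-8 without BOM
[*.nuspec]
charset = utf-8

# PowerShell scripts: UTF-8 with BOM
[*.ps1]
charset = utf-8-bom
```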

@ferventcoder, it would be nice to integrate this into the https://github.com/chocolatey/chocolateytemplates repository as well. Then we could recommend that package maintainers install the EditorConfig plugin, so that at least those folks won’t create packages with inappropriate character encodings.

ferventcoder commented 10 years ago

@TomOne we have editor config in here. If you send me a PR over in the templates repo, I would accept it there. But I'm not sure what good it would do since that is not a chocolatey packages template repository that they start from (if that makes sense), just one that warmup uses to copy the templates into their chocolatey packages repo.

TomOne commented 10 years ago

But I'm not sure what good it would do since that is not a chocolatey packages template repository that they start from (if that makes sense).

That’s right, it does not help much. But we could add to the character encoding section in the wiki that package maintainers should – if they use EditorConfig – copy the .editorconfig file into the root directory of the repo. Unfortunately I noticed that only the plugin for Sublime Text seems to support the charset property of EditorConfig, so only Sublime Text users could actually benefit from that. But note that Sublime Text is one of the most popular code editors nowadays, so it wouldn’t be completely useless.

BTW, the following section is a bit off-topic, but I think it could be interesting for us:

In my previous comments I mentioned that Notepad++ (the editor with the most downloads on Chocolatey.org) uses an incorrect term to denote “UTF-8 without BOM”, which I also wrote into the character encoding section in the wiki. I created a patch on the SourceForge project site for Notepad++ to correct that term. Let’s hope that the Notepad++ devs eventually respond to that issue, so that we can reduce, at least a little, the confusion about character encodings that seems to persist in the Microsoft/Windows community.

ghost commented 9 years ago

Question, if we are told to save as UTF-8 without BOM, then why are the files saved as UTF-8 with BOM?

$ wget rawgit.com/chocolatey/chocolateytemplates/3ea3/_templates/chocolatey/tools/chocolateyInstall.ps1

$ hexdump -C chocolateyInstall.ps1
00000000  ef bb bf 23 4e 4f 54 45  3a 20 50 6c 65 61 73 65  |...#NOTE: Please|

TomOne commented 9 years ago

Because PowerShell uses deprecated character encodings (known as Windows codepages and often incorrectly called “ANSI”) by default and a BOM is the only way to force UTF-8.

See https://github.com/chocolatey/chocolatey/wiki/CreatePackages#character-encoding for more information.
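The effect can be simulated in a few lines of Python. This is not how PowerShell is implemented; it is only a sketch of a reader that falls back to a legacy codepage unless it sees a BOM. The function name `decode_script` and the choice of Windows-1252 as the fallback are assumptions for this example.

```python
import codecs

LEGACY_DEFAULT = "cp1252"  # assumption: a Western European Windows codepage


def decode_script(data: bytes, default: str = LEGACY_DEFAULT) -> str:
    """Hypothetical BOM-sniffing reader: UTF-8 if a BOM is present, else legacy."""
    if data.startswith(codecs.BOM_UTF8):
        # BOM present: drop the marker and decode the rest as UTF-8.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # No BOM: fall back to the legacy default, right or wrong.
    return data.decode(default)


source = 'Write-Host "\u03bcTorrent"'  # \u03bc is the Greek letter mu

without_bom = source.encode("utf-8")
with_bom = source.encode("utf-8-sig")  # "utf-8-sig" prepends the bytes EF BB BF

print(decode_script(without_bom))  # the mu comes out as two wrong characters
print(decode_script(with_bom))     # the original text
```

Without the BOM, the two UTF-8 bytes of μ (CE BC) are read as two separate Windows-1252 characters, so the script would print Î¼Torrent. With the BOM, the same reader recovers the original string. The leading EF BB BF in the hexdump above is exactly this marker.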

gep13 commented 9 years ago

From the link posted above:

If you don’t respect this rule, some characters are not displayed correctly in the Gallery on Chocolatey.org, because the Gallery assumes UTF-8.

chocolatey-archive / chocolatey

Character encoding problems with nuspec files and PowerShell scripts #294