chocolatey-archive / chocolatey

[DEPRECATED - https://github.com/chocolatey/choco] Chocolatey NuGet - Like apt-get, but for windows.
https://chocolatey.org
Apache License 2.0

Character encoding problems with nuspec files and PowerShell scripts #294

Open TomOne opened 11 years ago

TomOne commented 11 years ago

Update from 2013-10-29: Note that some information in this thread is wrong:

Original issue text: I discovered that the cpack command generates a nupkg containing a nuspec file in the ANSI character encoding, even if the original nuspec file was UTF-8. ANSI is considered outdated and deprecated. Here is a nice explanation of this topic: http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/ Using ANSI instead of UTF-8 destroys all characters in the Chocolatey gallery which are not saved in the first byte of the character encoding, for example the μ in μTorrent, the © and many other special characters. It seems that NuGet.exe is responsible for this. Because of this, I’m currently using the NuGet Package Explorer application to build my packages, which uses UTF-8. That works perfectly. I didn’t find any command line switches for NuGet.exe to force UTF-8.

Perhaps some of you think this is an irrelevant issue. But I think it is important to switch to UTF-8 in order to eliminate character encoding problems once and for all. Almost all modern Linux distributions have already done so. Thumbs up for them. :)

ferventcoder commented 11 years ago

Oh I agree with you on this completely. Yet another reason to stop using nuget.exe for the dirty work in the future.

TomOne commented 11 years ago

What are the other reasons to stop using nuget.exe? It should be possible to modify the source code of nuget.exe to get nice and clean UTF-8 files. Unfortunately I’m not able to do that, because I’m not familiar with C#. But also the gallery on nuget.org would benefit from such an improvement. They have the same issue with UTF-8 and ANSI.

ferventcoder commented 11 years ago

It's possible the new version has this figured out. No, the idea is to switch to the lib version of nuget to internalize commands more.



Rob "Be passionate in all you do"

http://devlicio.us/blogs/rob_reynolds http://ferventcoder.com http://twitter.com/ferventcoder

Redsandro commented 10 years ago

The problem is fixed for the website and feeds if people just use HTML character entities, e.g. for the µ in µTorrent.

TomOne commented 10 years ago

That could be a workaround, but I read in multiple coding guidelines that HTML entities should be avoided, except for characters that cannot be told apart visually, e.g. special whitespace characters. HTML entities make the code harder to read and to maintain.

But it seems that newer versions of NuGet use UTF-8 without BOM for the nuspec when a package gets built. I tested it a few weeks ago. A good improvement, even though I wish they had used UTF-8 from their first releases. I think other character encodings have been obsolete in most cases since Unicode came into existence.

Redsandro commented 10 years ago

If it works, then I am happy.

But personally I find it more important that all these "trashy" looking packages show up nicely on the site.

All these "???" failed encoding symbols degrade the perceived quality of packages or the whole Chocolatey system for users.

So if every maintainer would just fix those ???'s into proper encodings until it is no longer necessary, no one will notice the encoding is screwed up under the hood.

TomOne commented 10 years ago

I have bad news for you: NuGet doesn’t even fully support HTML/XML entities. If you use an unsupported entity, the cpack command fails with an error: … not declared entity … Line x, Position y

Here are some examples of commonly used entities:

Supported by NuGet:

Not supported by NuGet (outputs the error described above):

Well, I don’t think it’s very comfortable to use the decimal or hexadecimal notation.
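For illustration, here is a minimal Python sketch of the same behaviour in a generic XML parser. It uses Python's standard `xml.etree.ElementTree`, not NuGet's own parser, so it only demonstrates the general XML rule: the five predefined entities and numeric character references are accepted, while HTML-only named entities such as `&micro;` are undeclared unless a DTD declares them. The helper names are made up for this example.

```python
import xml.etree.ElementTree as ET


def parse_text(xml_snippet: str) -> str:
    """Parse a tiny XML document and return the text of its root element."""
    return ET.fromstring(xml_snippet).text


def is_entity_accepted(entity: str) -> bool:
    """Return True if a plain XML parser accepts the given entity reference."""
    try:
        parse_text(f"<description>{entity}</description>")
        return True
    except ET.ParseError:
        # Named entities other than the predefined XML ones are undeclared
        # without a DTD, so the parser rejects them.
        return False


if __name__ == "__main__":
    for entity in ["&amp;", "&lt;", "&#181;", "&#xB5;", "&micro;", "&copy;"]:
        status = "accepted" if is_entity_accepted(entity) else "rejected"
        print(entity, status)
```

The decimal and hexadecimal forms (`&#181;`, `&#xB5;`) both decode to µ, which is why they work where the named HTML entity does not.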

Redsandro commented 10 years ago

Okay, that sucks.

But the fact is, this unicode thing doesn't work. It should, but it doesn't. Only ansi.

I'm concerned about the page looking wrong. So for now, packagers shouldn't use unicode characters. Use (C) (R) u etc. instead.

Failed characters

ferventcoder commented 10 years ago

yeah, this part sucks.

TomOne commented 10 years ago

I’m very sorry. I made a few mistakes here, mostly because of my inadequate knowledge about character encodings. Therefore I’ve done some research to improve it. ;) So here’s the correction:

But fortunately there’s a simple solution for that. We must make sure that package maintainers never use any encoding other than UTF-8 without BOM. They should configure their editors to handle that encoding properly. Note that UTF-8 without BOM is the recommended character encoding for HTML pages and fortunately many websites use it, including nuget.org and chocolatey.org. So if web developers are able to set the character encoding properly, why not also chocolatey package maintainers?

The thing with HTML/XML entities is definitely more complicated than setting the correct encoding, so I would leave the entities out.

@ferventcoder, this character encoding stuff would be another rule that should be added to the Wiki. I would be happy to do that.

Redsandro commented 10 years ago

I didn't know that, thanks for looking it up. So these problematic packages have a nuspec that was saved with improper encoding.

Yes, please add a note to the wiki. :)

Redsandro commented 10 years ago

Geany Editor Encoding

TomOne commented 10 years ago

You’re using Geany. That’s cool. :+1:

Note that “UTF-8” in Geany means “UTF-8 without BOM” which is fine, while for example Notepad++ displays “UTF-8” when the file contains a BOM. In Notepad++, UTF-8 without BOM is incorrectly called “ANSI as UTF-8”. That’s a weird terminology.
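Since editors label the same bytes differently, one way to settle what a file really contains is to look at its first three bytes. The following is a small Python sketch under that assumption; the function names are hypothetical and it only checks for the UTF-8 BOM (EF BB BF), nothing else.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def file_has_utf8_bom(path: str) -> bool:
    """Check only the first three bytes of a file on disk."""
    with open(path, "rb") as handle:
        return has_utf8_bom(handle.read(len(codecs.BOM_UTF8)))


if __name__ == "__main__":
    # "utf-8-sig" is Python's name for UTF-8 with a BOM.
    print(has_utf8_bom("µTorrent".encode("utf-8-sig")))
    print(has_utf8_bom("µTorrent".encode("utf-8")))
```

Whatever the editor calls it, a file for which this check is false and which still decodes as UTF-8 is "UTF-8 without BOM".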

It doesn’t surprise me that there is so much confusion about character encodings, mostly thanks to Microsoft. :-1: But at least some Linux- and cross-platform editors like Geany chose the right terminology. :)

Redsandro commented 10 years ago

You recognize Geany. That's cool! :smile:

TomOne commented 10 years ago

Microsoft itself accepted that ANSI is an inappropriate term, see https://en.wikipedia.org/wiki/Windows_code_page#ANSI_code_page. So that term shouldn’t be used anymore.

And by the way, there are already bug reports for Notepad++ to correct these incorrect terms:

I would really like the Notepad++ devs to fix these annoying, incorrect and misleading terms, so please post your comments in these bug reports if you can, so we can finally give these bugs the attention they deserve.

TomOne commented 10 years ago

I wrote this paragraph regarding the character encoding. As soon as you approve it, I’ll add it to the Creating Packages section in the Wiki, right after Rules to be observed before publishing packages:

Character encoding

Redsandro commented 10 years ago

Wow, that's elaborate.

I've added a line to the guide for people who don't like reading:

If you anchor your character encoding paragraph, I can link to it.

ferventcoder commented 10 years ago

GH wiki anchors headers automatically.

On Friday, October 25, 2013, Redsandro wrote:

Wow, that's elaborate.

I've added a line to the guide for people who don't like reading (https://github.com/chocolatey/chocolatey/wiki/CreatePackagesQuickStart):

  • You must save your files with UTF-8 character encoding without BOM.

If you anchor your character encoding paragraph, I can link to it.


ferventcoder commented 10 years ago

Go forth and wiki. :)

On Friday, October 25, 2013, TomOne wrote:

I wrote this paragraph regarding the character encoding. As soon as you approve it, I’ll add it to the Creating Packages section (https://github.com/chocolatey/chocolatey/wiki/CreatePackages) in the Wiki, right after Rules to be observed before publishing packages (https://github.com/chocolatey/chocolatey/wiki/CreatePackages#rules-to-be-observed-before-publishing-packages):

Character encoding

  • Use the UTF-8 character encoding for all plain text files in your package, especially for the .nuspec and .ps1 files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on chocolatey.org, because the Gallery assumes UTF-8.
  • Don’t save your UTF-8 files with a byte order mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.
  • Note that there’s a lot of confusion in the world of character encodings: For example, ANSI is an incorrect term for the non-standardized Windows character encodings, e.g. Windows-1252. But you should not use this encoding family anyway. In addition, Notepad++ incorrectly uses the term “ANSI as UTF-8” for UTF-8 without BOM. If you select UTF-8 in Notepad++, it means UTF-8 with BOM.
  • Specify the UTF-8 encoding in the first line of your nuspec files. Then the first line looks like this: <?xml version="1.0" encoding="utf-8"?>.
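The rules above could be checked mechanically before publishing. What follows is only a rough Python sketch of such a check, not an existing Chocolatey or NuGet tool: the function name and the messages are invented, and the declaration test is a simple regular expression on the first line rather than a full XML parse.

```python
import codecs
import re
from typing import List

# Matches a first line like: <?xml version="1.0" encoding="utf-8"?>
XML_DECLARATION = re.compile(
    r'<\?xml\s+version="1\.0"\s+encoding="utf-8"\s*\?>', re.IGNORECASE
)


def check_nuspec_bytes(data: bytes) -> List[str]:
    """Return a list of rule violations found in raw nuspec bytes."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        data = data[len(codecs.BOM_UTF8):]
    try:
        text = data.decode("utf-8")  # strict decoding: invalid bytes raise
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
        return problems
    first_line = text.splitlines()[0] if text else ""
    if not XML_DECLARATION.match(first_line):
        problems.append('first line does not declare encoding="utf-8"')
    return problems
```

An empty list means the bytes satisfy all three checks. A nuspec saved as Windows-1252 with a µ in it fails the strict UTF-8 decode, because the lone byte B5 is not a valid UTF-8 sequence.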


TomOne commented 10 years ago

Amen :)

Redsandro commented 10 years ago

Speaking of wiki rules, those rules are annoying to read for people who have sexdaily dyslexia or otherwise have difficulty reading.

It is pragmatically clever to have every basic rule summarized within the rule and bold that section.

E.g.:

  • Packages of software that is illegal in most countries in the world are prohibited to publish on chocolatey.org. This applies in particular to software that violates the copyright, pirated software and activation cracks. Remember that this also affects software that is especially designed to accomplish software piracy.

Could be:

  • Don't package illegal software. Packages of software that is illegal in most countries in the world are prohibited to publish on chocolatey.org. This applies in particular to software that violates the copyright, pirated software and activation cracks. Remember that this also affects software that is especially designed to accomplish software piracy.

Redsandro commented 10 years ago

TomOne, I don't see your section yet. Once you've updated it, can you give me the anchor?

TomOne commented 10 years ago

The anchor is generated automatically as @ferventcoder already mentioned: https://github.com/chocolatey/chocolatey/wiki/CreatePackages#character-encoding

Damn, I just found out that PowerShell doesn’t even recognize UTF-8 files without BOM. It uses Microsoft’s ancient Windows-1252 encoding by default. So we have to save PowerShell scripts in UTF-8 with BOM to work correctly. Sorry Microsoft, but we live in 2013. Nobody wants your stupid non-standardized encodings! :-1:

TomOne commented 10 years ago

@Redsandro, you are right, improve that section for people who have dyslexia. :)

Redsandro commented 10 years ago

@TomOne if I may make a suggestion..

  • Use the UTF-8 character encoding for the .nuspec and .ps1 files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on chocolatey.org, because the Gallery assumes UTF-8.
  • Don’t save your nuspec files with a byte order mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.

:smile:

-edit-

Oh wait, are you saying the BOM part is uncertain now? I'm confused. Luckily I don't often use strange characters in my packages.

TomOne commented 10 years ago

UTF-8 and the byte order mark are two different things. A BOM is a special character at the beginning of a file which signals the endianness (byte order) of a text file. Because the endianness has absolutely no meaning in UTF-8, it is not recommended to use a BOM there, see https://en.wikipedia.org/wiki/Byte_order_mark.
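To make the endianness point concrete, here is a short Python sketch that encodes the BOM character U+FEFF in different encodings. The helper name is made up; the codecs are Python's standard ones.

```python
BOM = "\ufeff"  # U+FEFF, the code point used as byte order mark


def bom_bytes(encoding: str) -> bytes:
    """Encode the BOM character with an explicit-endianness codec."""
    return BOM.encode(encoding)


if __name__ == "__main__":
    # In UTF-16 the byte order differs, so the BOM tells the reader which one
    # was used. In UTF-8 there is only one possible byte sequence.
    for encoding in ("utf-16-be", "utf-16-le", "utf-8"):
        print(encoding, bom_bytes(encoding).hex())

    raw = b"\xef\xbb\xbfabc"
    # A plain UTF-8 decoder keeps the BOM as an invisible first character,
    # while Python's "utf-8-sig" codec strips it.
    print(repr(raw.decode("utf-8")))
    print(repr(raw.decode("utf-8-sig")))
```

The invisible leading character left behind by a plain UTF-8 decoder is the usual source of BOM-related trouble in tools that do not expect it.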

But there are some programs out there which don’t follow this standard, e. g. PowerShell and the Windows Editor. As you can see, Microsoft is retarded when it comes to this topic. It reminds me a bit of the story with IE, which web developers like so much. Haha. ;)

Redsandro commented 10 years ago

Hmm. Windows is known to follow their own standard. Chocolatey is Windows-only. Let's just follow what they 'say' is the standard. :tongue:

Just like this line ending thing. Linux: LF, OS X: LF (classic Mac OS used CR), Windows: CRLF

Create a package in Linux and in Windows it's all messed up if you don't use the proper newline code. But I shouldn't be using Linux to create packages for Windows. :P
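If files do get written on another platform, the line endings can be normalized afterwards. Below is a minimal Python sketch with a hypothetical helper name; it assumes the text is already decoded and that CRLF is the desired target.

```python
def to_crlf(text: str) -> str:
    """Normalize LF, CR and CRLF line endings to CRLF."""
    # Collapse everything to LF first, so existing CRLF pairs are not doubled,
    # then expand every LF to CRLF.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", "\r\n")


if __name__ == "__main__":
    mixed = "line one\nline two\r\nline three\rline four"
    print(repr(to_crlf(mixed)))
```

When reading or writing such files in Python, passing `newline=""` to `open()` keeps Python from translating the line endings on its own.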

gep13 commented 10 years ago

When you mention hosting packages elsewhere, when they are not intended for everyone, is it worth mentioning somewhere like MyGet.org? i.e. The package can still be on a public feed, but not polluting the Chocolatey.org feed?

Gary

Redsandro commented 10 years ago

You are replying to the character encoding wiki thing. I am not sure what you mean. Did you mean to reply to #355 ? Not sure what you mean there either. I'm sure other public feeds won't appreciate unacceptable packages either.

ferventcoder commented 10 years ago

No BOM- it causes more issues than it helps. I find it weird that powershell would get it wrong. Do you have a gist where it shows it outputting the detected encoding?


gep13 commented 10 years ago

@Redsandro You are right, my comment is directed here:

https://github.com/chocolatey/chocolatey/issues/355

Sorry :)

TomOne commented 10 years ago

Wrong, it’s not “no BOM” that causes issues; Microsoft causes them by ignoring standards!

A gist is not necessary to show you PowerShell’s behavior. PowerShell simply assumes that all PowerShell scripts are in that crappy Windows-1252 encoding unless there’s a BOM. Then it simply reads the bytes from the file and interprets them as if they were Windows-1252. Example:

test.ps1, encoded in UTF-8 without BOM:

Write-Host "αβγ © ™ ΩΣ"

Output in Powershell

In C:\Users\Someuser\test.ps1:1 Character:31
+ Write-Host "αβγ © ™ ΩΣ"
+                               ~
The string has no terminator: ".
    + CategoryInfo          : ParserError: (:) [], ParseException
    + FullyQualifiedErrorId : TerminatorExpectedAtEndOfString
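The misreading itself can be reproduced outside PowerShell. This Python sketch simply encodes the script line as UTF-8 and decodes the bytes as Windows-1252 (`cp1252`), which is what the description above assumes PowerShell does; the variable and function names are made up.

```python
SOURCE_LINE = 'Write-Host "αβγ © ™ ΩΣ"'


def misread_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then decode the same bytes as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


garbled = misread_as_cp1252(SOURCE_LINE)

if __name__ == "__main__":
    print(garbled)
    print(len(garbled))
    # The UTF-8 bytes of the trademark sign (E2 84 A2) contain byte 84, which
    # Windows-1252 maps to U+201E, a low double quotation mark.
    print("\u201e" in garbled)
```

The garbled line is 31 characters long, which lines up with the position in the error message. As far as I know, PowerShell's parser also accepts typographic double quotes such as „ as string delimiters, so the string would end early at that character and the final `"` would open a new, unterminated string. Treat that last part as a likely explanation rather than a verified one.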

I think the character encoding section in the Wiki is fine. It clearly explains what a person should do and clarifies the confusion that Microsoft has created. If someone doesn’t understand this, he’s probably retarded and shouldn’t create packages. ;)

TomOne commented 10 years ago

And by the way, all modern Linux distributions use UTF-8 without BOM by default. The W3C doesn’t recommend BOMs for the web.

BOMs for UTF-8 really cause issues: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 Some issues that I’ve experienced myself when using UTF-8 with a BOM:

Well, Microsoft seems to insist on BOMs when using UTF-8. But just because Chocolatey is Windows software doesn’t mean that it has to adopt Microsoft’s inability to follow standards and recommendations, especially when it comes to such fundamental things as character encoding.

I read from multiple sources that Microsoft is not very open to suggestions. For example, they keep ignoring bug reports and feature requests for Internet Explorer and automatically delete them after about 48 hours. What kind of weird strategy is this? Are they planning the downfall of their company? Well, at least for IE it seems to work, when I look at the progression of the market share of Internet Explorer. :grinning:

ferventcoder commented 10 years ago

I mostly ignore things I can't change and change the things I can change.



On Sat, Oct 26, 2013 at 3:31 PM, TomOne wrote:

And by the way, all modern Linux distributions use UTF-8 without BOM by default. The W3C doesn’t recommend BOMs for the web (http://www.w3.org/International/questions/qa-byte-order-mark.en.php).

BOMs for UTF-8 really cause issues: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 Some issues that I’ve experienced myself when using UTF-8 with a BOM:

  • .gitignore files don’t work properly. The expression in the first line fails because the BOM character stands in the way.
  • A colleague created a small node.js script which merges multiple CSV files. The output file contained multiple BOM characters, even in the middle of a line. Then my colleague imported that into a database using phpMyAdmin. Then he experienced database inconsistencies. It took him over an hour until he found out that the BOMs were causing that issue.
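The CSV case can be reproduced in a few lines. This is a Python sketch, not the node.js script from the anecdote: the sample data and function names are invented, and the fix shown (decoding each part with `utf-8-sig` before joining) is just one possible approach.

```python
import codecs
from typing import Iterable


def naive_merge(parts: Iterable[bytes]) -> bytes:
    """Concatenate raw file contents, BOMs included."""
    return b"".join(parts)


def bom_aware_merge(parts: Iterable[bytes]) -> bytes:
    """Strip a leading BOM from every part, then write BOM-free UTF-8."""
    # "utf-8-sig" drops a leading BOM if present and otherwise behaves
    # like plain UTF-8.
    text = "".join(part.decode("utf-8-sig") for part in parts)
    return text.encode("utf-8")


part_a = "id,name\n1,µTorrent\n".encode("utf-8-sig")
part_b = "2,Geany\n".encode("utf-8-sig")

if __name__ == "__main__":
    merged = naive_merge([part_a, part_b])
    # One BOM at the start, one stranded in the middle of the data.
    print(merged.count(codecs.BOM_UTF8))
    print(bom_aware_merge([part_a, part_b]).count(codecs.BOM_UTF8))
```

The stranded BOM in the naive result ends up glued to the first field of a data row, which matches the kind of inconsistency described above.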

Well, Microsoft seems to insist on BOMs when using UTF-8. But just because Chocolatey is Windows software doesn’t mean that it has to adopt Microsoft’s inability to follow standards and recommendations, especially when it comes to such fundamental things as character encoding.

I read from multiple sources that Microsoft is not very open to suggestions. For example, they keep ignoring bug reports and feature requests for Internet Explorer and automatically delete them after about 48 hours. What kind of weird strategy is this? Are they planning the downfall of their company? Well, at least for IE it seems to work, when I look at the progression of the market share of Internet Explorer. :grinning:


Redsandro commented 10 years ago

That's some wisdom right there. ;)

TomOne commented 10 years ago

I mostly ignore things I can't change and change the things I can change.

Sounds pretty clear to me: We make chocolatey better by leaving the character encoding section as it is and let IE die. :grinning:

Sorry if I mention it, but your wisdom phrase has an inconsistency: How can you be sure that you can’t change something? If you take a closer look at history, you will notice great people who have seemingly done the impossible. But today we know that it wasn’t impossible. Thanks to this kind of people we now have democracy, women's suffrage and no more slavery. At least in the countries where we live.

edit: sorry, I meant something, not anything ;)

Redsandro commented 10 years ago

Isn't the problem solved by using the 'proper' charset?

TomOne commented 10 years ago

Of course :) Simply UTF-8 without BOM for *.nuspec files and UTF-8 with BOM for PowerShell scripts.
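For anyone generating package files from a script instead of an editor, the two variants could be produced like this. It is a Python sketch with hypothetical helper names; `utf-8` writes no BOM and `utf-8-sig` prepends one.

```python
from pathlib import Path


def encode_nuspec(text: str) -> bytes:
    """Encode nuspec content as UTF-8 without BOM."""
    return text.encode("utf-8")


def encode_ps1(text: str) -> bytes:
    """Encode PowerShell script content as UTF-8 with BOM."""
    return text.encode("utf-8-sig")


def write_package_files(folder: Path, nuspec_text: str, ps1_text: str) -> None:
    """Write both files; the file names here are only examples."""
    # write_bytes avoids any newline translation by the platform.
    (folder / "example.nuspec").write_bytes(encode_nuspec(nuspec_text))
    (folder / "chocolateyInstall.ps1").write_bytes(encode_ps1(ps1_text))
```

Encoding explicitly and writing bytes keeps the result independent of the platform's default encoding.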

Redsandro commented 10 years ago

Notepad++ incorrectly uses the term ANSI as UTF-8 for UTF-8 encoded files without a BOM.

What version are you running? I have no "ANSI as UTF-8" option, only "UTF-8 without BOM"

Notepad++ encoding

Version 6.2.2 Build Nov18 2012. (year old)

ferventcoder commented 10 years ago

Open the file after saving it that way. That's what I think Tom is referring to.


Redsandro commented 10 years ago

Oh you're right! It goes back to ANSI.

TomOne commented 10 years ago

Screenshot of Notepad++ 6.5: npp-ansi-as-utf-8

Notepad++ is inconsistent, that’s even worse. It’s also mentioned in these two bug reports:

TomOne commented 10 years ago

@Redsandro, I think you misunderstood something here. When you open a file with no special characters, the encoding cannot be determined by the text editor. Then it depends on the settings of the editor which encoding is assumed. In Notepad++, you can override this setting if you tick “Apply to opened ANSI files” like in the screenshot above, which will assume UTF-8 for every file that has no special characters and no BOM. That’s the standard for every Linux editor I know, and it’s a good standard. :+1:

TomOne commented 10 years ago

Sorry, I meant “which will assume UTF-8 for every file that has no non-ASCII characters and no BOM.”
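The reason an editor cannot tell the encodings apart for such files is that ASCII-only text produces identical bytes in UTF-8 and Windows-1252. A small Python sketch with a made-up helper name shows this:

```python
def encodings_agree(text: str) -> bool:
    """Return True if UTF-8 and Windows-1252 yield the same bytes for text."""
    try:
        return text.encode("utf-8") == text.encode("cp1252")
    except UnicodeEncodeError:
        # The text contains characters that Windows-1252 cannot represent.
        return False


if __name__ == "__main__":
    print(encodings_agree("Write-Host 'hello'"))  # ASCII only
    print(encodings_agree("µTorrent"))            # one non-ASCII character
```

As soon as a single non-ASCII character appears, the byte sequences differ and the assumed encoding starts to matter.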

TomOne commented 10 years ago

I updated the title of this issue, because the old one was misleading and wrong.

Now I think we have all the information needed to fix the character encoding problem. If you have no further doubts or questions, this issue can be closed.

Redsandro commented 10 years ago

Documentation is important, especially for new people, who get a first impression from it. So I have a few suggestions, if I may. :smile:

Suggested edit that implements all of these:


Character encoding

Either:

Or:

Note: There is a lot of confusion in the world of character encodings: For example, ANSI is an incorrect term for the internal Windows character encodings, e.g. Windows-1252. But you should not use this encoding family anyway. In addition, Notepad++ incorrectly uses the term ANSI as UTF-8 for UTF-8 encoded files without a BOM. If you select UTF-8 in Notepad++, it means UTF-8 with BOM. Therefore Notepad++ must show ANSI as UTF-8 in the statusbar.


TomOne commented 10 years ago

Good suggestions, except this one:

Either: Do not use non-ASCII characters.

This is not a solution to the character encoding problem, only an amateurish circumvention for lazy people. I’m very annoyed by all these reductions to typographically incorrect ASCII characters, like e' instead of è, ue instead of ü, ss instead of ß or (c) instead of ©.

I expect an advanced computer user, programmer or package maintainer to know the important things about character encodings, or at least to be open to learning them. People who don’t like to read and follow rules, guidelines and standards should never ever create and publish packages. Such people are harmful to the community and make the world a worse place.

Don’t we all want packages of good quality? Isn’t it worth spending a few more minutes to read, understand and follow the rules and guidelines of a project?

ferventcoder commented 10 years ago

How about "Avoid the use of non-ASCII characters"?

And how about we update the templates so that they are in line with these recommendations?

TomOne commented 10 years ago

Doesn’t that mean the same as “do not use”, just with other words? I definitely don’t like this recommendation. It’s like sweeping the dirt under the carpet. :-1: Seriously, what is so hard about reading a few lines and changing the character encoding settings of the editor if necessary?

I’ll make a pull request so that the templates comply with these recommendations.

Redsandro commented 10 years ago

Good suggestions, except this one

Well, I bet 95% of package maintainers (including myself) never gave it a thought and everything works as long as you stick to ASCII characters, because the encoding doesn't really matter then.

So, in practice (practice trumps theory) it's a proper solution, widely used within packages, until you are a seasoned packager and want to do crazy stuff like:

μTorrent, © 2013 bitTorrent™®