TomOne opened this issue 11 years ago
Oh I agree with you on this completely. Yet another reason to stop using nuget.exe for the dirty work in the future.
What are the other reasons to stop using nuget.exe? It should be possible to modify the source code of nuget.exe to get nice and clean UTF-8 files. Unfortunately I’m not able to do that, because I’m not familiar with C#. But also the gallery on nuget.org would benefit from such an improvement. They have the same issue with UTF-8 and ANSI.
It's possible the new version has this figured out. No, the idea is to switch to the lib version of NuGet to internalize commands more.
Rob "Be passionate in all you do"
http://devlicio.us/blogs/rob_reynolds http://ferventcoder.com http://twitter.com/ferventcoder
The problem is fixed for the website and feeds if people just use HTML entities for the special characters, like the µ in µTorrent.
That could be a workaround, but I have read in multiple coding guidelines that HTML entities should be avoided, except for visually indistinguishable characters such as the non-breaking space. HTML entities make the code harder to read and to maintain.
But it seems that newer versions of NuGet use UTF-8 without BOM for the nuspec when a package gets built. I tested it a few weeks ago. A good improvement, though I would have liked them to use UTF-8 from their first release. I think other character encodings have been obsolete in most cases since the advent of Unicode.
If it works, then I am happy.
But personally I find it more important that all these "trashy" looking packages show up nicely on the site.
All these "???" failed encoding symbols degrade the perceived quality of packages or the whole Chocolatey system for users.
So if every maintainer just fixes those ???'s into proper characters until it is no longer necessary, no one will notice the encoding is screwed up under the hood.
I have bad news for you: NuGet doesn’t even have full support for HTML/XML entities. If you use an unsupported entity, the `cpack` command will fail and output an error like: … not declared entity … Line x, Position y
Here are some examples of commonly used entities:
Supported by NuGet:
`&gt;`
`&lt;`
`&amp;`
`&#190;` or `&#174;` (numeric references)
Not supported by NuGet (outputs the error described above):
`&copy;` `&reg;` `&micro;` `&trade;`
and many more. Well, I don’t think it’s very comfortable to use the decimal or hexadecimal notation.
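The distinction can be demonstrated outside of NuGet: XML itself predefines only five named entities (`&lt;`, `&gt;`, `&amp;`, `&apos;`, `&quot;`), so any strict XML parser accepts numeric character references but rejects HTML-only names. A small Python sketch using Python's standard XML parser (not NuGet's actual parser):

```python
# Why numeric references work in a nuspec while named HTML entities fail:
# XML predefines only five entities; &copy; & friends are HTML-only.
import xml.etree.ElementTree as ET

# Numeric reference: fine, &#169; is the copyright sign
doc = ET.fromstring("<description>&#169; 2013</description>")
print(doc.text)  # © 2013

# Named HTML entity: a strict XML parser rejects it as undeclared
try:
    ET.fromstring("<description>&copy; 2013</description>")
except ET.ParseError as err:
    print("parse error:", err)
```

This mirrors the "not declared entity" failure described above, though NuGet's exact error text may differ.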
Okay, that sucks.
But the fact is, this Unicode thing doesn't work. It should, but it doesn't. Only ANSI.
I'm concerned about the page looking wrong. So for now, packagers shouldn't use Unicode characters. Use (C), (R), u, etc. instead.
yeah, this part sucks.
I’m very sorry. I made a few mistakes here, mostly because of my inadequate knowledge about character encodings. Therefore I’ve done some research to improve it. ;) So here’s the correction:
The `cpack` command does output nuspec files in UTF-8. To be more precise, it’s UTF-8 without a BOM. As you probably know, a character encoding can’t be determined easily unless there’s a BOM. Text editors have to guess the encoding with heuristic character detection: they basically check whether a character in the text “looks weird” in a certain encoding, then switch to a more appropriate one. Of course this method is not 100 % reliable. If a text contains only ASCII characters, for example, an editor cannot determine the encoding, because UTF-8 encodes the first 128 characters exactly like ASCII.
NuGet assumes that nuspec files are encoded in UTF-8 without a BOM. Therefore, if a package maintainer uses another character encoding like Windows-1252, some characters (like the famous © or µ) will be displayed incorrectly on chocolatey.org.
But fortunately there’s a simple solution for that. We must make sure that package maintainers never use any encoding other than UTF-8 without BOM. They should configure their editors to handle that encoding properly. Note that UTF-8 without BOM is the recommended character encoding for HTML pages, and fortunately many websites use it, including nuget.org and chocolatey.org. So if web developers are able to set the character encoding properly, why not Chocolatey package maintainers too?
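The detection problem described here can be sketched in a few lines: a pure-ASCII file decodes identically under ASCII, UTF-8 and Windows-1252, so an editor has nothing to go on unless there is a BOM or a non-ASCII byte. A rough Python illustration (a simplified guess, not any real editor's algorithm):

```python
import codecs

data = b"This nuspec line is pure ASCII."
# Identical under all three decodings -> nothing for an editor to detect
assert data.decode("ascii") == data.decode("utf-8") == data.decode("cp1252")

def sniff(raw: bytes) -> str:
    """Very rough detection heuristic, loosely like an editor's."""
    if raw.startswith(codecs.BOM_UTF8):
        return "UTF-8 with BOM"
    try:
        raw.decode("utf-8")
        return "UTF-8 (or plain ASCII) without BOM"
    except UnicodeDecodeError:
        return "some legacy encoding, e.g. Windows-1252"

print(sniff(data))                          # UTF-8 (or plain ASCII) without BOM
print(sniff("µTorrent".encode("cp1252")))   # some legacy encoding, e.g. Windows-1252
```

The second call shows why a Windows-1252 µ breaks a UTF-8 consumer: the lone 0xB5 byte is not valid UTF-8.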
The thing with HTML/XML entities is definitely more complicated than setting the correct encoding, so I would leave the entities out.
@ferventcoder, this character encoding stuff would be another rule that should be added to the Wiki. I would be happy to do that.
I didn't know that, thanks for looking it up. So these problematic packages have a nuspec that was saved with improper encoding.
Yes, please add a note to the wiki. :)
You’re using Geany. That’s cool. :+1:
Note that “UTF-8” in Geany means “UTF-8 without BOM” which is fine, while for example Notepad++ displays “UTF-8” when the file contains a BOM. In Notepad++, UTF-8 without BOM is incorrectly called “ANSI as UTF-8”. That’s a weird terminology.
It doesn’t surprise me that there is so much confusion about character encodings, mostly thanks to Microsoft. :-1: But at least some Linux- and cross-platform editors like Geany chose the right terminology. :)
You recognize Geany. That's cool! :smile:
Microsoft itself has acknowledged that ANSI is an inappropriate term, see https://en.wikipedia.org/wiki/Windows_code_page#ANSI_code_page. So that term shouldn’t be used anymore.
And by the way, there are already bug reports for Notepad++ to correct these incorrect terms:
I would really like the Notepad++ devs to fix these annoying, incorrect and misleading terms, so please post your comments in these bug reports if you can, so we can finally give these bugs the attention they deserve.
I wrote this paragraph regarding the character encoding. As soon as you approve it, I’ll add it to the Creating Packages section in the Wiki, right after Rules to be observed before publishing packages:
`<?xml version="1.0" encoding="utf-8"?>`.
Wow, that's elaborate.
I've added a line to the guide (https://github.com/chocolatey/chocolatey/wiki/CreatePackagesQuickStart) for people who don't like reading:
- You must save your files with UTF-8 character encoding without BOM.
If you anchor your character encoding paragraph, I can link to it.
GH wiki anchors headers automatically.
Go forth and wiki. :)
On Friday, October 25, 2013, TomOne wrote:
I wrote this paragraph regarding the character encoding. As soon as you approve it, I’ll add it to the Creating Packages section in the Wiki (https://github.com/chocolatey/chocolatey/wiki/CreatePackages), right after Rules to be observed before publishing packages (https://github.com/chocolatey/chocolatey/wiki/CreatePackages#rules-to-be-observed-before-publishing-packages): Character encoding
- Use the UTF-8 character encoding for all plain text files in your package, especially for the .nuspec and .ps1 files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on chocolatey.org, because the Gallery assumes UTF-8.
- Don’t save your UTF-8 files with a byte order mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.
- Note that there’s a lot of confusion in the world of character encodings: For example, ANSI is an incorrect term for the non-standardized Windows character encodings, e.g. Windows-1252. But you should not use this encoding family anyway. In addition, Notepad++ incorrectly uses the term “ANSI as UTF-8” for UTF-8 without BOM. If you select UTF-8 in Notepad++, it means UTF-8 with BOM.
- Specify the UTF-8 encoding in the first line of your nuspec files. Then the first line looks like this: <?xml version="1.0" encoding="utf-8"?>.
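As a rough illustration of these rules, here is a Python sketch of a checker (a hypothetical helper, not part of any Chocolatey or NuGet tooling) that flags a nuspec violating them:

```python
import codecs

def check_nuspec(raw: bytes) -> list:
    """Return rule violations for a nuspec file's raw bytes (hypothetical helper)."""
    problems = []
    if raw.startswith(codecs.BOM_UTF8):
        problems.append("file starts with a UTF-8 BOM")
        raw = raw[len(codecs.BOM_UTF8):]
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return problems + ["file is not valid UTF-8"]
    if not text.lstrip().startswith('<?xml version="1.0" encoding="utf-8"?>'):
        problems.append('first line should be <?xml version="1.0" encoding="utf-8"?>')
    return problems

good = '<?xml version="1.0" encoding="utf-8"?>\n<package/>'.encode("utf-8")
print(check_nuspec(good))                     # []
print(check_nuspec(codecs.BOM_UTF8 + good))   # ['file starts with a UTF-8 BOM']
```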
Amen :)
Speaking of wiki rules, those rules are annoying to read for people who have dyslexia or otherwise have difficulty reading.
It is pragmatic to summarize every basic rule at its start and put that summary in bold.
E.g.:
- Packages of software that is illegal in most countries in the world are prohibited to publish on chocolatey.org. This applies in particular to software that violates the copyright, pirated software and activation cracks. Remember that this also affects software that is especially designed to accomplish software piracy.
Could be:
- Don't package illegal software. Packages of software that is illegal in most countries in the world are prohibited to publish on chocolatey.org. This applies in particular to software that violates the copyright, pirated software and activation cracks. Remember that this also affects software that is especially designed to accomplish software piracy.
TomOne, I don't see your section yet. Once you've updated it, can you give me the anchor?
The anchor is generated automatically as @ferventcoder already mentioned: https://github.com/chocolatey/chocolatey/wiki/CreatePackages#character-encoding
Damn, I just found out that PowerShell doesn’t even recognize UTF-8 files without BOM. It uses Microsoft’s ancient Windows-1252 encoding by default. So we have to save PowerShell scripts in UTF-8 with BOM to work correctly. Sorry Microsoft, but we live in 2013. Nobody wants your stupid non-standardized encodings! :-1:
@Redsandro, you are right, improve that section for people who have dyslexia. :)
@TomOne if I may make a suggestion..
- Use the UTF-8 character encoding for the .nuspec and .ps1 files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on chocolatey.org, because the Gallery assumes UTF-8.
- Don’t save your nuspec files with a byte order mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.
:smile:
-edit-
Oh wait, are you saying the BOM part is uncertain now? I'm confused. Luckily I don't often use strange characters in my packages.
UTF-8 and the byte order mark are two different things. A BOM is a special character at the beginning of a file which signals the endianness (byte order) of a text file. Because endianness has absolutely no meaning in UTF-8, it is not recommended to use a BOM there; see https://en.wikipedia.org/wiki/Byte_order_mark.
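Concretely, the BOM is the code point U+FEFF serialized in the file's encoding. In UTF-16 the two possible byte orders reveal the endianness; in UTF-8 it is always the same three bytes, which is why it carries no useful information there. A Python sketch:

```python
import codecs

# U+FEFF serialized in different encodings
print(codecs.BOM_UTF16_LE.hex())  # fffe   -> little-endian UTF-16
print(codecs.BOM_UTF16_BE.hex())  # feff   -> big-endian UTF-16
print(codecs.BOM_UTF8.hex())      # efbbbf -> always the same three bytes

# Python's "utf-8-sig" codec adds/strips the BOM; plain "utf-8" keeps it
with_bom = "©".encode("utf-8-sig")
assert with_bom == codecs.BOM_UTF8 + "©".encode("utf-8")
assert with_bom.decode("utf-8") == "\ufeff©"   # stray U+FEFF survives
assert with_bom.decode("utf-8-sig") == "©"     # BOM stripped
```

The stray U+FEFF in the third assertion is exactly the invisible character that trips up BOM-unaware tools.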
But there are some programs out there which don’t follow this standard, e.g. PowerShell and Windows Notepad. As you can see, Microsoft is careless when it comes to this topic. It reminds me a bit of the story with IE, which web developers like so much. Haha. ;)
Hmm. Windows is known to follow their own standard. Chocolatey is Windows-only. Let's just follow what they 'say' is the standard. :tongue:
Just like this line ending thing. Linux: LF, classic Mac OS: CR, Windows: CRLF
Create a package in Linux and in Windows it's all messed up if you don't use the proper newline code. But I shouldn't be using Linux to create packages for Windows. :P
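The newline mismatch itself is easy to normalize explicitly; a minimal Python sketch:

```python
# The three newline conventions mentioned above
text_lf   = "line one\nline two\n"      # Linux: LF
text_crlf = "line one\r\nline two\r\n"  # Windows: CRLF
text_cr   = "line one\rline two\r"      # classic Mac OS: CR

# str.splitlines treats all of them alike ...
assert text_lf.splitlines() == text_crlf.splitlines() == text_cr.splitlines()

# ... and normalizing to LF before packaging avoids the mess entirely
normalized = text_crlf.replace("\r\n", "\n").replace("\r", "\n")
assert normalized == text_lf
```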
When you mention hosting packages elsewhere, when they are not intended for everyone, is it worth mentioning somewhere like MyGet.org? i.e. The package can still be on a public feed, but not polluting the Chocolatey.org feed?
Gary
You are replying to the character encoding wiki thing. I am not sure what you mean. Did you mean to reply to #355 ? Not sure what you mean there either. I'm sure other public feeds won't appreciate unacceptable packages either.
No BOM; it causes more issues than it helps. I find it weird that PowerShell would get it wrong. Do you have a gist where it shows it outputting the detected encoding?
@Redsandro You are right, my comment is directed here:
https://github.com/chocolatey/chocolatey/issues/355
Sorry :)
Wrong, it’s not the missing BOM that causes issues; Microsoft causes them by ignoring standards!
A gist is not necessary to show you PowerShell’s behavior. PowerShell simply assumes that all PowerShell scripts are in that crappy Windows-1252 encoding unless there’s a BOM. Then it simply reads the bytes from the file and interprets them as if they were Windows-1252. Example:
test.ps1, encoded in UTF-8 without BOM:
Write-Host "αβγ © ™ ΩΣ"
Output in PowerShell:
In C:\Users\Someuser\test.ps1:1 Character:31
+ Write-Host "αβγ © ™ ΩΣ"
+ ~
The string has no terminator: ".
+ CategoryInfo : ParserError: (:) [], ParseException
+ FullyQualifiedErrorId : TerminatorExpectedAtEndOfString
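If the script is saved as UTF-8 with BOM instead, PowerShell (per the behavior described above) reads it correctly. A Python sketch of producing such a file with the `utf-8-sig` codec, which prepends the BOM:

```python
import codecs, os, tempfile

# The script from above, with non-ASCII characters
script = 'Write-Host "αβγ © ™ ΩΣ"\n'

# "utf-8-sig" prepends the UTF-8 BOM, which (per the behavior described
# above) makes PowerShell read the file as UTF-8 instead of Windows-1252.
path = os.path.join(tempfile.gettempdir(), "test.ps1")
with open(path, "w", encoding="utf-8-sig") as f:
    f.write(script)

with open(path, "rb") as f:
    raw = f.read()
print(raw[:3].hex())  # efbbbf -> the BOM PowerShell looks for
```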
I think the character encoding section in the Wiki is fine. It clearly explains what a person should do and clarifies the confusion that Microsoft has created. If someone doesn’t understand this, they probably shouldn’t create packages anyway. ;)
And by the way, all modern Linux distributions use UTF-8 without BOM by default. The W3C doesn’t recommend BOMs for the web (http://www.w3.org/International/questions/qa-byte-order-mark.en.php).
BOMs for UTF-8 really cause issues: https://en.wikipedia.org/wiki/Byte_order_mark#UTF-8 Some issues that I’ve experienced myself when using UTF-8 with a BOM:
- .gitignore files don’t work properly. The expression in the first line fails because the BOM character stands in the way.
- A colleague created a small node.js script which merges multiple CSV files. The output file contained multiple BOM characters, even in the middle of a line. He then imported that into a database using phpMyAdmin and experienced database inconsistencies. It took him over an hour to find out that the BOMs were causing the issue.
Well, Microsoft seems to insist on BOMs when using UTF-8. But only because Chocolatey is a Windows software, it doesn’t mean that it has to adapt Microsoft’s inability to follow standards and recommendations, especially when it comes to such fundamental things like character encoding.
I have read from multiple sources that Microsoft is not very open to suggestions. For example, they keep ignoring bug reports and feature requests for Internet Explorer and automatically delete them after about 48 hours. What kind of weird strategy is this? Are they planning the downfall of their own company? Well, at least for IE it seems to work, judging by the progression of Internet Explorer’s market share. :grinning:
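Stray BOMs in the middle of concatenated output, like the CSV-merge mishap mentioned in this thread, come from naively joining files that each begin with a BOM. A minimal Python sketch of the bug and the fix:

```python
import codecs

# Two CSV chunks, each saved as "UTF-8 with BOM" by some Windows editor
parts = [codecs.BOM_UTF8 + "id,name\n".encode("utf-8"),
         codecs.BOM_UTF8 + "1,µTorrent\n".encode("utf-8")]

naive = b"".join(parts)            # a BOM lands in the middle of the data
assert codecs.BOM_UTF8 in naive[3:]

def strip_bom(chunk: bytes) -> bytes:
    """Drop a leading UTF-8 BOM from a chunk, if present."""
    return chunk[len(codecs.BOM_UTF8):] if chunk.startswith(codecs.BOM_UTF8) else chunk

clean = b"".join(strip_bom(p) for p in parts)
print(clean.decode("utf-8"))
```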
I mostly ignore things I can't change and change the things I can change.
That's some wisdom right there. ;)
I mostly ignore things I can't change and change the things I can change.
Sounds pretty clear to me: We make chocolatey better by leaving the character encoding section as it is and let IE die. :grinning:
Sorry to mention it, but your wisdom phrase has an inconsistency: how can you be sure that you can’t change something? If you take a closer look at history, you will notice great people who have seemingly done the impossible. But today we know that it wasn’t impossible. Thanks to those people we now have democracy, women's suffrage and no more slavery. At least in the countries where we live.
edit: sorry, I meant something, not anything ;)
Isn't the problem solved by using the 'proper' charset?
Of course :) Simply UTF-8 without BOM for *.nuspec files and UTF-8 with BOM for PowerShell scripts.
Notepad++ incorrectly uses the term ANSI as UTF-8 for UTF-8 encoded files without a BOM.
What version are you running? I have no "ANSI as UTF-8" option, only "UTF-8 without BOM"
Version 6.2.2 Build Nov18 2012. (year old)
Open the file after saving it that way. That's what I think Tom is referring to.
Oh you're right! It goes back to ANSI.
Screenshot of Notepad++ 6.5: Notepad++ is inconsistent, and that’s even worse. It’s also mentioned in these two bug reports:
@Redsandro, I think you misunderstood something here. When you open a file with no special characters, the encoding cannot be determined by the text editor. Then it depends on the settings of the editor which encoding is assumed. In Notepad++, you can override this setting if you tick “Apply to opened ANSI files” like in the screenshot above, which will assume UTF-8 for every file that has no special characters and no BOM. That’s the standard for every Linux editor I know, and it’s a good standard. :+1:
Sorry, I meant “which will assume UTF-8 for every file that has no non-ASCII characters and no BOM.”
I updated the title of this issue, because the old one was misleading and wrong.
Now I think we have all the information needed to fix the character encoding problem. If you have no further doubts or questions, this issue can be closed.
Documentation is important, especially for new people, who form their first impression from it. So I have a few suggestions, if I may. :smile:
Suggested edit that implements all of these:
- Either: Do not use non-ASCII characters. Or: Use the UTF-8 character encoding for the `*.nuspec` and `*.ps1` files. If you don’t respect this rule, some characters are not displayed correctly in the Gallery on Chocolatey.org, because the Gallery assumes UTF-8.
- Don’t save your `*.nuspec` files with a Byte Order Mark (BOM). A BOM is neither required nor recommended for UTF-8, because it can lead to several issues.
- Save your `*.ps1` files with a BOM. PowerShell is ignoring the standards and needs a BOM in order to recognize scripts as UTF-8. Otherwise it processes non-ASCII characters incorrectly.
- Save UTF-8 files without BOM. Alternatives: …
- Specify the UTF-8 encoding in the first line of your `*.nuspec` files like so: `<?xml version="1.0" encoding="utf-8"?>`.
- Note: There is a lot of confusion in the world of character encodings: For example, ANSI is an incorrect term for the internal Windows character encodings, e.g. Windows-1252. But you should not use this encoding family anyway. In addition, Notepad++ incorrectly uses the term “ANSI as UTF-8” for UTF-8 encoded files without a BOM. If you select UTF-8 in Notepad++, it means UTF-8 with BOM. Therefore Notepad++ will show “ANSI as UTF-8” in the status bar.
Good suggestions, except this one:
Either: Do not use non-ASCII characters.
This is not a solution to the character encoding problem, only an amateurish circumvention for lazy people. I’m very annoyed by all these reductions to typographically incorrect ASCII characters, like e' instead of è, ue instead of ü, ss instead of ß or (c) instead of ©.
I expect an advanced computer user, programmer or package maintainer to know the important things about character encodings, or at least to be open to learning them. People who don’t like to read and follow rules, guidelines and standards should never create and publish packages. Such people are harmful to the community and make the world a worse place.
Don’t we all want packages with a good quality? Isn’t it worth to spend a few more minutes or a bit more to read, understand and follow the rules and guidelines of a project?
How about "Avoid the use of non-ASCII characters"?
And how about we update the templates so that they are in line with these recommendations?
Doesn’t that mean the same as “do not use”, just in other words? I definitely don’t like this recommendation. It’s like sweeping the dirt under the carpet. :-1: Seriously, what is so hard about reading a few lines and changing the editor’s character encoding settings if necessary?
I’ll make a pull request so that the templates comply with these recommendations.
Good suggestions, except this one
Well, I bet 95% of package maintainers (including myself) never gave it a thought, and everything works as long as you stick to ASCII characters, because the encoding doesn't really matter then.
So, in practice (practice trumps theory) it's a proper and widely used solution within packages, until you are a seasoned packager and want to do crazy stuff like:
Update from 2013-10-29: Note that some information in this thread is wrong:
Original issue text: I discovered that the cpack command generates a nupkg which contains a nuspec file with the ANSI character encoding, even if the original nuspec file was UTF-8. ANSI is considered outdated and deprecated. Here is a nice explanation of this topic: http://www.differencebetween.net/technology/protocols-formats/difference-between-ansi-and-utf-8/ The use of ANSI instead of UTF-8 destroys all characters in the Chocolatey gallery outside the basic ASCII range, for example the μ in μTorrent, the © and many other special characters. It seems that NuGet.exe is responsible for this. Because of this, I’m currently using the NuGet Package Explorer application to build my packages, which uses UTF-8. That works perfectly. I didn’t find any command line switches for NuGet.exe to force UTF-8.
Perhaps some of you think this is an irrelevant issue. But I think it is important to switch to UTF-8 in order to eliminate character encoding problems once and for all. Almost all modern Linux distributions have already done so. Thumbs up for them. :)