IDN SITE_URL is not converted to Punycode

gerritsangel commented 9 years ago

The SITE_URL is not converted correctly to Punycode. For example, when initialising a new Blog and writing:

Site URL [http://getnikola.com/]: http://exämple.com/täst/ , conf.py ends up containing: SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"

The correct behaviour would be to convert the domain name to Punycode: SITE_URL = "http://xn--exmple-cua.com/t\u00e4st/"

As a result, Firefox (for example) throws an error when clicking on the logo.

I guess that (only) the domain part needs to be isolated from SITE_URL and then converted with "exämple.com".encode("idna") to xn--exmple-cua.com.

Nikola should also keep in mind that the user may edit SITE_URL in conf.py by hand and write the IDN without Punycode, for example: SITE_URL = "http://exämple.com/täst" Therefore the Punycode conversion is best applied while building, not during blog init.

ralsina commented 9 years ago

Interesting. It looks easy-ish :-)

Kwpolska commented 9 years ago

SITE_URL = "http://exämple.com/täst"

This is equivalent to:

SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"

To solve this, we could just urlsplit() the URL, encode the domain part and urlunsplit() it back together.
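A minimal sketch of that idea, using only the standard library. The function name `punycode_site_url` is hypothetical and this is not the code that ended up in Nikola; it only illustrates splitting the URL, IDNA-encoding the host, and reassembling the result while leaving the path untouched.

```python
from urllib.parse import urlsplit, urlunsplit


def punycode_site_url(site_url):
    """Return site_url with only its host converted to Punycode.

    Hypothetical helper for illustration; IPv6 literals are not handled.
    """
    parts = urlsplit(site_url)
    host = parts.hostname
    if host is None:
        # No network location (e.g. a relative URL): nothing to convert.
        return site_url
    # Python's built-in "idna" codec encodes each label of the domain.
    ascii_host = host.encode("idna").decode("ascii")
    # Keep any "user:password@" prefix and the port as they were.
    userinfo, sep, _ = parts.netloc.rpartition("@")
    netloc = userinfo + sep + ascii_host
    if parts.port is not None:
        netloc += ":{}".format(parts.port)
    return urlunsplit(parts._replace(netloc=netloc))


print(punycode_site_url("http://exämple.com/täst/"))
# http://xn--exmple-cua.com/täst/
```

The path still contains the raw "ä" here; only the domain is touched, which matches what the original report asks for.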

Kwpolska commented 9 years ago

PS. the issue is caused by a dumb algorithm (in lxml?) that is handling links like it’s 1999:

<a href="http://%D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.%D1%80%D1%84/">broken</a>

This is the problem. Firefox wouldn’t mind the punycode form, and it also wouldn’t mind the real unicode:

<a href="http://xn--d1abbgf6aiiy.xn--p1ai/">works</a>
<a href="http://президент.рф/">works too</a>

On a side note, Chrome supports the percent-escaped link; IE and Safari fail on it as well.
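For reference, the two ASCII forms of the host name shown above can be reproduced with the standard library. This is only an illustration of where each form comes from; it does not claim to be what lxml does internally.

```python
from urllib.parse import quote

host = "президент.рф"

# Percent-escaping the UTF-8 bytes of the host: the form most browsers rejected.
percent_escaped = quote(host)

# IDNA (Punycode) encoding of the host: the form browsers accept.
punycode = host.encode("idna").decode("ascii")

print(percent_escaped)  # %D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.%D1%80%D1%84
print(punycode)         # xn--d1abbgf6aiiy.xn--p1ai
```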

test: https://dl.dropboxusercontent.com/u/1933476/IDN.html

This is not a statement of support for the Russian Federation

gerritsangel commented 9 years ago

As a side note, maybe it would be good to write the variables in conf.py directly in UTF-8. Escaping everything reduces readability to 0 and is not necessary, because conf.py's encoding is declared as utf8 either way.
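To illustrate the difference between the escaped and the readable form, here is a small sketch that uses `json.dumps` as a stand-in serializer. That Nikola writes conf.py this way is an assumption made for the example, not something stated in this thread.

```python
import json

site_url = "http://exämple.com/täst/"

# Default behaviour: every non-ASCII character becomes a \uXXXX escape.
escaped = "SITE_URL = " + json.dumps(site_url)

# With ensure_ascii=False the string is written as-is, which is fine
# as long as the file itself is saved as UTF-8.
readable = "SITE_URL = " + json.dumps(site_url, ensure_ascii=False)

print(escaped)   # SITE_URL = "http://ex\u00e4mple.com/t\u00e4st/"
print(readable)  # SITE_URL = "http://exämple.com/täst/"
```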

Kwpolska commented 9 years ago

Done in fb3a7db. Requires UTF-8 input for this to work.

gerritsangel commented 9 years ago

I found that the bug has a slightly larger impact when using Isso as a comment system. The "script src" in the output HTML file will be incorrect, so the comment script is not loaded and comments don't work.

Solution/workaround: Same as above, write the domain in Punycode in COMMENT_SYSTEM_ID.

Kwpolska commented 9 years ago

It looks like the best solution would be to fix things in nikola init and require Punycode in the config file (refuse to build if Unicode is found in the domain).

gerritsangel commented 9 years ago

Well, in my (humble) opinion, it would be best if Nikola did not convert the UTF-8 to anything: no Punycoding, no escaping. All current web browsers should understand URLs like http://президент.рф/президент.html. Therefore I don't see the necessity to convert this to http://xn--d1abbgf6aiiy.xn--p1ai/%D0%BF%D1%80%D0%B5%D0%B7%D0%B8%D0%B4%D0%B5%D0%BD%D1%82.html.

It would greatly improve readability if no conversions were made. First, in conf.py: if you actually want to read what you have entered in SITE_URL, non-Punycode is better. The HTML source code is, of course, less of a problem, but it would be easier to read as well (and save some bytes :D)

... but of course I understand that this may need an extreme overhaul.

Kwpolska commented 9 years ago

We can’t fix this on our own. lxml or one of their upstreams can — talk to the appropriate vendor if you want this fixed nicely.

gerritsangel commented 9 years ago

Ah ok, sorry for the misunderstanding :) I'll try it.

ralsina commented 9 years ago

@Kwpolska so, if I understand this correctly there's nothing more we can do? Close it?

Kwpolska commented 9 years ago

@ralsina Possible solutions include:

(a) trying to get the link replacer to fix this (which will probably not fix everything);
(b) fixing things in nikola init and warning users/failing to build if Unicode characters are encountered in SITE_URL.

Which one do we choose?
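Option (b) could look roughly like the following check at build time. The function name `check_site_url` and the error message are hypothetical; this is a sketch of the idea, not the patch from this issue.

```python
from urllib.parse import urlsplit


def check_site_url(site_url):
    """Raise ValueError if the domain of site_url contains non-ASCII characters.

    Hypothetical helper for illustration; only the host is checked,
    Unicode in the path is left alone.
    """
    host = urlsplit(site_url).hostname or ""
    if any(ord(ch) > 127 for ch in host):
        # Suggest the Punycode form so the user can fix conf.py directly.
        suggestion = host.encode("idna").decode("ascii")
        raise ValueError(
            "SITE_URL contains a non-ASCII domain {!r}; "
            "use {!r} instead".format(host, suggestion)
        )


check_site_url("http://xn--exmple-cua.com/täst/")  # passes silently

try:
    check_site_url("http://exämple.com/täst/")
except ValueError as exc:
    print(exc)
```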

ralsina commented 9 years ago

I'd say b) which looks much easier.

Kwpolska commented 9 years ago

I tried to fix it with (a) and I failed. Not only did the aforementioned Isso src links blow up, it also looks like the URL replacer does not touch the logo link and many others.

But, we could leave the patch in for when people want to link to IDN domain names and have Unicode input.

Fix in #1668.

getnikola / nikola

IDN SITE_URL is not converted to Punycode #1644