URL parsing doesn't handle UTF-8 characters

ghost commented 9 years ago

Forwarded from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=408126

I don't know where this is going but... urlscan doesn't handle URLs that contain UTF-8 characters. I just received an email containing the following URL, which urlscan fails to parse correctly: http://www.pantherhouse.com/newshelton/my-wife-thinks-i’m-a-swan/ When I paste it into Firefox directly, it successfully opens http://www.pantherhouse.com/newshelton/my-wife-thinks-i%E2%80%99m-a-swan/

I'm not quite sure how to handle this, since it essentially means that virtually any character can appear in a URL. Maybe you have a good idea.

I'm not sure whether UTF-8 characters should be supported in URLs. I think not, but with the recent arrival of IDN, maybe. Firefox accepts such a URL without it being URL-encoded first.
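
For reference, the percent-encoded form that Firefox ends up opening can be reproduced with Python's standard library. This is only a sketch of the encoding step, not anything urlscan does; the helper name `percent_encode_path` is made up for illustration, and it assumes Python 3 and that only the path component needs encoding.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode the path of `url`, writing non-ASCII characters as UTF-8 bytes.

    Hypothetical helper for illustration only; not part of urlscan.
    """
    parts = urlsplit(url)
    # quote() leaves ASCII letters, digits and "_.-~" untouched; "/" is kept
    # via `safe` so the path separators survive. Everything else becomes %XX.
    encoded_path = quote(parts.path, safe="/")
    return urlunsplit(parts._replace(path=encoded_path))


raw = "http://www.pantherhouse.com/newshelton/my-wife-thinks-i’m-a-swan/"
print(percent_encode_path(raw))
# http://www.pantherhouse.com/newshelton/my-wife-thinks-i%E2%80%99m-a-swan/
```

The right single quotation mark (U+2019) is three bytes in UTF-8, which is where the `%E2%80%99` comes from.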

ghost commented 9 years ago

To reproduce:

printf '\nhttp://github.com/firecat53/ürlscan\n' | urlscan

firecat53 commented 9 years ago

I've got support for alphanumeric UTF-8 characters now in the utf8 branch. This works with the 2nd example above because the 'ü' is alphanumeric. However it doesn't work with the ’ character in your first example, because that's not an alphanumeric character.
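
The difference between the two characters can be seen with a small sketch. The pattern below is a simplified stand-in, not the regex actually used in the utf8 branch, and `find_urls` is a hypothetical helper; it assumes Python 3, where `\w` in a `str` pattern matches Unicode letters and digits.

```python
import re

# Simplified, hypothetical URL pattern (NOT urlscan's real one):
# scheme, then any run of word characters, dots, slashes and hyphens.
URL_RE = re.compile(r"https?://[\w./-]+")


def find_urls(text: str) -> list:
    """Return every substring of `text` that matches URL_RE."""
    return URL_RE.findall(text)


# 'ü' is alphanumeric, so it is covered by \w and the URL is matched whole.
print("ü".isalnum())  # True
print(find_urls("see http://github.com/firecat53/ürlscan now"))
# ['http://github.com/firecat53/ürlscan']

# '’' (U+2019) is punctuation, so the match stops just before it.
print("’".isalnum())  # False
print(find_urls("http://www.pantherhouse.com/newshelton/my-wife-thinks-i’m-a-swan/"))
# ['http://www.pantherhouse.com/newshelton/my-wife-thinks-i']
```

Supporting the second URL would mean adding characters like ’ to the pattern one by one, which is the kind of exception list in question here.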

Should characters like that in a URL even be supported? Should they just be URL-encoded?

I added a few more 'assert' "tests" in urlscan.py. Oddly enough the "I'm a swan" URL passes that assertion. I'll have to think about that some more, because I don't understand why yet.

See if you can break it with anything else :)

Thanks, Scott

firecat53 commented 9 years ago

Well, I think I've got it... except for the ’ character in 'i’m-a-swan/'. It's not alphanumeric, so at this point I'd rather not try to make a bunch of exceptions to handle it. Let me know if you have any other thoughts on that. Thanks!

firecat53 / urlscan

URL parsing doesn't handle UTF-8 characters #5