Closed by ghost 9 years ago
To reproduce:
echo '\nhttp://github.com/firecat53/ürlscan'|urlscan
I've now got support for alphanumeric UTF-8 characters in the utf8 branch. This works with the 2nd example above because the 'ü' is alphanumeric. However, it doesn't work with the ’ character in your first example, because that's not an alphanumeric character.
Should characters like that even be supported in a URL, or should they just be URL-encoded?
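For illustration, Python's own idea of "alphanumeric" splits these two characters the same way. This is only a sketch: it assumes the check in the utf8 branch behaves like `str.isalnum()`, which the thread does not confirm, and the variable names are made up for the example.

```python
# Sketch only: how Python classifies the two characters discussed above.
# Assumption: the "alphanumeric" test behaves like str.isalnum().
u_umlaut = "\u00fc"      # 'ü', LATIN SMALL LETTER U WITH DIAERESIS
right_quote = "\u2019"   # '’', RIGHT SINGLE QUOTATION MARK

umlaut_is_alnum = u_umlaut.isalnum()     # a letter, so alphanumeric
quote_is_alnum = right_quote.isalnum()   # punctuation, so not alphanumeric

print(umlaut_is_alnum, quote_is_alnum)
```

So any matcher built on "alphanumeric" alone will accept 'ü' but stop at '’'.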
I added a few more 'assert' "tests" in urlscan.py. Oddly enough the "I'm a swan" URL passes that assertion. I'll have to think about that some more, because I don't understand why yet.
See if you can break it with anything else :)
Thanks, Scott
Well, I think I've got it...except for the character in 'i’m-a-swan/'. It's not alphanumeric, so I think at this point, I'd rather not try to make a bunch of exceptions to handle it. Let me know if you have any other thoughts on that. Thanks!
Forwarded from https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=408126
I don't know where this is going but... urlscan doesn't handle URLs that contain UTF-8 characters. I just received an email containing the following URL, which urlscan fails to parse correctly:
http://www.pantherhouse.com/newshelton/my-wife-thinks-i’m-a-swan/
When I paste it into Firefox directly, it successfully opens:
http://www.pantherhouse.com/newshelton/my-wife-thinks-i%E2%80%99m-a-swan/
I'm not quite sure how to handle this, since it essentially means that virtually any character can appear in a URL. Maybe you have a good idea.
I'm not sure whether UTF-8 characters should be supported in URLs. I think not, but with the recent arrival of IDN, maybe. Firefox supports them without requiring them to be URL-encoded.
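As a rough illustration of what Firefox does with that URL, the path can be percent-encoded with Python's standard `urllib.parse.quote`, which encodes non-ASCII characters as UTF-8 bytes by default. This is a sketch of the encoding step only, not of how urlscan extracts URLs; the variable names are made up for the example.

```python
from urllib.parse import quote, unquote

# Path from the report above; \u2019 is the ’ (RIGHT SINGLE QUOTATION MARK).
path = "/newshelton/my-wife-thinks-i\u2019m-a-swan/"

# quote() leaves '/' and unreserved ASCII alone and percent-encodes the
# UTF-8 bytes of everything else.
encoded = quote(path)
print(encoded)

# unquote() reverses the encoding and recovers the original path.
decoded = unquote(encoded)
```

The ’ becomes the three bytes `%E2%80%99`, matching the address Firefox opens.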