Open GoogleCodeExporter opened 9 years ago
Thanks for spotting this issue.
Does that mean URLs are always encoded according to the code page of the document?
Original comment by OrphanCat
on 13 Nov 2012 at 6:42
No, it's not that simple. You cannot detect whether it is UTF-8 from
anything in the HTML file. It depends solely on the web server and on how
the web application handles these encodings. For example, Delphi WebBroker
follows the RFC 2396 rules. You must also understand that the URL has a
query part (the part after the "?" character), where spaces may also be
encoded as "+". This does not seem to be handled by the current DecodeURL
function. In the rest of the URL, outside the query part, a space must be
%20.
Example:
http://meinweb.com/my%20Pictures/showimg?img=mein+sch%F6nes+bild.jpg
In this example the "img" query parameter must be decoded to "mein
schönes bild.jpg".
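
To make the difference between path and query decoding concrete, here is a minimal sketch in Python using the standard urllib.parse module (not the Delphi DecodeURL routine discussed here). It assumes the %F6 byte is meant as Latin-1, where $F6 is "ö"; that choice of code page is an assumption, since the URL itself does not say.

```python
from urllib.parse import urlsplit, parse_qs, unquote

url = "http://meinweb.com/my%20Pictures/showimg?img=mein+sch%F6nes+bild.jpg"
parts = urlsplit(url)

# Path: only %XX escapes are decoded; a literal "+" would stay a plus sign.
# Assumption: the single-byte escapes are Latin-1.
path = unquote(parts.path, encoding="latin-1")

# Query: "+" is turned into a space, then the %XX escapes are decoded.
query = parse_qs(parts.query, encoding="latin-1")

print(path)          # /my Pictures/showimg
print(query["img"])  # ['mein schönes bild.jpg']
```

The point of the sketch is only that the two parts of the URL need different rules for "+"; how the bytes are turned into characters is a separate decision.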
The tricky part is now to detect whether the URL contains UTF-8 encoded
characters and follows RFC 3986, or whether it follows RFC 2396. My ideas
for how this could be done are as follows:
1. On decoding, check whether the first byte is a valid UTF-8 lead byte.
2. If the byte value of the character is above $80 and the following
character isn't percent-encoded at all, it is certainly RFC 2396. If the
following character is encoded too, it is not clear whether it is UTF-8 or
not. Here we must check for a UTF-8 lead byte again.
3. Try to decode it as RFC 3986 and, on any error, switch over to RFC 2396.
IMO, a combination of the first and third ideas, checking for a valid
UTF-8 lead byte and switching over to RFC 2396 on any error, should be the
best solution.
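
The combination of the first and third ideas could be sketched roughly as follows, again in Python rather than Delphi. The function name decode_component, its parameters, and the cp1252 fallback code page are all hypothetical choices for illustration. The strict UTF-8 decode stands in for the lead-byte check: it fails on any invalid lead or continuation byte, and the fallback then treats the bytes as a single-byte code page.

```python
from urllib.parse import unquote_to_bytes


def decode_component(text, is_query=False, fallback="cp1252"):
    """Hypothetical helper: percent-decode, preferring UTF-8 over a code page."""
    if is_query:
        # "+" means space only inside the query part.
        text = text.replace("+", " ")
    raw = unquote_to_bytes(text)
    try:
        # Strict decode: raises on invalid UTF-8 lead or continuation bytes.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is a single-byte code page.
        return raw.decode(fallback, errors="replace")


print(decode_component("mein+sch%F6nes+bild.jpg", is_query=True))
print(decode_component("mein+sch%C3%B6nes+bild.jpg", is_query=True))
```

Both calls should give the same file name. The heuristic is not watertight: a few single-byte sequences are also valid UTF-8 (for example $C3 $B6), so such input would be read as UTF-8 even if the sender meant two Latin-1 characters.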
Please see this page for some interesting ideas on detecting UTF-8
without a BOM:
http://www.delphigroups.info/2/4/581583.html
PS. What exactly is this routine used for in THtmlViewer? Is it
important to decode it to the correct URL? Must the query part of the URL be
decoded at all for the THtmlViewer component to work correctly?
Original comment by r...@eicom.ch
on 14 Nov 2012 at 2:20
Here is an online URL encode/decode page that has an option to switch to
RFC 2396 mode, so you can see the differences. As far as I can see, "+" as
a space is only valid in RFC 2396.
http://www.albionresearch.com/misc/urlencode.php
Original comment by r...@eicom.ch
on 14 Nov 2012 at 2:40
Rolf,
thanks for the research.
It is most helpful.
OrphanCat
Original comment by OrphanCat
on 14 Nov 2012 at 6:41
Original issue reported on code.google.com by
r...@eicom.ch
on 13 Nov 2012 at 3:03