Patiencer / thtmlviewer

Automatically exported from code.google.com/p/thtmlviewer

URLSubs.DecodeURL broken #216

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
Which steps will reproduce the problem?
1. Open URLSubs.pas and go to the routine DecodeURL
2.
3.

What is the expected output? What do you see instead?

This type of URL is no longer handled correctly by the above routine, which 
throws exceptions.

http:\\www.myweb.de\loadHTML?UID=1009_113B3F03&MyTitle=%DCber+uns

For some reason, this routine always tries to convert this to UTF-8, but 
this is wrong! It must only convert it if the URL is in UTF-8 at all, and in 
real-world use I have never seen such a URL. Also, the referenced RFC 
doesn't really say it is always UTF-8. It says that if there is a Unicode 
char which must be UTF-8 encoded, the UTF-8 byte sequence must be %-encoded 
if the URL scheme of the underlying web server needs UTF-8. A %-encoded char 
is always a single byte.

In the above example the web server application needs an ANSI string and not 
a UTF-8 or Unicode string.
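
To illustrate the point, here is a minimal sketch in Python (not the Delphi code from URLSubs.pas). It percent-decodes the query value from the URL above into raw bytes and shows that the result is not valid UTF-8, while it decodes fine as ANSI. Windows-1252 is assumed here as the ANSI code page; the real code page depends on the web application.

```python
from urllib.parse import unquote_to_bytes

# Percent-decode to raw bytes only. "%DC" becomes the single byte 0xDC.
# The "+" is left untouched at this stage.
raw = unquote_to_bytes("%DCber+uns")

# 0xDC looks like a UTF-8 lead byte, but the next byte ("b") is not a
# continuation byte, so strict UTF-8 decoding fails.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: the server uses Windows-1252, where 0xDC is "Ü".
as_ansi = raw.decode("cp1252")
```

Here `is_valid_utf8` ends up `False` and `as_ansi` is `"Über+uns"`.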

This comment in the URLSubs.pas file is wrong!
"According to http://tools.ietf.org/html/rfc3986 percent encoded data is in 
UTF-8"
Only Unicode chars are UTF-8 encoded. This routine must follow the "old" 
RFC2396 rules as well as the newer ones; supporting only the new UTF-8 
rules will break thousands upon thousands of existing URLs on the web.

Which version of the product are you using? Which compiler version are you
using? On which operating system?

Delphi XE3
Latest version from svn (11.4?)
Windows 7 64 Bit

Please attach test html files and screenshots, if appropriate.
Please provide any additional information:

Original issue reported on code.google.com by r...@eicom.ch on 13 Nov 2012 at 3:03

GoogleCodeExporter commented 9 years ago
Thanks for spotting this issue.

Does that mean URLs are always encoded according to the code page of the document?

Original comment by OrphanCat on 13 Nov 2012 at 6:42

GoogleCodeExporter commented 9 years ago
No, it's not that simple. You cannot detect whether it is UTF-8 or not 
simply from something in the HTML file. This depends entirely on the web 
server and how the web application handles these encodings. For example, the 
Delphi WebBroker follows the RFC2396 rules. You must also understand 
that there is a query part in the URL (the part after the "?" char) where 
spaces may also be encoded as "+". This does not seem to be handled by the 
current DecodeURL function. In the rest of the URL, outside the query part, 
a space must be %20.

Example:
http://meinweb.com/my%20Pictures/showimg?img=mein+sch%F6nes+bild.jpg

In this example the "img" query parameter must be decoded to "mein 
schönes bild.jpg".
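
As a rough sketch of that difference between path and query (in Python, not the Delphi code in URLSubs.pas), the path is decoded with %-unescaping only, while the query is decoded with "+" treated as a space. Windows-1252 is an assumption for the server's ANSI code page.

```python
from urllib.parse import urlsplit, unquote, parse_qs

url = "http://meinweb.com/my%20Pictures/showimg?img=mein+sch%F6nes+bild.jpg"
parts = urlsplit(url)

# Path: only %XX sequences are decoded; a literal "+" would stay a "+".
path = unquote(parts.path, encoding="cp1252")

# Query: "+" means space, and %XX bytes are read as Windows-1252 (assumed).
params = parse_qs(parts.query, encoding="cp1252")
img = params["img"][0]
```

With this input, `path` is `"/my Pictures/showimg"` and `img` is `"mein schönes bild.jpg"`.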

The tricky part will now be to detect whether the URL contains any UTF-8 
encoded chars and follows RFC3986, or whether it is RFC2396. My ideas for how 
it may be done are as follows:

1. On decoding, check if the first byte is a valid UTF-8 lead byte.
2. If the value of the byte is above $80 and the following char isn't 
encoded at all, it is RFC2396 for sure. If the following char is encoded too, 
it's not clear if it is UTF-8 or not. Here we must check for a UTF-8 
lead byte again.
3. Try to decode it as RFC3986 and on any error, switch over to RFC2396.

IMO, a combination of the first and third ones, checking for a valid 
UTF-8 lead byte and switching over to RFC2396 on any error, should be the 
best solution.
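
The fallback idea could look roughly like the following Python sketch. It is not a patch for DecodeURL; the function name `decode_component`, its parameters, and the Windows-1252 fallback are all assumptions made for illustration. Strict UTF-8 decoding does the lead byte and continuation byte checks implicitly.

```python
from urllib.parse import unquote_to_bytes

def decode_component(text, fallback="cp1252", plus_as_space=False):
    """Percent-decode text, trying UTF-8 first and an ANSI code page second.

    Hypothetical helper: set plus_as_space=True for query values only.
    """
    if plus_as_space:
        # Replace before unescaping so that an encoded "%2B" stays a "+".
        text = text.replace("+", " ")
    raw = unquote_to_bytes(text)
    try:
        # RFC3986 style: fails on any invalid lead or continuation byte.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # RFC2396 style: one byte per char in the assumed ANSI code page.
        return raw.decode(fallback, errors="replace")

title_ansi = decode_component("%DCber+uns", plus_as_space=True)
title_utf8 = decode_component("%C3%9Cber+uns", plus_as_space=True)
```

Both calls return `"Über uns"`. The heuristic is not watertight: a byte sequence that is meant as ANSI but happens to be valid UTF-8 will be decoded as UTF-8.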

Please see this page for some interesting ideas on detecting UTF-8 
without a BOM header: 
http://www.delphigroups.info/2/4/581583.html

PS: What exactly is this routine used for in THtmlViewer? Is it 
important to decode it to the correct URL? Must the query part of the URL be 
decoded at all for the THtmlViewer component to work correctly?

Original comment by r...@eicom.ch on 14 Nov 2012 at 2:20

GoogleCodeExporter commented 9 years ago
Here is an online URLEncode/Decode page that has an option to switch to RFC 
2396 mode and see the differences. As far as I can see, "+" as a space is 
only valid in RFC2396.

http://www.albionresearch.com/misc/urlencode.php

Original comment by r...@eicom.ch on 14 Nov 2012 at 2:40

GoogleCodeExporter commented 9 years ago
Rolf,

thanks for the research. 
It is most helpful.

OrphanCat

Original comment by OrphanCat on 14 Nov 2012 at 6:41