Closed G33kDude closed 4 years ago
Slower than the original? ;)
By far, but it's shorter and less buggy.
It may be possible to use a similar approach to the original, but tweak it slightly. Instead of using StringReplace, it would convert each match in place through means such as Original := SubStr(Original, 1, Pos-1) . Replacement . SubStr(Original, Pos+StrLen(Match)), where Match is the encoded text being replaced, then making sure that the starting position of the regex is always past Pos+StrLen(Replacement). It should have the same speed bonus on normal text, but be less buggy.
According to my benchmarks, this version appears to work both faster and more accurately than what we're using now.
UriDecode(Uri) {
    Pos := 1
    While Pos := RegExMatch(Uri, "i)(%[\da-f]{2})+", Code, Pos)
    {
        VarSetCapacity(Var, StrLen(Code) // 3, 0), Code := SubStr(Code,2)
        Loop, Parse, Code, `%
            NumPut("0x" A_LoopField, Var, A_Index-1, "UChar")
        Decoded := StrGet(&Var, "UTF-8")
        Uri := SubStr(Uri, 1, Pos-1) . Decoded . SubStr(Uri, Pos+StrLen(Code)+1)
        Pos += StrLen(Decoded)+1
    }
    Return, Uri
}
:+1:
The "fixed" version doesn't work for %D0%B5%D0%B1, whereas the one from the first post here works great.
@G33kDude Derp. Could you look into this if you have free time?
The input that @raszpl provided is a URL-encoded string using the ANSI codepage, whereas our function uses the UTF-8 codepage. Our behavior is consistent with the relevant RFC. However, I can see some value in allowing the codepage to be specified for the decoder:
UriDecode(Uri, Encoding:="UTF-8") {
    Pos := 1
    While Pos := RegExMatch(Uri, "i)(%[\da-f]{2})+", Code, Pos)
    {
        VarSetCapacity(Var, StrLen(Code) // 3, 0), Code := SubStr(Code,2)
        Loop, Parse, Code, `%
            NumPut("0x" A_LoopField, Var, A_Index-1, "UChar")
        Decoded := StrGet(&Var, Encoding)
        Uri := SubStr(Uri, 1, Pos-1) . Decoded . SubStr(Uri, Pos+StrLen(Code)+1)
        Pos += StrLen(Decoded)+1
    }
    Return, Uri
}
To decode the given sample, you would invoke the above function as such:
MsgBox, % UriDecode("%D0%B5%D0%B1", "CP0") ; еб
Great, thanks for looking into it!
It's UTF-8, part of a Russian string: "еб". https://www.urlencoder.org returns %D0%B5%D0%B1 for it, and decodes that back to "еб" in the reverse direction.
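One way to check this locally is to encode the string to UTF-8 and print its bytes as percent escapes. The following is only a sketch: it assumes an AutoHotkey v1.1 Unicode build (needed for Chr with code points above 255, and for Format), and the variable names are illustrative.

```autohotkey
; Assumption: AutoHotkey v1.1 Unicode build.
; Build "еб" from code points so the script file's encoding doesn't matter.
str := Chr(0x435) . Chr(0x431)

; Without a target address, StrPut returns the required size, including the null terminator.
size := StrPut(str, "UTF-8")
VarSetCapacity(buf, size, 0)
StrPut(str, &buf, "UTF-8")

; Print every byte except the terminator as %XX.
hex := ""
Loop, % size - 1
    hex .= "%" . Format("{:02X}", NumGet(buf, A_Index - 1, "UChar"))

MsgBox, % hex   ; %D0%B5%D0%B1
```

U+0435 and U+0431 each take two bytes in UTF-8 (D0 B5 and D0 B1), which matches the escaped string from the report.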
Maybe I should elaborate on why I need this and how I stumbled onto the error. I am trying to pass URLs from JavaScript using the Windows command line. Sadly, the command line uses
chcp
Active code page: 437
by default? So I encode all strings to a Base64 string. Unpacking that works fine, but then we are left with a UTF-8 encoded URI, and LC_UriDecode fails. This, on the other hand, works fine:
DecodeURI(Str)
{
    Try
    {
        doc := ComObjCreate("HTMLfile")
        doc.write("<body><script>document.write(decodeURIComponent(""" . Str . """));</script>")
        Return, doc.body.innerText
    }
}
If "еб" is the desired output, then the original code works fine on my system. What are you seeing as output?
The exact same script returns an empty message box on my computer. Am I crazy? Windows 10 64-bit, tested all three variants of AHK 1.1.32. The ComObjCreate("HTMLfile") version works fine (obviously only in the Unicode 32/64 variants). The one from G33kDude's first post works for "%D0%B5%D0%B1", but try "%D0%B5%D0%B13600" and you'll notice a problem: it returns "еб36" instead of "еб3600".
Are you saving your scripts in UTF8 with BOM?
Notepad++. I tried the ANSI, UTF-8-BOM, and even UCS-2 options; no change.
Hmm works fine here:
What's your environment info (AHK version, etc.)? https://github.com/joedf/AEI.ahk
SystemLocale : en-US (0x0409)
was using
AutoHotkey : v1.1.32.00 Unicode 32-bit (Portable)
updated to
AutoHotkey : v1.1.33.02 Unicode 64-bit (Installed)
no change
pasting
url := "%D0%B5%D0%B13600"
MsgBox,% LC_UriEncode(url)
MsgBox,% LC_UriDecode(url)
MsgBox,% LC_UrlEncode(url)
MsgBox,% LC_UrlDecode(url)
into https://raw.githubusercontent.com/ahkscript/libcrypt.ahk/master/src/URI.ahk results in printing
3600
3600
Edit: I found https://autohotkey.com/board/topic/17367-url-encoding-and-decoding-of-special-characters/page-2#entry735783 so this isn't a new problem.
I still cannot reproduce your issue, however I can solve your bug with the original code from this post:
UriDecode(Uri)
{
    VarSetCapacity(Out, StrLen(Uri)+1, 0), Ptr := &Out, i := 1
    While (Char := SubStr(Uri, i++, 1)) != ""
        NumPut(Char=="%"?"0x" SubStr(Uri,-2+i+=2,2):Asc(Char),Ptr++,"UChar")
    Return, StrGet(&Out, "UTF-8")
}
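The loop body above packs the whole algorithm into a single expression. Expanded, the same single-pass idea might look like the sketch below. UriDecodeOnce is a hypothetical name, and the sketch assumes an AutoHotkey v1.1 Unicode build and ASCII-only input (as a percent-encoded URI should be). It also differs from the original in two small ways: it passes an explicit byte count to StrGet, and it leaves a % untouched when it isn't followed by two hex digits.

```autohotkey
; Hypothetical expanded variant of the byte-wise decoder (assumes AHK v1.1 Unicode).
UriDecodeOnce(Uri) {
    Len := StrLen(Uri), i := 1, n := 0
    ; The output never needs more bytes than the input has characters.
    VarSetCapacity(Buf, Len + 1, 0)
    While (i <= Len) {
        Char := SubStr(Uri, i, 1)
        If (Char = "%" && RegExMatch(SubStr(Uri, i + 1, 2), "i)^[\da-f]{2}$")) {
            ; A valid escape: store the byte it names and skip all three characters.
            NumPut("0x" . SubStr(Uri, i + 1, 2), Buf, n++, "UChar")
            i += 3
        } Else {
            ; Anything else is copied as a single byte (assumes the character is ASCII).
            NumPut(Asc(Char), Buf, n++, "UChar")
            i += 1
        }
    }
    ; Interpret exactly the n bytes written as UTF-8.
    Return StrGet(&Buf, n, "UTF-8")
}

MsgBox, % UriDecodeOnce("%2534")             ; %34
MsgBox, % UriDecodeOnce("%D0%B5%D0%B13600")  ; еб3600
```

Because each escape is consumed exactly once and the decoded bytes are never rescanned, %25 becomes a literal % that is not combined with the digits that follow it.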
Also, if you would like to continue using the htmlfile COM object, I would recommend code such as the following:
UriDecode(uri)
{
    static dom := ComObjCreate("htmlfile"), _ := dom.write("<script></script>")
    return dom.parentWindow.decodeURIComponent(uri)
}
For example, LC_UriDecode("%2534") returns 4 instead of %34. I've written a version of the function that doesn't have that problem, but it's roughly 2x slower by my benchmarks.