ahkscript / libcrypt.ahk

A collection of crypting and encoding functions from the community
MIT License

LC_UriDecode breaks on %25 #21

Closed by G33kDude 4 years ago

G33kDude commented 9 years ago

For example, LC_UriDecode("%2534") returns 4 instead of %34. I've written a version of the function that doesn't have that problem, but it's roughly 2x slower by my benchmarks.

UriDecode(Uri)
{
    VarSetCapacity(Out, StrLen(Uri)+1, 0), Ptr := &Out, i := 1
    While Char := SubStr(Uri, i++, 1)
        NumPut(Char=="%"?"0x" SubStr(Uri,-2+i+=2,2):Asc(Char),Ptr++,"UChar")
    Return, StrGet(&Out, "UTF-8")
}
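
For context, the %25 failure is what happens when already-decoded text gets scanned a second time: %2534 first becomes %34, and the next pass turns that into 4. The sketch below reproduces that failure mode in isolation. NaiveDecode is a hypothetical name used only for illustration, it handles single-byte escapes only, and it is not the library's actual code.

```autohotkey
; Hypothetical illustration of double decoding; not the library's code.
NaiveDecode(Uri)
{
    ; The search restarts at position 1 after every replacement, so a "%"
    ; produced by decoding %25 can be picked up as the start of a new escape.
    While Pos := RegExMatch(Uri, "i)%([\da-f]{2})", Hex)
        Uri := SubStr(Uri, 1, Pos-1) . Chr("0x" Hex1) . SubStr(Uri, Pos+3)
    Return Uri
}

MsgBox, % NaiveDecode("%2534")    ; shows 4, although %34 is the correct result
```

A decoder avoids this by never looking again at text it has already produced, which is what a single left-to-right pass over the input guarantees.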
joedf commented 9 years ago

Slower than the original? ;)

G33kDude commented 9 years ago

By far, but it's shorter and less buggy.

G33kDude commented 9 years ago

It may be possible to use an approach similar to the original, but tweaked slightly. Instead of using StringReplace, it would convert each match in place, e.g. Original := SubStr(Original, 1, Pos-1) . Replacement . SubStr(Original, Pos+StrLen(Match)), where Match is the encoded text being replaced, and then make sure that the regex's next starting position is always at least Pos+StrLen(Replacement). It should keep the same speed advantage on normal text, but be less buggy.

G33kDude commented 9 years ago

According to my benchmarks, this version appears to work both faster and more accurately than what we're using now.

UriDecode(Uri) {
    Pos := 1
    While Pos := RegExMatch(Uri, "i)(%[\da-f]{2})+", Code, Pos)
    {
        VarSetCapacity(Var, StrLen(Code) // 3, 0), Code := SubStr(Code,2)
        Loop, Parse, Code, `%
            NumPut("0x" A_LoopField, Var, A_Index-1, "UChar")
        Decoded := StrGet(&Var, "UTF-8")
        Uri := SubStr(Uri, 1, Pos-1) . Decoded . SubStr(Uri, Pos+StrLen(Code)+1)
        Pos += StrLen(Decoded)+1
    }
    Return, Uri
}
joedf commented 9 years ago

:+1:

raszpl commented 4 years ago

The "fixed" version doesn't work for %D0%B5%D0%B1, whereas the one from the first post here works great.

joedf commented 4 years ago

@G33kDude Derp. Could you look into this if you have free time?

G33kDude commented 4 years ago

The input that @raszpl gave is a URL-encoded string using the ANSI codepage, whereas our function uses the UTF-8 codepage. Our behavior is consistent with the relevant RFC. However, I can see some value in allowing the codepage to be specified for the decoder:

UriDecode(Uri, Encoding:="UTF-8") {
    Pos := 1
    While Pos := RegExMatch(Uri, "i)(%[\da-f]{2})+", Code, Pos)
    {
        VarSetCapacity(Var, StrLen(Code) // 3, 0), Code := SubStr(Code,2)
        Loop, Parse, Code, `%
            NumPut("0x" A_LoopField, Var, A_Index-1, "UChar")
        Decoded := StrGet(&Var, Encoding)
        Uri := SubStr(Uri, 1, Pos-1) . Decoded . SubStr(Uri, Pos+StrLen(Code)+1)
        Pos += StrLen(Decoded)+1
    }
    Return, Uri
}

To decode the given sample, you would invoke the above function as such:

MsgBox, % UriDecode("%D0%B5%D0%B1", "CP0") ; еб
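
To see how much the Encoding argument matters, the same four bytes can be run through StrGet with two different encodings. This is a minimal sketch, assuming a Unicode build of AutoHotkey v1.1; BytesToString is a hypothetical helper, and CP1252 is picked only as an example of a single-byte codepage.

```autohotkey
; Hypothetical helper: interpret a list of byte values using a given encoding.
BytesToString(Bytes, Encoding)
{
    ; One extra zero-filled byte serves as the terminator that StrGet stops at.
    VarSetCapacity(Buf, Bytes.Length() + 1, 0)
    for Index, Byte in Bytes
        NumPut(Byte, Buf, Index-1, "UChar")
    Return StrGet(&Buf, Encoding)
}

Bytes := [0xD0, 0xB5, 0xD0, 0xB1]                 ; the bytes of %D0%B5%D0%B1
MsgBox, % StrLen(BytesToString(Bytes, "UTF-8"))   ; 2: two 2-byte sequences
MsgBox, % StrLen(BytesToString(Bytes, "CP1252"))  ; 4: one character per byte
```

Read as UTF-8, the bytes form two characters; read with a single-byte codepage, each byte becomes its own character.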
joedf commented 4 years ago

Great, thanks for looking into it!

raszpl commented 4 years ago

It's UTF-8, part of a Russian string, "еб". https://www.urlencoder.org returns %D0%B5%D0%B1 and converts it back to "еб" in the reverse direction.

Maybe I should elaborate on why I need this and how I stumbled onto the error. I am trying to pass URLs from JavaScript via the Windows command line. Sadly, the command line seems to use

chcp
Active code page: 437

by default, so I encode all strings to Base64. Unpacking that works fine, but then we are left with a UTF-8 encoded URI, and LC_UriDecode fails. This, on the other hand, works fine:

DecodeURI(Str)
{
    Try
    {
        doc := ComObjCreate("HTMLfile")
        doc.write("<body><script>document.write(decodeURIComponent(""" . Str . """));</script>")
        Return, doc.body.innerText
    }
}
G33kDude commented 4 years ago

If "еб" is the desired output, then the original code works fine on my system. What are you seeing as output?

[image]

raszpl commented 4 years ago

The exact same script returns an empty message box on my computer. Am I crazy? Windows 10 64-bit, tested with all three variants of AHK 1.1.32. The ComObjCreate("HTMLfile") version works fine (obviously only in the Unicode 32/64 variants). The one from G33kDude's first post works for "%D0%B5%D0%B1", but try "%D0%B5%D0%B13600" and you'll notice a problem: it returns "еб36" instead of "еб3600".

joedf commented 4 years ago

Are you saving your scripts as UTF-8 with BOM?

raszpl commented 4 years ago

In Notepad++ I tried the ANSI, UTF-8-BOM, and even UCS-2 options; no change.

joedf commented 4 years ago

Hmm, works fine here: [image]

What's your environment info (AHK version, etc.)? See https://github.com/joedf/AEI.ahk

raszpl commented 4 years ago
SystemLocale :          en-US (0x0409)

was using

AutoHotkey :            v1.1.32.00 Unicode 32-bit (Portable)

updated to

AutoHotkey :            v1.1.33.02 Unicode 64-bit (Installed)

no change

pasting

url := "%D0%B5%D0%B13600"
MsgBox,% LC_UriEncode(url)
MsgBox,% LC_UriDecode(url)
MsgBox,% LC_UrlEncode(url)
MsgBox,% LC_UrlDecode(url)

into https://raw.githubusercontent.com/ahkscript/libcrypt.ahk/master/src/URI.ahk results in printing

3600

3600

Edit: I found https://autohotkey.com/board/topic/17367-url-encoding-and-decoding-of-special-characters/page-2#entry735783 so this isn't a new problem.

G33kDude commented 4 years ago

I still cannot reproduce your issue; however, I can fix the bug you found in the original code from this post:

UriDecode(Uri)
{
    VarSetCapacity(Out, StrLen(Uri)+1, 0), Ptr := &Out, i := 1
    While (Char := SubStr(Uri, i++, 1)) != ""
        NumPut(Char=="%"?"0x" SubStr(Uri,-2+i+=2,2):Asc(Char),Ptr++,"UChar")
    Return, StrGet(&Out, "UTF-8")
}
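
The only change from the first post is the loop condition, and the reason it matters is that AutoHotkey v1 treats the string "0" as false. A bare While Char := ... therefore stops at the first "0" character, which matches the "еб36" result reported for an input ending in 3600. The sketch below isolates that difference; ReadTruthy and ReadExplicit are hypothetical names used only for this illustration.

```autohotkey
; Walks the string with a truthiness test, as the first post's loop did.
ReadTruthy(Str)
{
    i := 1, Seen := ""
    While Char := SubStr(Str, i++, 1)            ; "0" evaluates to false
        Seen .= Char
    Return Seen
}

; Walks the string with an explicit comparison against the empty string.
ReadExplicit(Str)
{
    i := 1, Seen := ""
    While (Char := SubStr(Str, i++, 1)) != ""    ; only "" ends the loop
        Seen .= Char
    Return Seen
}

MsgBox, % ReadTruthy("3600")      ; 36
MsgBox, % ReadExplicit("3600")    ; 3600
```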

Also, if you would like to continue using the htmlfile COM object I would recommend code such as the following:

UriDecode(uri)
{
    static dom := ComObjCreate("htmlfile"), _ := dom.write("<script></script>")
    return dom.parentWindow.decodeURIComponent(uri)
}