benibela / xidel

Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
http://www.videlibri.de/xidel.html
GNU General Public License v3.0

xidel skips "+" character when urldecode #111

Closed by Baltazar500 6 months ago

Baltazar500 commented 7 months ago

Hi.

xidel skips "+" character when urldecode :(

$ echo "%3D+%3D" | xidel -se 'uri-decode($raw)'
= =

Xidel 0.9.9 (20231105.git59e545f2ae619d94dfc31fcbf518ebca352c342d)

Reino17 commented 7 months ago

https://en.wikipedia.org/wiki/Percent-encoding#The_application/x-www-form-urlencoded_type:

The encoding used by default is based on an early version of the general URI percent-encoding rules, with a number of modifications such as newline normalization and replacing spaces with + instead of %20.

Also see https://stackoverflow.com/questions/1634271/url-encoding-the-space-character-or-20.

So both decode to a <space> and Xidel is correct.

$ xidel -se 'uri-encode("= =")'
%3D%20%3D

$ xidel -se '
  uri-decode("%3D%20%3D")
  uri-decode("%3D+%3D")
'
= =
= =
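
For comparison outside Xidel, here is a minimal Python sketch of the two decoding conventions discussed above. It uses only the standard library's `urllib.parse`. The variable names are just for illustration, and nothing here says how Xidel implements `uri-decode()` internally.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

encoded = "%3D+%3D"

# Form-style decoding (application/x-www-form-urlencoded):
# "+" and "%20" are both treated as a space.
form_decoded = unquote_plus(encoded)

# Generic percent-decoding: only %XX escapes are decoded,
# so a literal "+" is left untouched.
plain_decoded = unquote(encoded)

# The matching encoders differ in the same way.
percent_encoded = quote("= =")    # space becomes %20
form_encoded = quote_plus("= =")  # space becomes +

print(repr(form_decoded))
print(repr(plain_decoded))
print(percent_encoded)
print(form_encoded)
```

Which of the two is "right" depends on where the string came from: form/query data uses the first convention, other URL components use the second.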

Baltazar500 commented 7 months ago

So both decode to a <space> and Xidel is correct.

But the problem is that after urldecode/urlencode, a URI/path that contains the "+" character becomes incorrect.

php/php-cli doesn't decode "+" into spaces :/

Reino17 commented 7 months ago

It's really hard to help you if you don't tell us exactly what your full input is and what output you're expecting.
If I'm guessing however, you'll probably need request-decode().

Baltazar500 commented 7 months ago

@Reino17, here is an example xml with uri/path containing "+"


<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
    <string name="%2Fstorage%2Fsdcard1%2FDownload%2Ffiles+folders.txt">250,250,1705803067359</string>
</map>

After urldecode the path becomes incorrect because the "+" symbol is replaced by a space.

Reino17 commented 7 months ago

https://stackoverflow.com/a/40292770:

A space may only be encoded to "+" in the [...] key-value pairs query part of an URL. [...] In the rest of URLs, it is encoded as %20.

"files+folders.txt" in your input is not the query part of an URL. It's the path.
If you expect the output to be /storage/sdcard1/Download/files+folders.txt, then the input is simply wrong. The + should have been %2B instead.
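
To make the point concrete, here is a rough Python sketch (standard library only, not Xidel) that reads the `name` attribute from the example XML and decodes it both ways. The `XML` constant is the sample from this thread with the stray closing tag dropped. Treating the "+" as a literal character is an assumption about what the producing application meant.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, unquote, unquote_plus

# Sample document from this thread (stray closing tag removed).
XML = b"""<?xml version='1.0' encoding='utf-8' standalone='yes' ?>
<map>
    <string name="%2Fstorage%2Fsdcard1%2FDownload%2Ffiles+folders.txt">250,250,1705803067359</string>
</map>"""

name = ET.fromstring(XML).find("string").get("name")

# Path-style decoding keeps the literal "+".
path_keep_plus = unquote(name)

# Form-style decoding turns "+" into a space, which is what was reported.
path_form = unquote_plus(name)

# Encoding the intended path strictly yields %2B for "+",
# which survives either decoder.
reencoded = quote(path_keep_plus, safe="")

print(path_keep_plus)
print(path_form)
print(reencoded)
```

With `%2B` in the input, both decoding styles give back the same path, so the ambiguity disappears.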