KoalaBear84 / OpenDirectoryDownloader

Indexes open directories
GNU General Public License v3.0
1.14k stars 92 forks

Output URL are not correctly encoded #142

Open maaaaz opened 3 weeks ago

maaaaz commented 3 weeks ago

Hello there,

I observe that even the latest version of ODD (v3.1.0.1) does not properly encode URLs in the output file.

Let me detail the case:

  1. First, let's ODD a (randomly found on the internet) website containing some special chars in the path:

    $ ./OpenDirectoryDownloader -u "https://gregoirelorieux.net/paysagescomposes/villes/Melle/" --output-file test
    [...]
    Finshed indexing
    [...]
    Saving URL list to file..
    Saved URL list to file: /tmp/test.txt
  2. Then let's see the first results of the output file:

    $ head test.txt
    https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif
    [...]
  3. If we try to download the first file with wget (or other download managers), it fails because the URL contains unencoded characters: "#" and spaces.

    
    $ wget -v "https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif"
    --2024-10-29 23:22:12--  https://gregoirelorieux.net/paysagescomposes/villes/Melle/
    Resolving gregoirelorieux.net (gregoirelorieux.net)... 213.186.33.87
    Connecting to gregoirelorieux.net (gregoirelorieux.net)|213.186.33.87|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 844 [text/html]
    Saving to: ‘index.html’

    index.html 100%[===============================================================================>] 844 --.-KB/s in 0s

    2024-10-29 23:22:13 (550 MB/s) - ‘index.html’ saved [844/844]


Here, the downloaded file:
* **is not** the requested one: `https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif`
* **but is the one at this automatically truncated link**: `https://gregoirelorieux.net/paysagescomposes/villes/Melle/`

`wget` treats everything after the first "#" as a URL fragment, which is not sent to the server, so only the part before it is requested.
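To illustrate why the request stops at `Melle/`, here is a minimal Python sketch. It does not show wget's own parser; it only assumes that Python's `urllib.parse.urlsplit` follows the same generic URL syntax, in which the first "#" starts the fragment.

```python
from urllib.parse import urlsplit

# Raw line taken from the ODD output file shown above
raw = ("https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
       "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif")

parts = urlsplit(raw)

# The path ends at the first '#'; the rest is parsed as a fragment,
# and fragments are not part of the HTTP request.
print("path:    ", parts.path)
print("fragment:", parts.fragment)
```

With the "#" left unencoded, the directory names after it never reach the server, which matches the `index.html` download above.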

**The correct encoded link in the ODD output file should be**: 
 `https://gregoirelorieux.net/paysagescomposes/villes/Melle/%233%2021%20jan/Melle/contrebasse-echantillons/cb-arco-1.aif`  

Instead of:  
`https://gregoirelorieux.net/paysagescomposes/villes/Melle/#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif`

Can you fix it?

The [`encodeURIComponent`](https://developer.mozilla.org/fr/docs/Web/JavaScript/Reference/Global_Objects/encodeURIComponent) function should help.
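As a quick check of the expected result, here is a small Python sketch using `urllib.parse.quote` as a stand-in for `encodeURIComponent` (an assumption for illustration, not ODD's code). It also shows that encoding the whole URL with no safe characters escapes ":" and "/" as well, so the encoding has to leave those separators alone.

```python
from urllib.parse import quote

raw = ("https://gregoirelorieux.net/paysagescomposes/villes/Melle/"
       "#3 21 jan/Melle/contrebasse-echantillons/cb-arco-1.aif")

# Keep ':' and '/' literal so the scheme and path separators survive
encoded = quote(raw, safe=":/")
print(encoded)

# Encoding everything also escapes ':' and '/', which breaks the URL
whole = quote(raw, safe="")
print(whole)
```

The first value is the link given above as the correct one; the second starts with `https%3A%2F%2F` and is no longer usable as a URL.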

Cheers!

KoalaBear84 commented 3 weeks ago

Hi, thanks for letting me know. I'll try to look at it ASAP 😅

KoalaBear84 commented 3 weeks ago

I tried to build a new version with a partial fix, which may be the definitive fix for now, but GitHub won't let me anymore because they deprecated/disabled older build actions. Will continue another time.

Chaphasilor commented 3 weeks ago

I think this should be optional (but maybe the default). I've encountered servers in the past that didn't treat encoded URLs the same as the raw URL, seemingly because they didn't decode them (or not correctly). Improving the parsing in the downloader itself, or manually passing a quoted URL to it, should work even with URLs that aren't encoded.
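A rough Python sketch of what "optional, but on by default" could look like. The function name `encode_listing_url` and its `encode` parameter are hypothetical and are not part of ODD; the sketch only assumes each line is a raw `scheme://host/path` URL whose path should be percent-encoded.

```python
from urllib.parse import quote

def encode_listing_url(raw_url: str, encode: bool = True) -> str:
    """Percent-encode the path of a raw URL; pass it through if encode is False."""
    if not encode:
        # Opt-out for servers that do not decode encoded URLs correctly
        return raw_url
    scheme, sep, rest = raw_url.partition("://")
    if not sep:
        # No scheme found: treat the whole string as a path
        return quote(raw_url, safe="/")
    # Leave the host (and optional port) untouched, encode only the path
    authority, slash, path = rest.partition("/")
    return scheme + sep + authority + slash + quote(path, safe="/")

print(encode_listing_url("https://example.com/a dir/#1 file.txt"))
print(encode_listing_url("https://example.com/a dir/#1 file.txt", encode=False))
```

Splitting on the first "/" after the scheme, rather than using a URL parser, is deliberate here: a parser would treat the raw "#" as the start of a fragment instead of as part of the path.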

maaaaz commented 3 weeks ago

I think this should be the default, as download managers do not support unencoded URLs.

In the meantime, a Python solution to properly encode ODD output file:

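    $ cat script.py
    #!/usr/bin/python3
    # Percent-encode each URL read from stdin, keeping ':' and '/' literal.
    # Note: '%' is encoded too, so already-encoded URLs get double-encoded.

    import sys
    import urllib.parse

    for line in sys.stdin:
        print(urllib.parse.quote(line.strip(), safe=':/'))

    $ cat oddresult | python3 script.py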