ckan / ckanext-archiver

Archive CKAN resources
MIT License

No handling for encoded URLs #46

Open KrzysztofMadejski opened 7 years ago

KrzysztofMadejski commented 7 years ago

I have run into this issue here: https://danepubliczne.gov.pl/dataset/informacja-kwartalna-o-stanie-finansow-publicznych/resource/86454cff-556a-4162-aa65-433158c133f4

Basically, the provider linked the external resource as: http://www.mf.gov.pl/documents/764034/1002163/Informacja+kwartalna++III+kwarta%C5%82+2016+r.. To make this clearer, let's assume the file name is kwarta%C5%82+2016

This file is saved to disk as is, i.e. under the name kwarta%C5%82+2016. Apache then serves it with the percent signs escaped, as kwarta%25C5%2582+2016, while CKAN links to the archived version using the form from the original URL, kwarta%C5%82+2016. That leads to a 404 error on the archived link.
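The mismatch can be reproduced outside CKAN and Apache. The following is a minimal Python 3 sketch using only the standard library; the variable names are illustrative and not taken from ckanext-archiver.

```python
from urllib.parse import quote, unquote

# File name on disk, kept exactly as it appeared in the source URL.
stored_name = "kwarta%C5%82+2016"

# URL path needed to address a file with that literal name:
# every "%" in the name has to be escaped as "%25".
url_for_stored_file = quote(stored_name, safe="+")

# Name a server looks for when the archived link reuses the original form:
# "%C5%82" is decoded to the character "ł".
requested_name = unquote(stored_name)

print(url_for_stored_file)            # kwarta%25C5%2582+2016
print(requested_name == stored_name)  # False, hence the 404
```

The link and the file on disk only agree if the name is either decoded before saving or percent-encoded again when the link is built.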

I think we should either decode all incoming URLs (see below) or strip all encoded characters. What do you think?

    # ckanext/archiver/tasks.py:556
    try:
        file_name = parsed_url.path.split('/')[-1] or 'resource'
        file_name = urllib.unquote(file_name) # DECODING ADDED HERE
        file_name = file_name.strip()  # trailing spaces cause problems
        file_name = file_name.encode('ascii', 'ignore')  # e.g. u'\xa3' signs
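
For readers on Python 3, where `urllib.unquote` has moved to `urllib.parse.unquote`, the same idea might look roughly like the sketch below. The function name `archive_file_name` and the `example.org` URLs are hypothetical and only serve the illustration; this is not the code in `tasks.py`.

```python
from urllib.parse import urlparse, unquote


def archive_file_name(url):
    """Derive a file name from a URL, decoding percent-escapes exactly once."""
    path = urlparse(url).path
    file_name = path.split('/')[-1] or 'resource'
    file_name = unquote(file_name)  # decode once, e.g. "%C5%82" -> "ł"
    file_name = file_name.strip()   # trailing spaces cause problems
    # Drop non-ASCII characters, then convert back to str for use as a name.
    return file_name.encode('ascii', 'ignore').decode('ascii')


print(archive_file_name("http://example.org/files/kwarta%C5%82+2016"))  # kwarta+2016
print(archive_file_name("http://example.org/"))                         # resource
```

Note that with the ASCII step kept, the decoded "ł" is dropped from the name, so the stored name no longer contains a percent sign but also no longer matches the original spelling.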

thorge commented 11 months ago

The archiver extension in CKAN appears to be unintentionally percent-encoding URLs that are already percent-encoded. For instance, a URL path like kwarta%C5%82+2016 is already percent-encoded, but it ends up as kwarta%25C5%2582+2016, which breaks the archived link.

According to RFC 3986:

"Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts. This is when an implementation determines which of the reserved characters are to be used as subcomponent delimiters and which can be safely used as data. Once produced, a URI is always in its percent-encoded form.

[...]

Because the percent ("%") character serves as the indicator for percent-encoded octets, it must be percent-encoded as "%25" for that octet to be used as data within a URI. Implementations must not percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string."

This means that your suggestion of always decoding incoming URLs does not comply with RFC 3986. Instead, the percent character ("%") should be used as an indicator to determine whether decoding needs to be performed.
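
To illustrate the RFC's warning, here is a small Python 3 sketch of what decoding the same string twice does, together with one possible reading of "use the percent character as an indicator". The helper `looks_percent_encoded` and the sample string are assumptions made for this example, not part of ckanext-archiver or the RFC.

```python
import re
from urllib.parse import unquote

# Hypothetical input whose data contains the literal text "%20".
encoded = "report%2520final"

once = unquote(encoded)  # correct: a single decode restores the data
twice = unquote(once)    # wrong: the literal "%20" is misread as a space

print(once)   # report%20final
print(twice)  # report final


def looks_percent_encoded(text):
    """Return True if text contains at least one %XX escape (heuristic only)."""
    return re.search(r"%[0-9A-Fa-f]{2}", text) is not None


print(looks_percent_encoded("kwarta%C5%82+2016"))  # True
print(looks_percent_encoded("kwarta+2016"))        # False
```

The check is only a heuristic: after one decode, `report%20final` still looks encoded, so the reliable rule is to track whether a string has already been decoded rather than to inspect it afterwards.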

It's also worth considering related discussions in issue #91 for additional context and potential solutions to this problem.