issue on Windows, filename containing ":"

coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)

GNU Lesser General Public License v3.0


issue on Windows, filename containing ":" #333

Open charle-y opened 8 years ago

charle-y commented 8 years ago

As an example, it looks like:

https://courses.edx.org/asset-v1:HarvardX+SW12.1x+2015+type@asset+block@ChinaX_Pomerantz_Part_1_4-6.pdf => Downloaded\China_Part_10-_Greater_China_Today-_The_Peoples__Republic_Taiwan_and_Hong_Kong\24-Notes_Sharing\01-asset-v1:HarvardX+SW12.1x+2015+type@asset+block@ChinaX_Pomerantz_Part_1_4-6.pdf

A file named "01-asset-v1" is created, but it is 0 KB. I am using Windows 10, but I think this is a common issue on all Windows systems. Thanks in advance.

iemejia commented 8 years ago

Can somebody with access to Windows help us check this one?

hwasiti commented 8 years ago

Same problem here. Windows does not allow file names containing any of these characters: : \ / * ? " < > |

edx-dl should be able to omit such characters from file names.

If you need help testing after any change to the source code, I am happy to do it.

This course has such pdf files (asset): https://courses.edx.org/courses/course-v1:OsakaUx+CNR101x+1T2016/info

edx-dl tried this: [download] http://courses.edx.org/asset-v1:OsakaUx+CNR101x+1T2016+type@asset+block@osakaux_cnr101x_wk1_handout.pdf => J:\Edx\Cognitive_Neuroscience_Robotics__Part_A\07-Weekly_Handout\01-asset-v1:OsakaUx+CNR101x+1T2016+type@asset+block@osakaux_cnr101x_wk1_handout.pdf

But only a 0-byte file named "01-asset-v1" was downloaded.
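
A small sketch of the diagnosis, for anyone who wants to reproduce it without Windows. It assumes the target name is the prefix plus the last path segment of the URL; `filename_from_url` and `forbidden_chars` are hypothetical helper names, not edx-dl functions. The 0-byte "01-asset-v1" is probably explained by NTFS treating the part after ":" as an alternate data stream name, but I have not verified that on this exact setup.

```python
# Characters that Windows does not allow in a file name.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def filename_from_url(url):
    """Hypothetical helper: use the last path segment of the URL as the name."""
    return url.rsplit('/', 1)[1]


def forbidden_chars(filename):
    """Return the forbidden characters found, in order of first appearance."""
    found = []
    for ch in filename:
        if ch in WINDOWS_FORBIDDEN and ch not in found:
            found.append(ch)
    return found


url = ("http://courses.edx.org/asset-v1:OsakaUx+CNR101x+1T2016"
       "+type@asset+block@osakaux_cnr101x_wk1_handout.pdf")
name = "01-" + filename_from_url(url)

print(name)
print(forbidden_chars(name))  # [':']
```

For this asset the only offending character is the colon; "+" and "@" are fine on Windows.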

xunilrj commented 6 years ago

I have a possible solution. What do you think?

usage:

./edx.py --sanitize-filename ":\/*?<>"

https://github.com/coursera-dl/edx-dl/blob/master/edx_dl/edx_dl.py#L683

def _build_filename_from_url(args, url, target_dir, filename_prefix):
    """
    Builds the appropriate filename for the given args
    """
    if is_youtube_url(url):
        filename = filename_prefix + "-%(title)s-%(id)s.%(ext)s"
    else:
        original_filename = url.rsplit('/', 1)[1]
        filename = filename_prefix + '-' + original_filename

    # https://stackoverflow.com/a/38748649/5397116
    def remove(str_, chars):
        try:
            # Python2.x
            return str_.translate(None, chars)
        except TypeError:
            # Python 3.x
            table = {ord(char): None for char in chars}
            return str_.translate(table)

    if args.sanitize_filename:
        filename = remove(filename, args.sanitize_filename)

    filename = os.path.join(target_dir, filename)
    return filename
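
To show what the removal step does to one of the names reported above, here is a Python 3 only sketch of the same idea using `str.maketrans`. The name `remove_chars` is just for illustration, and the character set is the one from the usage line.

```python
def remove_chars(name, chars):
    """Delete every character in `chars` from `name` (Python 3 only)."""
    # With three arguments, str.maketrans maps each character of the
    # third argument to None, which makes translate() drop it.
    return name.translate(str.maketrans('', '', chars))


original = ("01-asset-v1:OsakaUx+CNR101x+1T2016"
            "+type@asset+block@osakaux_cnr101x_wk1_handout.pdf")
cleaned = remove_chars(original, ':\\/*?<>')

print(cleaned)
# 01-asset-v1OsakaUx+CNR101x+1T2016+type@asset+block@osakaux_cnr101x_wk1_handout.pdf
```

The youtube-dl output template "%(title)s-%(id)s.%(ext)s" contains none of these characters, so it passes through unchanged. Note that the character set in the usage line leaves out `"` and `|`, which Windows also rejects, so users would need to add them themselves.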

rbrito commented 6 years ago

@xunilrj, please send a pull request with your solution so that we can evaluate it properly, if this is still a problem.

BTW, youtube-dl already has a sanitization facility built in, and you probably want to use that (or adapt what is there).
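
For reference, a rough stdlib-only sketch of what an always-on sanitizer could look like. This is not youtube-dl's implementation, and `sanitize_windows_filename` is a made-up name. It replaces forbidden characters instead of deleting them, and also handles two other Windows restrictions: names ending in a space or dot, and reserved device names such as CON or NUL.

```python
import re

# Forbidden on Windows: < > : " / \ | ? * and ASCII control characters.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')

# Reserved device names, with or without an extension.
_RESERVED = ({'CON', 'PRN', 'AUX', 'NUL'}
             | {'COM%d' % i for i in range(1, 10)}
             | {'LPT%d' % i for i in range(1, 10)})


def sanitize_windows_filename(name, replacement='_'):
    """Return a version of `name` (a single path component) valid on Windows."""
    cleaned = _FORBIDDEN.sub(replacement, name)
    # Windows does not allow names that end in a space or a dot.
    cleaned = cleaned.rstrip(' .')
    # "con.txt" is as invalid as "con", so compare the part before the first dot.
    stem = cleaned.split('.', 1)[0]
    if stem.upper() in _RESERVED:
        cleaned = replacement + cleaned
    # Never return an empty name.
    return cleaned or replacement


print(sanitize_windows_filename('01-asset-v1:HarvardX+SW12.1x.pdf'))
# 01-asset-v1_HarvardX+SW12.1x.pdf
```

Replacing rather than deleting keeps the name readable and makes it less likely that two different assets collapse to the same file name. It must only be applied to the file name itself, never to the full path, since it would also replace the path separators.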