dgorissen / coursera-dl

A script for downloading course material (videos, PDFs, quizzes, etc.) from coursera.org
http://dirkgorissen.com/2012/09/07/coursera-dl-a-coursera-download-script/
GNU General Public License v3.0
1.74k stars 299 forks

Too long path in Windows #8

Closed olegafx closed 11 years ago

olegafx commented 11 years ago

Some courses contain materials with very long path names.

Example: inforiskman-2012-001\08 - Week 7\08 - Business Continuity and Disaster Recovery Michael Ness, Part 1 - Leadership Selling Your Ideas (1542)\8 - 8 - Business Continuity and Disaster Recovery Michael Ness, Part 1 - Leadership Selling Your Ideas (1542).srt

dgorissen commented 11 years ago

Added a limit on the filename length set to roughly the OS limits. Are you actually getting an error or just finding such long filenames/paths annoying?

olegafx commented 11 years ago

Not solved:

Failed to download url https://class.coursera.org/algo2-2012-001/lecture/subtitles?q=43_en&format=txt to C:\Videos\Coursera\algo2-2012-001\02 - II. SELECTED REVIEW FROM PART I (Week 1)\03 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min)\2 - 3 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min).txt: [Errno 2] No such file or directory: 'C:\Videos\Coursera\algo2-2012-001\02 - II. SELECTED REVIEW FROM PART I (Week 1)\03 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min)\2 - 3 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min).txt'

Failed to download url https://class.coursera.org/algo2-2012-001/lecture/subtitles?q=43_en&format=srt to C:\Videos\Coursera\algo2-2012-001\02 - II. SELECTED REVIEW FROM PART I (Week 1)\03 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min)\2 - 3 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min).srt: [Errno 2] No such file or directory: 'C:\Videos\Coursera\algo2-2012-001\02 - II. SELECTED REVIEW FROM PART I (Week 1)\03 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min)\2 - 3 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min).srt'

Failed to download url https://class.coursera.org/algo2-2012-001/lecture/download.mp4?lecture_id=43 to C:\Videos\Coursera\algo2-2012-001\02 - II. SELECTED REVIEW FROM PART I (Week 1)\03 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min)\2 - 3 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min).mp4: [Errno 2] No such file or directory: 'C:\Videos\Coursera\algo2-2012-001\02 - II. SELECTED REVIEW FROM PART I (Week 1)\03 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min)\2 - 3 - Guiding Principles for Analysis of Algorithms [Part I Review - Optional](15 min).mp4'

Lubomir-Russia commented 11 years ago

The problem is that on Windows the 260-character limit applies to the full path, not to the file name alone (for more details, search Stack Overflow).

So the check len(fileName) < 260 will not prevent the download error. Unfortunately, the check len(os.path.abspath(fileName)) < 260 will not help either, because if the absolute path is longer than 260 characters, Windows returns only the fileName itself. I'm not sure whether this is a bug or a feature of os.path.abspath().
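
To illustrate the difference between the two checks, here is a minimal Python 3 sketch. It uses `ntpath` so that Windows path joining can be tried on any OS. The helper name `full_path_fits`, the constant `MAX_PATH`, and the sample directory and file name are made up for illustration; the limit of 260 is simply the figure quoted in this thread.

```python
import ntpath  # Windows path rules, importable on any OS

MAX_PATH = 260  # assumed limit, taken from the figure quoted in this thread


def full_path_fits(directory, file_name, limit=MAX_PATH):
    """Return True if directory + separator + file_name is shorter than limit."""
    return len(ntpath.join(directory, file_name)) < limit


# Hypothetical example: a long directory plus a moderately long file name.
directory = "C:\\Videos\\Coursera\\" + "d" * 150
file_name = "f" * 120 + ".srt"

name_only_ok = len(file_name) < MAX_PATH              # passes, yet the download fails
whole_path_ok = full_path_fits(directory, file_name)  # catches the real problem
print(name_only_ok, whole_path_ok)
```

The point is only that the directory part has to be counted too; how the directory is obtained in the downloader is a separate question.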

This piece of code in sanitiseFileName can be a quick and dirty fix:

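# ensure the full file name stays within a sane maximum
max_len = 250  # renamed from "max" to avoid shadowing the built-in

# approximates the full path as current directory + file name;
# intermediate directories and the path separator are not counted
fullFileNameLength = len(os.getcwd()) + len(s)
if fullFileNameLength > max_len:
    cutFileNameTail = fullFileNameLength - max_len
    print "    - The length of full file name is ", fullFileNameLength, " > max limit of ", max_len
    print "    - Original / shortened file name:"
    print s
    # split off extension, trim, and re-add the extension
    # (this cuts len(ext) more characters than strictly needed, leaving a small margin)
    fn,ext = os.path.splitext(s)
    s = fn[:-(cutFileNameTail+len(ext))] + ext
    print s

return s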

It is dirty because it is neither OS-aware nor user-configurable.

This patch was verified on the first weeks of the Drugs and the Brain course.

dgorissen commented 11 years ago

Had a closer look, and it's actually quite tricky to fix properly as there are lots of corner cases. Windows paths are a mess :) Will have another look later.

archit90 commented 11 years ago

Isn't a better way to fix this to download the lecture videos and .srt files into the week's folder, and the rest of the lecture resources into a separate folder for each lecture?

dgorissen commented 11 years ago

not quite sure what you mean, feel free to clarify or propose a patch :)


Web / Blog : http://dirkgorissen.com Twitter : https://twitter.com/elazungu


rodch-us commented 11 years ago

The long file name issue is getting worse. There is a new course, writing2, whose authors are taking poetic liberties with directory and file names. In the first week alone, the directory name runs to 240+ characters; in later weeks there may be no room left to create the directory structure, let alone the files. From searching the internet, the choices are limited. If you do decide to adopt the solution by Lubomir-Russia (mentioned above), then dgorissen, please create a log file and record the old and new file names in it. The screen output is already so verbose that I don't even look at it anymore. If we miss certain files because the original was renamed and two files ended up with the same name, we can at least look at the log file and work out what happened. Hopefully this can take care of 90% of such cases.
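
The log-file request above could be sketched roughly as follows, in Python 3. This is not code from coursera-dl: the helpers `unique_name`, `record_rename`, and `make_rename_logger`, the logger name, and the `_1`, `_2` suffix scheme are all assumptions made for illustration.

```python
import logging
import os


def unique_name(name, taken):
    """Return name, or name with a numeric suffix if it is already in taken."""
    stem, ext = os.path.splitext(name)
    candidate, counter = name, 1
    while candidate in taken:
        candidate = "%s_%d%s" % (stem, counter, ext)
        counter += 1
    taken.add(candidate)
    return candidate


def make_rename_logger(log_path):
    """Create (once) a logger that appends rename records to log_path."""
    logger = logging.getLogger("rename_log")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        logger.addHandler(logging.FileHandler(log_path))
    return logger


def record_rename(logger, old_name, new_name):
    """Write one line per rename so that shortened files can be traced back."""
    if old_name != new_name:
        logger.info("renamed %r -> %r", old_name, new_name)
```

One caveat with this sketch: the numeric suffix makes the name slightly longer, so it would have to be applied before the final length check, or the trimming would need to reserve a few characters for it.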

dgorissen commented 11 years ago

Annoying indeed. Hope to fix this, and redo the logging, but given my circumstances I have to say it may take a while. I will try my very best to fix bugs/crashes promptly though.



dniku commented 11 years ago

Having the same problem with images-2012-001. Maybe overly long file/directory names should just be trimmed?

juanse-dev commented 11 years ago

Yeah, I have this problem too. I think the easiest solution would be what Pastafarianist suggests: long file/directory names should be trimmed.

ilfats commented 11 years ago

Here's my version of a fix. It uses a parameter -t <max_path_length> and trims filenames in long paths to fit the specified max length. It does not trim path names as that would require more complex changes. This fix works for my ~20 courses, half of which had some path length issues.

diff --git a/courseradownloader/courseradownloader.py b/courseradownloader/courseradownloader.py
index 601d8cf..27062a9 100644
--- a/courseradownloader/courseradownloader.py
+++ b/courseradownloader/courseradownloader.py
@@ -42,7 +42,7 @@ class CourseraDownloader(object):
     # how long to try to open a URL before timing out
     TIMEOUT=60.0

-    def __init__(self,username,password,proxy=None,parser=DEFAULT_PARSER,ignorefiles=None):
+    def __init__(self,username,password,proxy=None,parser=DEFAULT_PARSER,ignorefiles=None, max_path_len=None):
         self.username = username
         self.password = password
         self.parser = parser
@@ -54,6 +54,7 @@ class CourseraDownloader(object):

         self.browser = None
         self.proxy = proxy
+        self.max_path_len = max_path_len

     def login(self,className):
         """
@@ -246,6 +247,32 @@ class CourseraDownloader(object):
         r = self.browser.open(url,timeout=self.TIMEOUT)
         return r.info()

+    def trimFileName(self, pathname):
+        """
+        Trim file name in given path name to fit max_path_len characters. Only file name is trimmed,
+        path names are not affected to avoid creating multiple folders for the same lecture.
+        """
+        MIN_LEN = 5  # Minimum length of file name to keep
+
+        if len(pathname) <= self.max_path_len:
+            return pathname
+
+        fpath, name = path.split(pathname)
+        name, ext = path.splitext(name)
+
+        to_cut = len(pathname) - self.max_path_len
+        to_keep = len(name) - to_cut
+
+        if to_keep < MIN_LEN:
+            print 'Cannot trim path name "%s" to fit required length (%d)' % (pathname, self.max_path_len)
+            return pathname
+
+        name = name[:to_keep]
+        new_pathname = path.join(fpath, name + ext)
+        print 'Trimmed path name "%s" to "%s" to fit required length (%d)' % (pathname, new_pathname, self.max_path_len)
+
+        return new_pathname
+
     def download(self, url, target_dir=".", target_fname=None):
         """
         Download the url to the given filename
@@ -270,6 +297,9 @@ class CourseraDownloader(object):

         filepath = path.join(target_dir,fname)

+        if self.max_path_len:
+            filepath = self.trimFileName(filepath)
+
         dl = True
         if path.exists(filepath):
             if clen > 0: 
@@ -567,6 +597,7 @@ def main():
                         default=False, help="download and save the sections in reverse order")
     parser.add_argument('course_names', nargs="+", metavar='<course name>',
                         type=str, help='one or more course names from the url (e.g., comnets-2012-001)')
+    parser.add_argument("-t", dest='max_path_len', type=int, help='attempt to trim path names to fit specified length, e.g. -t 259')
     args = parser.parse_args()

     # check the parser
@@ -593,8 +624,9 @@ def main():
             password = getpass.getpass()

     # instantiate the downloader class
-    d = CourseraDownloader(username,password,proxy=args.proxy,parser=html_parser,ignorefiles=args.ignorefiles)
-    
+    d = CourseraDownloader(username,password,proxy=args.proxy,parser=html_parser,ignorefiles=args.ignorefiles,
+        max_path_len=args.max_path_len)
+
     # authenticate, only need to do this once but need a classaname to get hold
     # of the csrf token, so simply pass the first one
     print "Logging in as '%s'..." % username
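
For anyone who wants to try the trimming logic from the diff above outside the downloader class, a free-function port to Python 3 might look like the sketch below. The name `trim_file_name` and the `min_len` parameter are mine, not part of the patch, and the progress messages are left out.

```python
from os import path


def trim_file_name(pathname, max_path_len, min_len=5):
    """Shorten only the file name so that pathname fits in max_path_len."""
    if len(pathname) <= max_path_len:
        return pathname

    fpath, name = path.split(pathname)
    name, ext = path.splitext(name)

    # Number of characters of the bare name that can be kept.
    to_keep = len(name) - (len(pathname) - max_path_len)
    if to_keep < min_len:
        # Too little of the name would survive, so return the input unchanged.
        return pathname

    return path.join(fpath, name[:to_keep] + ext)


# Hypothetical path: 67 characters, trimmed to fit in 40.
example = trim_file_name("course/week1/" + "x" * 50 + ".mp4", 40)
print(len(example), example)
```

As in the patch, directory names are never shortened, so a path whose directories alone exceed the limit comes back unchanged and the download would still fail.
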

dgorissen commented 11 years ago

Finally committed a fix, thanks in part to @ilfats. I don't have a Windows machine here to fully test with, but I'm assuming it's all OK. Reopen if there are further issues.