Chillee / coursera-dl-all


Check for extension in download_all_zips_on_page is naive #1

Closed: joshuawn closed this issue 8 years ago

joshuawn commented 8 years ago

As I don't have much experience with Python, I must apologize in advance if my terminology doesn't quite match up with my implementation.

In the function download_all_zips_on_page(session, path='assignments'), the check for each hw_string scans the entire unicode string and registers a match on the first occurrence. This is a problem if the URL contains one of these substrings anywhere before the very end.

For instance, if the URL is "www.python.org", the check erroneously classifies the URL as a file because of the embedded substring ".py".
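
For illustration, assuming the existing check uses Python's substring operator (in), any occurrence anywhere in the string counts as a match, not just a trailing extension:

    url = u'http://www.python.org'
    u'.py' in url          # True: matches mid-string, misclassifying the page as a file
    url.endswith(u'.py')   # False: only a match at the very end indicates a file extension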

This might not cover all edge cases, but one way of parsing the URL as intended, without causing inevitable IOErrors, is to check only the very end of the URL:

    for i in links:
        url = i.get_attribute('href')
        if url is None:
            continue
        txt_file.write(url + '\n')
        hw_strings = [u'.zip', u'.py', u'.m', u'.pdf']
        is_hw = False

        # The URL may have a trailing slash; strip it off before
        # comparing extensions.
        url = url.rstrip(u'/')

        for j in hw_strings:
            # Only a match at the very end of the URL counts.
            if url.endswith(j):
                is_hw = True
                break
Chillee commented 8 years ago

Yep, the entire assignments process could do with some improvement (the current implementation is very naive).

I think I'm going to settle for downloading all non-HTML files on the main page. I'll probably use some kind of solution similar to what you pasted.
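
A minimal sketch of that idea, assuming the same Selenium-style links list as above and a hypothetical download_file helper (that helper is not the repo's actual API):

    import os

    # Links with no extension, or with an HTML extension, are treated as
    # pages rather than downloadable files.
    skip_exts = (u'', u'.html', u'.htm')

    for link in links:
        url = link.get_attribute('href')
        if url is None:
            continue
        ext = os.path.splitext(url.rstrip(u'/'))[1].lower()
        if ext not in skip_exts:
            download_file(session, url)  # hypothetical download helper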

joshuawn commented 8 years ago

One other thing I've noticed with a few of the courses (such as androidapps101-002) is that they have custom assignment & preliminary setup pages in the course's /wiki/ section instead of linking directly through the sidebar. Explicitly iterating through all /wiki/ sub-URLs using download_all_zips_on_page would be a good idea, as otherwise some of these courses are devoid of hands-on content.

One possible (but slow) implementation is to create a 'wiki' folder, recursively scan every link on each page for the substring '[coursera course url]/wiki/', and then save each .html along with all supporting material by using the download_all_zips_on_page function. If a wiki page has already been visited, it can be skipped.
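
A hedged sketch of that crawl, reusing the existing download_all_zips_on_page and assuming hypothetical save_page_html and links_on_page helpers:

    def crawl_wiki(session, course_url, start_url):
        # Breadth-first walk over the course's /wiki/ pages, skipping any
        # page that has already been visited.
        visited = set()
        queue = [start_url]
        while queue:
            url = queue.pop(0)
            if url in visited:
                continue
            visited.add(url)
            save_page_html(session, url, folder='wiki')      # hypothetical: save the .html
            download_all_zips_on_page(session, path='wiki')  # grab supporting material
            for link_url in links_on_page(session, url):     # hypothetical: hrefs on the page
                if course_url + '/wiki/' in link_url:
                    queue.append(link_url)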

Chillee commented 8 years ago

God the old Coursera platform is so unstandardized it's frustrating. I can almost see why they want to force everybody to switch.

Chillee commented 8 years ago

I changed the file-extension check to use os.path.splitext.
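
For reference, the shape of an os.path.splitext-based check (a sketch, not the committed code):

    import os

    hw_exts = {u'.zip', u'.py', u'.m', u'.pdf'}

    # splitext splits off only the trailing extension, so a mid-string
    # '.py' as in 'www.python.org' no longer counts as a match.
    ext = os.path.splitext(url.rstrip(u'/'))[1].lower()
    is_hw = ext in hw_exts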

I also added download_zips to all the sidebar links.