Closed — joshuawn closed this issue 8 years ago
Yep, the entire assignments process could do with some improvement (the current implementation is very naive).
I think I'm going to settle for downloading all non-HTML files on the main page. I'll probably use a solution similar to the one you pasted.
One other thing I've noticed with a few courses (such as androidapps101-002) is that they have custom assignment and preliminary-setup pages in the course's /wiki/ section instead of linking directly from the sidebar. Explicitly iterating over all /wiki/ sub-URLs with download_all_zips_on_page would be a good idea; otherwise some of these courses end up devoid of hands-on content.
One possible (but slow) implementation is to create a 'wiki' folder, recursively scan every link on each page for the substring '[coursera course url]/wiki/', and then save each .html page along with all supporting material using the download_all_zips_on_page function. Wiki pages that have already been visited can be skipped.
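The crawl described above could be sketched roughly as follows. This is a minimal, stdlib-only sketch, not the repo's actual code: `extract_wiki_links` and `crawl_wiki` are hypothetical helper names, and the `fetch`/`process_page` callables stand in for a `session.get(...)` call and for saving the HTML plus running `download_all_zips_on_page`.

```python
# Hypothetical sketch of the recursive /wiki/ crawl proposed in the thread.
# Only the standard library is used so the idea is self-contained.
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkParser(HTMLParser):
    """Collect every href attribute found in <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_wiki_links(page_html, base_url, course_url):
    """Return absolute links on the page that point into the course's /wiki/ section."""
    parser = _LinkParser()
    parser.feed(page_html)
    wiki_prefix = course_url.rstrip("/") + "/wiki/"
    links = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative hrefs
        if absolute.startswith(wiki_prefix):
            links.append(absolute)
    return links


def crawl_wiki(start_url, course_url, fetch, process_page):
    """Breadth-first crawl of wiki pages, skipping URLs already visited."""
    visited = set()
    queue = [start_url]
    while queue:
        url = queue.pop(0)
        if url in visited:  # the "skip if already visited" step
            continue
        visited.add(url)
        page_html = fetch(url)         # e.g. session.get(url).text
        process_page(url, page_html)   # e.g. save the .html + download_all_zips_on_page
        queue.extend(extract_wiki_links(page_html, url, course_url))
    return visited
```

The visited set is what keeps the recursion from looping forever on wiki pages that link back to each other.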
God the old Coursera platform is so unstandardized it's frustrating. I can almost see why they want to force everybody to switch.
I changed the file-extension check to use os.path.splitext.
I also added download_zips to all the sidebar links.
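An os.path.splitext-based check along these lines might look like the sketch below. This is illustrative only: the `is_downloadable` helper and the `HW_EXTENSIONS` set are assumptions, not the project's actual names or its actual extension list.

```python
# Hedged sketch of an extension check via os.path.splitext.
# HW_EXTENSIONS and is_downloadable are illustrative names, not repo code.
import os
from urllib.parse import urlparse

HW_EXTENSIONS = {".zip", ".pdf", ".py"}  # assumed set of homework file types


def is_downloadable(url):
    """True only when the URL's path actually ends in a known extension."""
    path = urlparse(url).path            # drop any query string or fragment
    _, ext = os.path.splitext(path)      # ".py" for "/files/hw1.py", "" for "/wiki/"
    return ext.lower() in HW_EXTENSIONS
```

Because splitext only looks at the final path component's suffix, a host name like python.org no longer triggers a false ".py" match.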
As I don't have much experience with Python, I apologize in advance if my terminology doesn't quite match my implementation.
In the function download_all_zips_on_page(session, path='assignments'), the check for each hw_string scans the entire Unicode URL string and reports success on the first match anywhere in it. This is a problem when the URL contains one of these substrings somewhere other than the very end.
For instance, if the URL is "www.python.org", the check erroneously classifies it as a file because of the embedded substring ".py".
This might not cover every edge case, but one way to parse the URL as intended, without causing inevitable IOErrors, is to check only the very end of the URL:
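A minimal sketch of that end-of-URL check, using str.endswith with a tuple of suffixes; the `HW_STRINGS` tuple and `url_is_file` name are assumptions standing in for the thread's hw_string values, not the project's actual code.

```python
# Instead of `hw_string in url` (which matches ".py" anywhere, e.g. inside
# "www.python.org"), test only the very end of the URL.
HW_STRINGS = (".zip", ".pdf", ".py")  # assumed file-type suffixes


def url_is_file(url):
    """Match a known suffix only at the end of the URL string."""
    return url.lower().endswith(HW_STRINGS)  # endswith accepts a tuple
```

As noted above this still misses edge cases (e.g. download URLs with trailing query strings), but it avoids the false positives that cause the IOErrors.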