dgorissen / coursera-dl

A script for downloading course material (videos, PDFs, quizzes, etc.) from coursera.org
http://dirkgorissen.com/2012/09/07/coursera-dl-a-coursera-download-script/
GNU General Public License v3.0

Requests and Progress Bar #66

Closed lsoliveira459 closed 10 years ago

lsoliveira459 commented 10 years ago

In an attempt to monitor whether the program had halted, I used Requests instead of Mechanize to manage the connections. Using Requests, I added a small progress bar to help monitor whether the download has stalled.

It works fully on Windows 8, but I couldn't test it on Windows 7 or Linux. Below is a small proof that it works.

[screenshot: untitled]
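For readers following along, here is a minimal sketch of the kind of progress bar described above. It is not the code from this pull request. The function name `download_with_progress` and its parameters are hypothetical. With Requests, the chunks would typically come from `response.iter_content(...)` on a request made with `stream=True`, and the total from the `Content-Length` header when the server sends one.

```python
import sys


def download_with_progress(chunks, total_bytes, out, bar_width=30, stream=sys.stderr):
    """Write each chunk to `out` and redraw a one-line progress bar.

    `chunks` is any iterable of byte strings. `total_bytes` may be 0 or None
    when the size is unknown, in which case only a byte counter is shown.
    Returns the number of bytes written.
    """
    done = 0
    for chunk in chunks:
        out.write(chunk)
        done += len(chunk)
        if total_bytes:
            # Clamp so a wrong Content-Length cannot overflow the bar.
            filled = min(bar_width, bar_width * done // total_bytes)
            percent = min(100, 100 * done // total_bytes)
            bar = '#' * filled + '.' * (bar_width - filled)
            stream.write('\r[%s] %3d%%' % (bar, percent))
        else:
            stream.write('\r%d bytes' % done)
        stream.flush()
    stream.write('\n')
    return done
```

Because the bar is redrawn after every chunk, a stalled download shows up as a bar that stops moving, which is the monitoring goal stated above.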

dgorissen commented 10 years ago

Thanks very much for this. Will look to go through it and merge by the weekend.

dgorissen commented 10 years ago

Thanks again. Had a quick try but fails for me with the "did you accept the honour code" error on one of my classes (malsoftware-001). Works fine if I switch back to my master branch. Also, why the extra TOKEN_URL and why hardcoded to a specific class (ml)?

lsoliveira459 commented 10 years ago

I'm assuming you accepted the honour code. Is there anything special about this course that would trigger this? I'm a little clueless about this error, since I don't recall touching this exception.

Requesting TOKEN_URL gives us access to the "csrf_token". It's hardcoded simply to avoid having to find a URL dynamically (a solution I included, commented out). As a side note, this ML class was the first class offered by Coursera.

dgorissen commented 10 years ago

There is nothing special about that course. Looking closely it turns out that coursera replies with "Please use a modern browser with JavaScript enabled to use Coursera." Sporadically I also get:

    line 259, in get_page
        page = response.content
    AttributeError: 'NoneType' object has no attribute 'content'

After switching back to my master branch, it all works fine. So there is a difference between how mechanize and the requests lib make their requests, and coursera.org does not like the latter. At least for me on OS X and Linux with Python 2.7.3.
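The `AttributeError` above means `response` was `None` by the time `.content` was read. A minimal sketch of a guard that turns this into a clearer error is shown below. It assumes a hypothetical `fetch` callable standing in for whatever performs the request, and `PageFetchError` is an illustrative name. This is not the script's actual `get_page`.

```python
class PageFetchError(Exception):
    """Raised when no response object is available for a URL."""


def get_page(fetch, url):
    """Return the body of `url`, failing loudly if no response came back.

    `fetch` is a hypothetical callable that takes a URL and returns either a
    response-like object with a `content` attribute, or None on failure.
    """
    response = fetch(url)
    if response is None:
        # Without this check the next line raises the AttributeError above.
        raise PageFetchError('no response received for %s' % url)
    return response.content
```

This does not explain why the response is missing, but it makes the sporadic failure easier to tell apart from the honour-code error.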

Wrt the token_url, I get that but still don't see the need for it. Why not use the class name as passed by the user (as per the original code)? This prevents possible breakage if the ml class disappears or gets renamed.

lsoliveira459 commented 10 years ago

> There is nothing special about that course. Looking closely it turns out that coursera replies with "Please use a modern browser with JavaScript enabled to use Coursera." Sporadically I also get:
>
>     line 259, in get_page
>         page = response.content
>     AttributeError: 'NoneType' object has no attribute 'content'

I'm a bit busy right now too, so I'll look into that a little later. Maybe next weekend.

> Switching back to my master branch and it all works fine. So there is a difference with how mechanize makes requests and the requests lib which coursera.org is not liking. At least for me on OSX and linux with python 2.7.3.

I wouldn't say it's something about Coursera, but just in case I'll identify the requests as coming from Firefox or something, and check whether the request leaves anything behind after the handler is closed. If it's an OS problem, I can't see how I could look into it.
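Identifying the requests as coming from a browser usually means sending a browser-like `User-Agent` header. A rough sketch, assuming an illustrative Firefox string and a hypothetical helper name `browser_like_headers` (neither comes from the script):

```python
# Illustrative User-Agent value only; any real browser string could be used.
FIREFOX_USER_AGENT = (
    'Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Firefox/24.0'
)


def browser_like_headers(extra=None, user_agent=FIREFOX_USER_AGENT):
    """Build a headers dict that presents the client as a browser.

    `extra` is an optional dict of additional headers; its entries override
    the defaults if the same header name is given.
    """
    headers = {'User-Agent': user_agent}
    if extra:
        headers.update(extra)
    return headers
```

The resulting dict can be passed to Requests as `requests.get(url, headers=browser_like_headers())`. Whether this is enough to avoid the "modern browser" reply is untested here.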

> Wrt the token_url, I get that but still dont see the need for it. Why not use the class name as passed by the user (as per the original code). This prevents possible breakage if the ml class disappears or gets renamed.

And what about the user mistyping the first class's name? The commented-out code I included, repeated below with a few notes, fetches a JSON document with information about all the classes (found by inspection) and loads it slowly (128 bytes at a time), searching for a link to a class page. I thought this was the most robust approach, though also a bit slower. I left it in as a contingency plan for the case you mentioned.

import requests

# CLASSES_URL is defined elsewhere in the script.
# Open a streamed connection so the body is downloaded lazily
all_classes_json = requests.get(CLASSES_URL, stream=True)

if all_classes_json.status_code == 200:
    JSON_TOKEN = b'"preview_link"'
    buf = b''
    TOKEN_URL = None  # stays None if the stream ends before a link is found
    # Iterate over the connection, retrieving 128 bytes at a time
    for chunk in all_classes_json.iter_content(128):
        buf += chunk
        try:
            # Get the indexes that delimit the URL we want
            i1 = buf.index(JSON_TOKEN)
            i2 = buf.index(b'"', i1 + len(JSON_TOKEN) + 2)
        except ValueError:
            # The information is not fully loaded yet; grab 128 more bytes and retry
            continue
        # All's fine
        TOKEN_URL = buf[i1 + len(JSON_TOKEN) + 2:i2].decode('utf-8')
        break
    all_classes_json.close()
else:
    print('Please make sure you are connected to the internet.')

lsoliveira459 commented 10 years ago

I replicated the error. I'll look into it now.

Have you seen this? (https://www.facebook.com/Coursera/posts/439625136155480)

dgorissen commented 10 years ago

closed after merging #100