dgorissen / coursera-dl

A script for downloading course material (video's, pdfs, quizzes, etc) from coursera.org
http://dirkgorissen.com/2012/09/07/coursera-dl-a-coursera-download-script/
GNU General Public License v3.0
1.73k stars 299 forks source link

Downloading download.mp4 #126

Closed qujiey closed 9 years ago

qujiey commented 10 years ago

Hi,

For some reasons, it seems the script will download some files as download.mp4 filenames

martinakos commented 10 years ago

This happens to me too.

dgorissen commented 10 years ago

Aware of this but unfortunately not something I have the time to dig into now. If you re-download does it happen consistently? (it should not)

Robottinosino commented 10 years ago

Happens to me too

Tino Sino

On 17 Mar 2014, at 22:33, martinakos notifications@github.com wrote:

This happens to me too.

— Reply to this email directly or view it on GitHub.

dgorissen commented 10 years ago

this has also come up on #100 @xu-cheng did you look into this further by any chance?

xu-cheng commented 10 years ago

@dgorissen I think it has nothing to do with that. It relates the process to get file name.

danmbox commented 10 years ago

For me it seems to happen when re-downloading, and only for some of the files

pr-ravi commented 10 years ago

Happens when re - downloading, and randomly on files. Also a negative percentage is shown, eg, status: -629145600% 68.4 KB/s

fmjrey commented 10 years ago

I can confirm this annoying bug happens when redownloading too, and percent value is a high negative value.

fmjrey commented 10 years ago

Just had an occurence where download.mp4 was not a duplicate of an already downloaded file. Maybe it was because last night I had to shutdown my machine before it finished downloading, and that I started the process again today, giving me even more download.mp4 files.

dgorissen commented 10 years ago

Again; sorry guys. I know this is an annoying issue. It came out of the port to requests away from mechanize. There is still a mechanize branch to use if this really is a problem but the branch itself is a bit outdated.

Better would be a pull request fixing this, I unfortunately no longer have the bandwidth.

danmbox commented 10 years ago

Is there any reason at all to stick to "requests"? Was there any real problem with mechanize?

dgorissen commented 10 years ago

This was done by @xu-cheng, resulted in python 3 compatibility and also a cleaner codebase.

danmbox commented 10 years ago

Those are bad reasons to break functionality in any project. Python3 in particular seems to have both low adoption rates and serious design flaws, see e.g. http://lucumr.pocoo.org/2014/5/12/everything-about-unicode/

I've recently had broken mp4's with HTTP headers inside, so maybe enough is enough. I'm surprised you accepted @xu-cheng's changes so easily on the trunk in fact when so many questions remained unanswered and even you noticed broken DLs in #100

xu-cheng commented 10 years ago

@danmbox all of these problems are NOT the result of using requests. The problem comes from coursera's end. At the time of my pull requests are merged, it works fine. The reason that moving to requests is essential is that mechanize has seized developing (last changed is two years ago with lots of unsolved issues) and python 3 support. Sorry, I don't agree in your opinion in Python 3. In fact Python 3 has lots of design advantages over Python 2, if you were really looking into the tech details. As the low adoption, it's the old story. As you should know, there will no be a Python 2.8. Python 2.7 is the end. And most of important Python packages have moved to Python 3. Fedora and other linux distro have set Python 3 as default Python interpreter. In addition, Python 3's unicode support is an important factor here when you deal with non-English coursera classes.

@dgorissen to solve all of these once for all, I suggest that we drop filename from HTTP header. Instead, we name the file from what we get in index.html. Therefore, it will have no naming issue, better support for Windows path length limitation and can be a huge help to connection robust. (Current problem mainly comes from GET method to get file name). If it's ok to you, I will do these in my free time, and do some experiments to check any changes from coursera's end.

danmbox commented 10 years ago

@xu-cheng the main problem is not the borked filenames, it's the broken file contents. I see HTTP headers inside files! I also see files that seem put together from chunks in the wrong order (PDF starts with 2nd half of the original file, then HTTP headers, then the 1st half).

Have you looked at what requests does under the hood? It even has its own connection pool. Perhaps setting some parameters to "safe" values would work better (e.g. single connection, close after each download)?

Alternatively, perhaps mechanize works well enough that it doesn't need major changes? Maybe we can have a simple abstraction layer if requests isn't up to the job yet, but will be in the future?

As far as py3, my point was that sacrificing functionality for python3 compatibility is a bad idea. I don't want to start flamewar, but Is there anything unicode-related that py2 cannot handle? Python3 tries to force unicode strings everywhere, even though some things are naturally byte arrays with no clear encoding -- e.g. in UNIX filenames (the same mistake that Java made a while ago). That's what the rant I linked to was highlighting (read the python3 cat example to see what writing a correct application involves; of course it's easy to write buggy apps that frustrate users).

xu-cheng commented 10 years ago

@danmbox Of course, I look deep into what requests does. The problem of broken file contents is the result of strange connection of coursera's end. And part of it is the result of current filename getting method. Now the code would try to get file name first through HTTP header. And coursera's server wouldn't return filename by HEAD method, so the code (both requests and mechanize) uses GET method to get filename. But still sometimes coursera doesn't return the right filename. And as the result of GET method, after we get the header, we close the connection. Somehow these manually closed GET connection will make the further coursera connection unstable. I suspect that couresra has some anti-batch download policy.

As for unicode, mechanize doesn't handle very well.

danmbox commented 10 years ago

So can you configure requests to not perform double-GETs (and use the filename from index.html)? And would that make the connections more "stable" (i.e. no chunks in wrong order, no HTTP headers in files)?

"unicode, mechanize doesn't handle very well" -- is this about file contents or file name? Again, you could use the file name from index.html.

xu-cheng commented 10 years ago

That's exactly what I am going to do.

danmbox commented 10 years ago

I see zero problems from mechanize branch, in recent testing. I have a script that checks for broken videos, PDFs containing HTTP headers, empty files (all the goodies that came along with requests).

vikasbucha commented 10 years ago

I had the exact same problem. submitted a pull request for quick fix.

nick-s-b commented 9 years ago

Is there a fix yet? This bug has rendered coursera-dl completely useless to be. It happens ALL the time. So annoying.

danmbox commented 9 years ago

Use the mechanize branch, it works perfectly.

@dgorissen, would it make sense to make mechanize the default branch? @xu-cheng has disappeared after destroying coursera-dl.

xu-cheng commented 9 years ago

I'm not disappeared. I'm just extreme preoccupy and busy.

dgorissen commented 9 years ago

Have set the default branch in github to mechanize. Will update the package in pip as well. It is behind a number of commits though. @danmbox would appreciate it if you could have a look at those, port over the ones you think are important, and issue a PR.

dgorissen commented 9 years ago

Ok, I found some time and backported a number of commits from master to mechanize & bumped the version. Please test. If I hear no complaints I will update the package on pypi

dgorissen commented 9 years ago

Package updated on pypi. Will close this issue. Feel free to reopen if anybody wants to tackle this on the 2.x master branch.