Course Tree Hierarchy and download unification

coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)

GNU Lesser General Public License v3.0

1.93k stars 638 forks source link

Course Tree Hierarchy and download unification #283

Open iemejia opened 9 years ago

iemejia commented 9 years ago

Hi,

I add @rbrito and @balta2ar messages from a previous commit to give a bit of context and don't lose the subject in a commit message:

iemejia commented 9 years ago

@rbrito said: : I talked some time ago with @iemejia about not treating the youtube urls as separate and just putting them on the list of resources, but he didn't agree.

I can see his reasoning, as, usually, some courses have the youtube video as their "primary thing". On the other hand, it may be worth experimenting with creating these classes in something like a tree, as in:

each Course contains a list of Sections
each Section contains a list of SubSections
each Subsection contains a list of Unit
each Unit contains a list of "Things"

Here, "Things" would be any kind of resource, like Youtube URLs, links to PDFs, links to source code, spreadsheets etc. that the instructors may happen to leave for the students.

This way, we capture the essence of what we scrapped from various pages in our own tree structure and we can traverse it as much as we please.

Perhaps we can accomodate @iemejia's desire of distinguishing some URLs by having multiple kinds of "Things", perhaps a "YoutubeVideo" or something like that?

I would sincerely love to have the tree structure as I mentioned above, especially if we provide repr and str methods to the classes, which will be super helpful for debugging purposes.

For instance, I would love to be able to print one course (say, print(cur_course), where cur_course is a Course) and have all the tree structure dumped, so that I can diff two runs of the script and see if we are being deterministic, if edX may be randomizing what they are supplying us, if we are indeed grabbing things more than once (see my previous comments of the script downloading some videos multiple times) etc.

In fact, one of the worst "offenders" to this downloading the same video multiple times was Stanford's Convex Optimization course (recommended, BTW!). Each week, there was only 1 video with more than 1 hour and there were some weeks where we were capturing the same video more up to 9 times.

Printing a course and doing a post-morten analysis would be super helpful to debug this kind of behavior. Also, now that we started using the logging module, we should be littering the code with those, so that we ask users to enable the debug command line option and see the difficulties that they are experiencing.

Comments? If any of you think that what I wrote above is helpful/worthwhile, feel free to copy the parts and file an issue with my observations.

Thanks,

Rogério.

iemejia commented 9 years ago

@balta2ar said:

I like the idea of having a tree. Thinking about that in my favorite "merge (edx + coursera)-dl" context, we could generate trees structured differently for different platforms, but have a single downloader module for all of them.

iemejia commented 9 years ago

I insisted in treating the youtube urls differently since they were different resources not in the logical sense, but explicitly in the way both edx-dl and OpenEdX deal with them. They are (or at least were) extracted differently and they are (and will) be downloaded differently:

I agree that from a logical sense the whole tree structure proposed by Rogerio, and I think we must go ahead and change it, that resolve the debugging problem Rogerio mentions, however this can be a good idea to make the structure uniform, but to do this, we need a way to deal with the download cases, since Video and 'Things' (which we can call 'Resources') have a different download logic:

Resources (normal URLs for videos, pdfs, etc) need two things: the url and the destination path, (eventually they may need the cookie information for resources hosted in edx), For that reason they are trivial to export and get with a download manager.
Videos, and particularly youtube videos are trickier, they need a different download process (call to youtube-dl and its args), and also they download the subtitles using the OpenEdX JSON API (which differs of a normal download too).

So maybe we need a hierarchy that takes advantage of the classes.

each Course contains a list of Sections each Section contains a list of SubSections each Subsection contains a list of Unit each Unit contains a list of "Things"

And Resources only contain an URL, but they MUST be of different 'types': Resource -> [StaticResource, YoutubeResource, EdXSubtitleResource] so the downloader can deal in the appropriate way with them.

iemejia commented 9 years ago

@baltasar idea is really good, since that would be a really easy to merge point for both projects, since downloading is mostly equal, however we have to consider the cases discussed, and also see that edx-dl does not depend in the requests library (which in principle is the reason why I think it does not have the recent SSL bug many times reported in coursera-dl)

iemejia commented 9 years ago

Thinking a bit about it, the current data structures in the script represent too different things:

The resources parsed from the website

Course -> Section -> SubSection -> Unit -> [Video, Resource]

which we can integrate and represent fully with the proposed changes as:

Course -> Section -> SubSection -> Unit -> Resource

The selections of the user so they can be downloaded

selections = {Course: [Section]}

and the download index:

all_units = {Subsection.url: [Unit]}

Currently the download process iterates each selection and for each subsection it extracts their Units, building the tree does not seem hard from here, which can help debugging (even at the cost of repeated information), but for me the important portion is to reduce the complexity of the downloading process.

I will try to write a patch later on and you tell me what you think.

balta2ar commented 9 years ago

which in principle is the reason why I think it does not have the recent SSL bug many times reported in coursera-dl

Judging from the number of stargazers, audience of edx-dl is order of magnitude less than of coursera-dl. Maybe that's why there were no reports about SSL in edx-dl.

And Resources only contain an URL, but they MUST be of different 'types': Resource -> [StaticResource, YoutubeResource, EdXSubtitleResource] so the downloader can deal in the appropriate way with them.

Yes, yes and yes. That's a very good idea to separate resources, I second that.

Below are my thoughts on different topics, let them be in this discussion.

Tree

There are many reasons to have a unified course tree. First, it's a very logical way to represent a course structure. The tree, as it's been said many times already, can be printed and compared, it can be used to find and eliminate duplicates. At the moment we have PageExtractor which extracts units. I think extractor should extract the whole tree, so that we have a module that can transform course name/link into a tree in a single place. We should probably think about separating code that makes network requests from the code that parses the pages.

Downloader

Downloader should know nothing about the course structure. It should only know which URL to download and where to save it (some credentials are required, e.g. cookies for downloading). It feels that input to the downloader should be something as simple as a list of tuples or better something iterable (in general case). Maybe having an iterator of the course tree is a good idea. For the downloader, the structure of the course is built up from filenames.

Performance

Before moving to tree extractor, we should not forget about performance. Specifically, units are extracted in parallel. Can we support that in tree extractor? Can be parallelize other parts of extraction process and keep it simple? A pool of producers/consumers might help here, but that might also be unnecessary over complication.

Types and data structures

I would like to stress the importance of having types and structures in the program. In Python it's very easy to end up with a bunch of dicts, lists, tuples and sets passing around in a program but in my experience that leads to incomprehensible code.

Generally, I like the code in edx-dl, how it's commented, and how things are named.

Maybe @shk3 can also join the discussion?

iemejia commented 9 years ago

We should probably think about separating code that makes network requests from the code that parses the pages.

This is already separated, but probably we have to make it more explicit with some refactoring.

Downloader should know nothing about the course structure. It should only know which URL to download and where to save it (some credentials are required, e.g. cookies for downloading). It feels that input to the downloader should be something as simple as a list of tuples or better something iterable (in general case). Maybe having an iterator of the course tree is a good idea. For the downloader, the structure of the course is built up from filenames.

Amen. We are really aligned in this subject, if you check the current download functions go in that direction, actually I tried to move the downloader class from coursera just to finish this, but I desisted when I saw the requests dependency. Any help here to write a more robust (and compatible) downloader is really welcomed.

Before moving to tree extractor, we should not forget about performance. Specifically, units are extracted in parallel. Can we support that in tree extractor? Can be parallelize other parts of extraction process and keep it simple? A pool of producers/consumers might help here, but that might also be unnecessary over complication.

Yes, I wouldn't like to change the current version, I think we can keep it simple as it is at the price of the repeated data structure (all_units) which would contain a flat view of the subsections in the tree (as it already does).

Generally, I like the code in edx-dl, how it's commented, and how things are named.

Thanks, we are not perfect yet, there are some rough spots (like the ones discussed here (that appeared thanks to a heavy refactoring work), but it is impressive to see the evolution of this project in the last 6 months, I am sure you wouldn't have said the same about the code some months ago :).

balta2ar commented 9 years ago

Any help here to write a more robust (and compatible) downloader is really welcomed.

Could you share more details on what exactly needs to be done with downloader? You sound like it might be a better idea to port downloader from edx to coursera, as the former one does not use requests.

iemejia commented 9 years ago

The current downloader is basically the simplest downloader I could implement, you can find the code in the file edx_dl, in the download_url function, as you can read, this one lacks in many senses:

It does not recover from partial downloads
It does not support retries in case of error
It doesn't even detect which files were not downloaded (to retry)
It doesn't do post download verification (e.g. that the size is the expected one, not to talk about checksums or the return code of the invocation of youtube-dl).

Maybe I am missing some other functionalities, but those are the ones that come to my mind from the reported issues, anyway the idea is to have a robust downloader but not to compete with the heavy downloaders since this is way out of the goals of this project (to have a full wget or aria2 set of options).

Also I would like to refactor some of this code to a module that deal with download specific details, so the idea of moving the downloader makes sense mostly from coursera-dl to edx-dl, because the coursera version is more modular and already has a hierarchy where we could add an implementation that does not use requests, and create a youtube-dl based implementation (for the youtube videos).

Another alternative could be to add the requests library to edx-dl (which I am not against, but I consider that this requires more work since authentication code must be rewritten to pass the cookie information to the downloader), but this can have other advantages.

https://github.com/coursera-dl/coursera-dl/blob/master/coursera/downloaders.py

One extra thing that we will win is that it would be relatively easy to support external downloaders as coursera-dl does, this is not something I really care about but apparently this is a feature that some people prefer, but well, more features = more maintenance, but sharing the code can help reduce the pain, so maybe we could add those.

ormsbee commented 8 years ago

@iemejia, @balta2ar: Hi folks. I'm a developer at edX, and I've been noticing a steady uptick in people using edx-dl lately. It mostly comes up in our logging because doing so many parallel requests across units in the same sequence causes write contention for the sequence state (since we save a student's "current" position).

Have you looked at the Course Blocks API by any chance? It should be able to get you the summary, structure, and video information a lot more conveniently than doing a crawl of the courseware HTML. It's used by our mobile app, so there's probably a good deal of overlap in terms of the use case. If that API is missing something that you need, it might be possible to add it in a PR (we are an open source project, after all).

Anyhow, if you're interested, please reach out to me at dave@edx.org. We can talk about how you could get the information you need in a way that is faster for your users, gentler on our servers, less noisy for our researchers, and more robust for everyone over the long run. :-) FWIW, there are many unconventional things that power users can do with edX course structure -- DAGs instead of trees, hierarchy patterns that don't match the usual chapter/section/sequence, etc.

Take care.

iemejia commented 8 years ago

Hello, thanks for writing, it is nice to know you follow our work and we are open to fix edx-dl to be a good citizen of OpenEdX community (e.g. we have already reorganize parts of the code to be able to rate limit some stuff).

We have checked the API when it was in Beta, but nobody has taken the job of writing a decent REST integration (thanks @balta2ar for creating the issue). I agree with you, it is true that it would be nicer than the current approach, but on the other hand we probably would need to support the two (at least for the other openedx sites who don't support the API yet). @ormsbee Is there already a python client for the REST API ? (and preferably on pip). Because if this is the case we could just write the integration in short time.

And just to cover the 'state' case you mention, if we access the stuff via REST it would be solved ? Or do you want us to indicate some how that we don't want to alter the sequent state ? In some previous message one of your colleagues mentioned that you could recognize our requests because we identify our requests via the 'User-Agent': 'edX-downloader/0.01', so we can agree and you would ignore this agent. Like this the state, and your stats won't be changed by edx-dl. Or if there is another fix you think it is better we will implement it.

Thanks again for writing and thanks for offering your help we will for sure keep in touch once we write the REST integration.

balta2ar commented 8 years ago

@iemejia @ormsbee @rbrito Guys, let's move the discussion regarding the API to the dedicated issue #377. Thanks!