coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Integration with edx using edx-platform-api #377

Open balta2ar opened 8 years ago

balta2ar commented 8 years ago

@ormsbee wrote:

@iemejia, @balta2ar: Hi folks. I'm a developer at edX, and I've been noticing a steady uptick in people using edx-dl lately. It mostly comes up in our logging because doing so many parallel requests across units in the same sequence causes write contention for the sequence state (since we save a student's "current" position).

Have you looked at the Course Blocks API by any chance? It should be able to get you the summary, structure, and video information a lot more conveniently than doing a crawl of the courseware HTML. It's used by our mobile app, so there's probably a good deal of overlap in terms of the use case. If that API is missing something that you need, it might be possible to add it in a PR (we are an open source project, after all).

Anyhow, if you're interested, please reach out to me at dave@edx.org. We can talk about how you could get the information you need in a way that is faster for your users, gentler on our servers, less noisy for our researchers, and more robust for everyone over the long run. :-) FWIW, there are many unconventional things that power users can do with edX course structure -- DAGs instead of trees, hierarchy patterns that don't match the usual chapter/section/sequence, etc.

Take care.

Moving discussion here from https://github.com/coursera-dl/edx-dl/issues/283#issuecomment-231939554.

balta2ar commented 8 years ago

@ormsbee Hello, David! Thank you for getting in touch with us!

For starters, I think that accessing the platform using a documented API is certainly the preferred way to go. If, on top of that, we both benefit from it, I'm all for it. I glanced over the API, and several things are unclear to me, if you don't mind:

  1. I'm no OAuth2 expert and this part is very briefly mentioned in the docs, to my taste. Usually, services provide some more details on how to integrate with them using OAuth2. Of course there is no point in duplicating the whole draft but from what I've seen it was very convenient to read the gist of it in services' docs with examples instead of ploughing through the draft.
  2. What is the exact API server for edx.org? I've tried accessing https://www.edx.org/api/courses/v1/blocks?course_id=edX/DemoX/Demo_Course%20&all_blocks=true&requested_fields=graded%2Cformat%2Cstudent_view_multi_device in a browser, but it returned 404. It probably needs OAuth2, but is there a way to play with the API in a browser? For example, it was very useful when I was researching Coursera API.
  3. How universal is this API across edx platform "customers"? If I get it right, anyone can take the platform, modify it to her own taste and use it, so there is no guarantee the API is the same (or whether it's present/active) on other MOOC platforms that use edx under the hood. For example, could you please take a look at this site? https://openedu.ru/ I have a feeling they are using edx but when I tried integrating it, I found out they have slightly different HTML tags so it wasn't a piece of cake. Does this mean we may need to fall back to regular HTML scraping if there is no API?
  4. Is there an actual Python implementation package for the API? That would be amazing, but there seem to be only docs.

All in all, I'm interested in seeing a clean API in edx-dl. @rbrito Please join our discussion.

ormsbee commented 8 years ago

@iemejia wrote:

@ormsbee Is there already a Python client for the REST API (preferably on pip)? If so, we could write the integration in a short time.

I'm afraid there isn't at this point. Sorry.

And just to cover the 'state' case you mention: if we access the content via REST, would it be solved?

I think it would for most of your use cases. If you can grab what you need of the structure and content from one request here and then the rest of your requests are to linked static assets like videos, PDFs, and transcripts -- then we should avoid most student state writes.
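A minimal sketch of that pattern, assuming one Course Blocks response has already been fetched and parsed: it pulls out the static video URLs so that every later request goes to an asset rather than to the courseware. The JSON shape used here (a top-level `blocks` mapping, with `student_view_data` and `encoded_videos` on video blocks) is an assumption that this thread does not confirm, and `collect_video_urls` is a hypothetical helper, not part of edx-dl or of any edX package.

```python
def collect_video_urls(blocks_response):
    """Return {block_id: {"name": ..., "urls": {profile: url}}} for video blocks.

    ASSUMED response shape (not confirmed in this thread):
    {"blocks": {block_id: {"type": "video",
                           "display_name": "...",
                           "student_view_data": {
                               "encoded_videos": {profile: {"url": "..."}}}}}}
    """
    videos = {}
    for block_id, block in blocks_response.get("blocks", {}).items():
        if block.get("type") != "video":
            continue  # chapters, sequentials, problems, etc. carry no video data
        view_data = block.get("student_view_data") or {}
        encoded = view_data.get("encoded_videos") or {}
        videos[block_id] = {
            "name": block.get("display_name", ""),
            # Keep only the profiles that actually carry a URL.
            "urls": {
                profile: info["url"]
                for profile, info in encoded.items()
                if info.get("url")
            },
        }
    return videos
```

Everything after this step would be plain downloads of the collected URLs, which is the part that should avoid student state writes.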

Or do you want us to indicate somehow that we don't want to alter the sequence state? In a previous message, one of your colleagues mentioned that you could recognize our requests because we identify them via the 'User-Agent': 'edX-downloader/0.01' header, so we could agree that you would ignore this agent. That way, neither the state nor your stats would be changed by edx-dl. Or, if there is another fix you think is better, we will implement it.

I'm honestly not sure yet. As you point out, we could do this on our side and just flag the user agent that you send as not needing to save user state. But I'm not sure of the best way to route that knowledge through the existing code without causing unwanted side-effects or making things more complicated for authors who are writing XBlocks for our platform. I'll think about this some more. In any case, using the REST API would side-step most of these concerns.

@balta2ar wrote:

I'm no OAuth2 expert and this part is very briefly mentioned in the docs, to my taste. Usually, services provide some more details on how to integrate with them using OAuth2. Of course there is no point in duplicating the whole draft but from what I've seen it was very convenient to read the gist of it in services' docs with examples instead of ploughing through the draft.

Yeah, apologies for that. We're working on a fairly major revamping of that stuff, but I'm not sure when it will land. That being said, for this particular API call, you could get it via the session as you already do.

What is the exact API server for edx.org? I've tried accessing https://www.edx.org/api/courses/v1/blocks?course_id=edX/DemoX/Demo_Course%20&all_blocks=true&requested_fields=graded%2Cformat%2Cstudent_view_multi_device in a browser, but it returned 404. It probably needs OAuth2, but is there a way to play with the API in a browser? For example, it was very useful when I was researching Coursera API

So www.edx.org is the site that people go to in order to find the courses, but the courseware itself runs on courses.edx.org. Also, all_blocks=true isn't a flag that you'd be able to use, because it's meant to be used by other services at edX. Saying all_blocks=true means "ignore all access checks" and is useful if you have something like edX Insights (our analytics tool for course staff), which wants to just pull in a complete outline of the course.

In any event, please enroll in VideoX and try the following link, substituting your username in the appropriate param:

https://courses.edx.org/api/courses/v1/blocks/?course_id=course-v1%3AedX%2BVideoX%2B1T2016&username={your_username_here}&depth=all&student_view_data=video

It should load on your browser. To get it to spit back raw JSON, send the appropriate accept header. Please look at what it has and let's talk about what's missing from your point of view.
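As a rough illustration of the call described above, the sketch below builds the same URL from its parts and attaches an `Accept: application/json` header. It only constructs the request and does not send it; authentication (for example, reusing the existing logged-in session) is left out. The helper names `build_blocks_url` and `build_blocks_request` are hypothetical, and `application/json` is assumed to be the "appropriate accept header".

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the link above.
BLOCKS_ENDPOINT = "https://courses.edx.org/api/courses/v1/blocks/"


def build_blocks_url(course_id, username):
    """Return the Course Blocks URL for one course and one enrolled user."""
    params = [
        ("course_id", course_id),        # e.g. "course-v1:edX+VideoX+1T2016"
        ("username", username),          # the enrolled user's username
        ("depth", "all"),                # walk the whole course tree
        ("student_view_data", "video"),  # include video details per block
    ]
    return BLOCKS_ENDPOINT + "?" + urlencode(params)


def build_blocks_request(course_id, username):
    """Build (but do not send) a request that asks for raw JSON."""
    return Request(
        build_blocks_url(course_id, username),
        headers={"Accept": "application/json"},  # assumed header value
    )
```

Sending the request would still require the caller's session cookies or other credentials, which this sketch deliberately leaves to the existing login code.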

How universal is this API across edx platform "customers"? If I get it right, anyone can take the platform, modify it to her own taste and use it, so there is no guarantee the API is the same (or whether it's present/active) on other MOOC platforms that use edx under the hood. For example, could you please take a look at this site? https://openedu.ru/ I have a feeling they are using edx but when I tried integrating it, I found out they have slightly different HTML tags so it wasn't a piece of cake. Does this mean we may need to fall back to regular HTML scraping if there is no API?

It's true that people can customize as they see fit. Most people who use the platform use one of our regular releases as a starting base (the latest stable is Dogwood, the next will be Eucalyptus). This API came out with Dogwood, and I don't know how many sites running edx-platform are on that release. That being said, people are much more likely to customize the front end than they are to customize this particular API, so it's probably your better bet in the long run anyway.
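One way a client might cope with sites on older releases is to try the API first and fall back to scraping only when the endpoint is missing. The sketch below is an assumption about how that could be wired up, not something either side in this thread has committed to: `load_course_outline` and its two callables are hypothetical, and treating an HTTP 404 as "API not available" is a guess about how such sites would respond.

```python
from urllib.error import HTTPError


def load_course_outline(fetch_via_api, fetch_via_scraping):
    """Return (source, outline), preferring the API over HTML scraping.

    Both arguments are zero-argument callables supplied by the caller.
    ASSUMPTION: a site without the API answers with HTTP 404.
    """
    try:
        return "api", fetch_via_api()
    except HTTPError as err:
        if err.code != 404:
            raise  # auth failures, server errors, etc. are not a "missing API"
        return "scrape", fetch_via_scraping()
```

Keeping the two fetchers as plain callables means the scraping path stays untouched and is only used when the API path is unavailable.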

Is there an actual Python implementation package for the API? That would be amazing, but there seem to be only docs.

Unfortunately, no.


FWIW, I plan to do work on our side to better handle the scraping as well. This is one of those situations where we're trying to be holistic and push on a few different fronts at once. I wanted to work closely with you folks since this seems to be the most popular edX downloading application, and anybody writing crawlers in the future will likely study your code as a starting point. 😄

Thank you!

iemejia commented 8 years ago

Mmm, I just found something interesting: it seems edx has a Swagger specification for its API! https://github.com/edx/api-manager/tree/master/swagger If this is up to date, maybe we can use https://github.com/mission-liao/pyswagger to automatically generate all the needed artifacts. This could even be contributed to the edx project as a Python client!

ormsbee commented 8 years ago

Just a warning: The api-manager stuff is where we're trying to organize an API gateway. It will eventually have all our public APIs in it, but I don't think the course blocks stuff has made it in yet.

ormsbee commented 8 years ago

Okay folks, someone at edX brought it to my attention that scraping and bulk download of any sort is prohibited by our Terms of Service:

Furthermore, you agree not to scrape, or otherwise download in bulk, any Site content, including but not limited to a list or directory of users on the system, on-line textbooks, User Postings or user information.

I'm a developer. I saw a performance issue and immediately reached for a technical solution on my own initiative. However, there are privacy and third-party content licensing concerns that play into this. I don't think you folks download any forum/user data, which is good (that is a huge red line for us -- please don't ever start). But course content in general is a complicated issue, and widespread use of this tool may inhibit our ability to bring these materials to our students in the future.

So as not to give the impression that edX endorses the use of edx-dl, I'm going to stop commenting on this issue. My apologies for any confusion I might have caused.

iemejia commented 8 years ago

Thanks for clarifying this. Don't worry, we know the rules, and knowing that scraping is discouraged, we will probably move faster toward using the API. And of course we know that you do not endorse tools like this one, but it is still nice that we can have an open discussion here.

One thing that I think the authors of this tool share is respect for the platform and the content creators. The initial reason we created this tool was to help us follow the courses and use the material when we are offline or have limited connectivity. This has been particularly useful for users in developing countries, and we have received many emails telling us how important this tool is.

Remember that edx-dl wouldn't even exist if the edx platform allowed downloading the materials, but this is not always the case, so here we are. Anyway, thanks a lot for sharing your opinions, and please tell the devs of the platform that, in any case, we are open to fixing our tool to be good edx citizens.