coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

Transcript URL is incorrect #148

Closed pmitros closed 9 years ago

pmitros commented 9 years ago

edX transcripts are at .../handler/transcript/translation/en, not .../handler/transcript/translationen. The latter URL works due to a bug in the Open edX codebase, combined with a matching bug in edx-downloader. Worth fixing:

  1. I doubt it will work forever....
  2. We do log and analyze what users do. If users go through a URL like this, the pipeline handles it a bit funny, and the model for those users might be a little bit off. Right now, this won't have any effect on user experience, but in the future, it might.
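A minimal Python sketch of the URL fix described above, assuming the downloader already holds the translation handler URL as a string. The function name `transcript_url` is hypothetical and is not taken from the edx-dl source; it only shows joining the handler path and the language code with exactly one slash.

```python
def transcript_url(translation_base: str, lang: str = "en") -> str:
    """Join a transcript translation handler URL and a language code.

    Hypothetical helper: guarantees exactly one "/" between the two parts,
    so ".../translation" + "en" never collapses into ".../translationen".
    """
    return translation_base.rstrip("/") + "/" + lang.lstrip("/")
```

Stripping slashes on both sides means the result is the same whether or not the base URL arrives with a trailing slash.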

By the way, thanks for giving a nice and correct user agent.

iemejia commented 9 years ago

Hello,

Wow !, I am impressed we are talking with one of the creators of edx. Thanks for your creation :). It is nice to see that you care about our tiny 'lo-fi' script. Thanks a lot for the report, this bug was my mistake. I just created a small fix.

I would love to see if we can collaborate somehow, e.g. to have a more convenient process to download the material (videos, pages, etc.), or to know in advance when some of the APIs or URLs will change, because our script is quite hacky and it breaks every time there is a change in the platform.

Now that the edx platform is open we can follow the code, but I think that a channel of communication / collaboration would be a good additional idea and it will save us some 'reverse engineering' time.

Now that you mention the user-agent, I would love to know how many people use our script to download the courses. I don't expect you to have the time to tell us, and I don't think it is a big number, but still it would be interesting to know :).

Thanks again for your creation !

-Ismael

pmitros commented 9 years ago

I'll give approximate numbers which I can grab from aggregate data: Nearly 300,000 hits from edx-downloader from a bit over a thousand unique IPs. Users do log in from multiple IPs, and do share IPs, but that should give a ballpark. Doing this on a per-user basis would be a little bit cumbersome (at least on a weekend).

We are building out a range of APIs for this kind of access, which should be a bit more stable. Of particular interest is probably the mobile client APIs:

http://edx-platform-api.readthedocs.org/en/latest/endpoints.html

Do keep in mind that these are alpha, and not stable. But also do keep in mind we do have users on our mobile app accessing courses through these APIs, so breaking them is a little expensive, so hopefully shouldn't happen too often.

It is an open source project and platform, so if you do want to collaborate, the best way to do so is through the open source process. A good overview is at:

http://edx-developer-guide.readthedocs.org/en/latest/process/overview.html

The short version is that if you want to make a change, post to our mailing list before coding it up, and see what people think:

https://groups.google.com/forum/#!forum/edx-code

If you want, feel free to cc: the relevant developers (git blame is nice for figuring out who wrote a file) if their e-mail addresses happen to be public (if not, this isn't too important). In this case, I'll mention Dave Ormsbee is a good person to include -- he thinks a lot about how we'll evolve APIs. He also does follow edx-code.

If it looks good to people, code it up and make a pull request:

https://github.com/edx/edx-platform

If you'd like preliminary review before finishing, mark it as WiP, and tag the relevant people.

iemejia commented 9 years ago

Thanks a lot for the stats, the API information and the other links (I didn't even know that there is a mobile app now). I have to check the official API to see in detail what is available, but it seems there is almost everything we need. Nice.

I have to check in more detail the other edx-related projects, but I expect that the growing ecosystem will give us ideas to improve our tool. Thanks again and don't hesitate to put more issues or contact us in case of further improvements or bugs.

pmitros commented 9 years ago

The mobile app has intentionally not been widely advertised yet. Only some courses are mobile-enabled right now, and the app is a bit past beta, but we're intentionally not going for a broad deploy until we've moved things along a little bit further.

What I'd actually love to see is an RSS feed of course videos, and have that feed into the mobile app (instead of the current API), edx-downloader, as well as any other RSS reader. It would let edX courses be advertised on dozens of podcasting services like iTunes. I did a quick prototype, but unfortunately, edX doesn't have capacity to do something like this at production quality for the foreseeable future.

iemejia commented 9 years ago

Why is it hard to do the RSS feed at production quality? Priorities? Or lack of resources (hardware: heavier server use / human: programmers, more maintenance)? I ask because it doesn't seem that difficult to me to do an RSS feed per course. On the other hand, if you do personalized RSS feeds, e.g. user * course + video viewing progress, it can get harder to do, but still not too hard with what already exists, or am I missing something?

One functionality I have always thought could be improved is the way to follow your progress. The current section is hidden, and I would like to have that feedback when I arrive at the dashboard, plus some kind of alert to motivate me to keep up. I think the progress section of edx is probably the reason I tend to obsess and at least achieve the minimum passing mark (which I have way more trouble doing on coursera). Having this functionality early, or at all times (think for example of a progress bar like unread messages on facebook), could be really useful.

And now that we are discussing functionality, do you have plans to improve the discussion forums? I have always found them hard to follow, and I noticed they are not in the API. Are discussions going to be part of the mobile app anytime soon? I can understand that this is something harder to do (and way harder to put in an open public API), but discussions are really important in the learning process.

pmitros commented 9 years ago

RSS feed is straightforward and acknowledged as a good idea. It's just that everyone has much higher priorities right now, so there really isn't anyone to do it. I kind of floated it out there on the off-chance you (or someone else in the open source community) might be interested. If someone were to build this, I'd be glad to review a PR. There was a proof of concept (http://podcasts.edx.org/, https://github.com/pmitros/edxml-tools/blob/master/make_course_rss.py), but it was a one-off hack to show the value, and architecturally, it is not a starting point for adding this into the platform as APIs.
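As a rough illustration of what a per-course video feed could look like, here is a small Python sketch that builds an RSS 2.0 document with one `enclosure` per video. It is unrelated to the proof of concept linked above and is not platform code; the function name `build_course_feed` and the shape of the `videos` list (dicts with `title`, `url` and an optional `length` in bytes) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def build_course_feed(course_title, course_url, videos):
    """Return an RSS 2.0 document (as a string) listing course videos.

    Hypothetical input: `videos` is a list of dicts with "title", "url"
    and optionally "length" (size in bytes, 0 if unknown).
    """
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = course_title
    ET.SubElement(channel, "link").text = course_url
    ET.SubElement(channel, "description").text = "Video lectures for " + course_title

    for video in videos:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = video["title"]
        # Podcast clients read the media file from the <enclosure> element.
        ET.SubElement(
            item,
            "enclosure",
            url=video["url"],
            length=str(video.get("length", 0)),
            type="video/mp4",
        )
    return ET.tostring(rss, encoding="unicode")
```

A feed like this is static per course, so it could be served from a cache; the hard part discussed in this thread is wiring it into the platform, not producing the XML.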

Progress tracking and better indication of what assignments are unfinished and soon due would be a big win, but the way the platform is structured, would be a pretty large chunk of work. I have a few UX mockups along those lines, but that's about all.

Forums are an area of on-going improvement, but they do have a long way to go. I don't think that's a top priority for mobile right now -- as a step zero, mobile needs to support all courses. As a step 1, mobile is still missing basic functionality like assessments, registration, etc. Until those are in place, it's a nice companion to the web site, but for now, that's all it is. The mobile team also has some ambitions to build out much richer social experiences on mobile, but those are at least a few quarters away.

iemejia commented 9 years ago

I am closing this issue since the fix is already there. Thanks again @pmitros for reporting, and I hope we keep in touch for any new development or advice.

iemejia commented 9 years ago

Hello again @pmitros,

I write to you again since I have two questions to continue our previous discussion:

  1. Are you still interested in this RSS-like functionality? I have a bit more time now than before and I am interested in programming it if it has not been built yet. Is it still needed?
  2. I have been improving some parts of edx-dl in the last few days, and one of my changes downloads HTML pages and extracts their resources in parallel. This creates on average 20 simultaneous connections per user. I imagine that you are rate-limiting your resources, and that those are generated statically, so I think there should not be any problem. However, I prefer to ask whether you think this is an issue (since we don't want the script to get banned or something). If it is an issue, please tell me what a decent rate limit would be on the client side (our side) and we will apply it.

Cordially, -Ismael

pmitros commented 9 years ago

@iemejia

  1. rss functionality would still be very nice. If you have time, we could provide some guidance for how to do this.
  2. Number of connections per user is not an issue with videos. It is an issue with any pages in the course itself. Those are not static, and actually require a decent bit of computational power to generate (I can go into detail why this is, but on a high level, the LMS is designed to support a fair bit of dynamic content for things like randomized control trials and similar). 20 parallel connections seems high. I don't have guidance on number of parallel connections, but I do on number of requests per second. This should be on the order of what a human might generate (I'd say no more than a few per second).
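The "few requests per second" guidance can be applied on the client side even when downloads run in parallel. Below is a minimal Python sketch of a thread-safe limiter that spaces requests evenly; the class name `RateLimiter` and its parameters are hypothetical and not part of edx-dl, and the value of 2 requests per second in the usage note is only an example of "a few".

```python
import threading
import time


class RateLimiter:
    """Space out calls so that at most `max_per_second` start per second.

    Hypothetical helper: one instance is shared by all worker threads, and
    each worker calls wait() before issuing a request for a course page.
    """

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_allowed = clock()

    def wait(self):
        """Block until this caller's slot arrives; return the delay applied."""
        with self._lock:
            now = self._clock()
            delay = max(self._next_allowed - now, 0.0)
            # Reserve the next slot before releasing the lock, so that
            # concurrent callers queue up one interval apart.
            self._next_allowed = max(now, self._next_allowed) + self._interval
        if delay > 0:
            self._sleep(delay)
        return delay
```

With `limiter = RateLimiter(2)` shared across a thread pool, 20 workers can still exist, but page requests leave the client no faster than two per second. Video downloads, which the comment above says are not a concern, could skip the limiter.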

An RSS feed would alleviate most of the performance issues here.

Rate limiting is not part of the platform. It may be part of the deployment -- I don't follow dev-ops closely -- but if we do have it on edx.org, that may not be the case for all Open edX deploys (we've got around 100 right now). The ecosystem is important. Between China, Saudi, Jordan, Mexico, France, Stanford, etc., I wouldn't be surprised if non-edx.org Open edX were bigger at this point than the main edx.org.

iemejia commented 9 years ago

Cool, I am going to start checking the RSS implementation you did before and I will let you know. In the meantime, could you explain why that one was not enough, and what the big goal beyond it is?

About the second point, do you think 8 req/s per user is OK by default? Users can still change this value and I can't do anything to restrict that. I have been thinking about adding a local cache to avoid hitting the site every time, and for development reasons. Anyway, I think the average user does not run the script many times on the same resources, and I don't expect our number of requests to be so high (since we are a really small percentage of the users of edX).
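The local cache idea could look roughly like the following Python sketch, which stores each page body on disk under a hash of its URL and only calls the network function on a miss. The name `cached_fetch`, the `fetch` callable and the cache directory layout are assumptions for illustration, not existing edx-dl code, and the sketch has no expiry, so stale pages would need to be cleared by hand.

```python
import hashlib
from pathlib import Path


def cached_fetch(url, fetch, cache_dir):
    """Return the body for `url`, reading from a disk cache when possible.

    Hypothetical helper: `fetch` is any callable taking a URL and returning
    bytes (for example a wrapper around the script's HTTP client).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the URL so that it is always a safe, fixed-length file name.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    body = fetch(url)
    path.write_bytes(body)
    return body
```

Repeated runs over the same course pages would then hit the server once per page, which matters most for the dynamically generated course pages mentioned earlier in the thread.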

iemejia commented 9 years ago

(I edited the value from 5 to 8 req/s to have a multiple of 2, for what it is worth.)

pmitros commented 9 years ago

A good RSS implementation would consist of:

This would be part of the core platform code.

The current RSS implementation is completely outside of the platform:

The goal was to see just the level of publicity we could generate for edX by having courses in podcast aggregators such as iTunes, as well as how well this worked in RSS viewers. Both worked well, but the implementation has zero or close to zero reusable code.

I'll tag @ormsbee in case he has any interest or insights around the discussion.