coursera-dl / edx-dl

A simple tool to download video lectures from edx.org (and other openedx sites)
GNU Lesser General Public License v3.0

downloaded pdfs are corrupt #462

Open cortinah opened 7 years ago

cortinah commented 7 years ago

Downloaded pdfs are corrupt

I am able to download all course materials. The .mp4 videos work fine, but the .pdf files are corrupt.

Your environment

Steps to reproduce

Course URL: https://courses.edx.org/courses/course-v1:CaltechX+CS1156x+3T2017/course/

./edx-dl.py -u xxxx https://courses.edx.org/courses/course-v1:CaltechX+CS1156x+3T2017/course/

Expected behaviour

pdf files should be able to be read.

Actual behaviour

pdf files cannot be opened by Preview on Mac or by the pdf reader on Linux Mint.

Mela commented 7 years ago

I'm encountering the same issue. Looks like edx has blocked the access via edx-dl:

"The owner of this website (prod-edxapp.edx-cdn.org) has banned your access based on your browser's signature"

balta2ar commented 7 years ago

@Mela Could you please show a screenshot of that message?

It is sad that edx resorts to such measures (I've never noticed anything like that from Coursera, for example). We've had several contacts with edx employees (#392, #377), but the discussions eventually led nowhere. Unfortunately, it looks as if this tool is written in such a way that it puts undue load on edx servers. It probably should be revamped and overhauled, but that can't be done instantly, and it looks like the primary contributors don't have the opportunity to do that right now.

From the practical side, I think you could:

  1. hack the script and change user agent
  2. hack the script and remove parallelization (hoping that less intensive load will be tolerated by edx)
  3. add delays into the code
  4. download over proxies.

What are your thoughts, guys? @rbrito @iemejia
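Options 2 and 3 above (no parallelization, plus delays) could look roughly like the sketch below. It is not taken from the edx-dl codebase: `download_sequentially`, the `fetch` callable, and the two-second default delay are all hypothetical, and whether any particular delay is tolerated by edx is untested.

```python
import time


def download_sequentially(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL in order, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

Here `fetch` stands for whatever function actually performs a single download. Passing `sleep` as a parameter only makes the pause easy to replace when testing.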

Mela commented 7 years ago

Here, @balta2ar: [screenshot]

cortinah commented 7 years ago

I don't really know how this works exactly, but isn't it odd that the .mp4 files (very large) work fine and the .pdfs (very small) are the ones with this issue? Could this just be an encoding problem, such as when pdf email attachments used to get corrupted? Please let me know if I can help with any testing or in any other way.

Mela commented 7 years ago

@cortinah The mp4-Files are, as far as I know, stored at and downloaded from Youtube. Downloading them does not put load on the edX-Servers. Downloading the course pages and other files does.

balta2ar commented 7 years ago

@cortinah Could you please attach your pdfs or check yourself whether they have similar structure as @Mela has demonstrated?

cortinah commented 7 years ago

It looks like the content of these ".pdf" files is the same as @Mela's. Here is one:

<!DOCTYPE html>

Access denied | prod-edxapp.edx-cdn.org used Cloudflare to restrict access

Error 1010 Ray ID: 3c420253b99b4716 • 2017-11-27 03:30:29 UTC

Access denied

What happened?

The owner of this website (prod-edxapp.edx-cdn.org) has banned your access based on your browser's signature (3c420253b99b4716-ua48).

Mela commented 7 years ago

@balta2ar could you point me to the place where I should hack the script to change the user agent?

balta2ar commented 7 years ago

@Mela https://github.com/coursera-dl/edx-dl/blob/master/edx_dl/edx_dl.py#L420 but this part is only used for login, AFAIK. https://github.com/coursera-dl/edx-dl/blob/master/edx_dl/utils.py#L53 -- this function may need patching as well, depending on what comes in headers arguments. I'm a little fuzzy on the details of the codebase, you may need to check all GET requests and print/debug.

Of course, this is not guaranteed to work, as I don't know what else could constitute a "browser's signature".

Mela commented 7 years ago

Thank you, @balta2ar. I will look into it when I find the time.

rbrito commented 7 years ago

Ouch, this makes working with edX a bit hard. 😞

They are a very smart group of people and one change that we make in the user-agent will certainly get caught by other strategies that they can employ, with other fingerprinting techniques.

Anyway, if it were only down to me, I would disable the parallelism by default and change the user-agent. The first point is to be gentle on their servers, while the second is to get through the issue of the program having stopped working.

Thanks for all the action, BTW!

jcline-ieee commented 7 years ago

Another example of failure. Independent of changing user agent.

this works fine: $ youtube-dl https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4
this works fine: $ /usr/local/bin/wget https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4
this works fine: $ curl https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4
this works fine: Open that link in browser

But edx_dl does not work. (I thought it was an SSL problem, which edx_dl has caused for me before, but maybe SSL is not the problem.)

root[skip_or_download] [download] https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4 => edx/Introduction_to_Performance_Psychology/01-Course_Introduction/01-JLDIPPXX2017-V001200_DTH.mp4
root[download_url] Got SSL/Connection error: HTTP Error 403: Forbidden [https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4]
root[download_url] SSL/Connection error ignored: HTTP Error 403: Forbidden
root[_build_subtitles_downloads] No video downloaded for 01
root[skip_or_download] [download] https://edx-video.net/JLDIPPXX2017-V001100_DTH.mp4 => edx/Introduction_to_Performance_Psychology/02-Module_1-_Practice_that_Sticks/01-JLDIPPXX2017-V001100_DTH.mp4
root[download_url] Got SSL/Connection error: HTTP Error 403: Forbidden [https://edx-video.net/JLDIPPXX2017-V001100_DTH.mp4]
root[download_url] SSL/Connection error ignored: HTTP Error 403: Forbidden
root[_build_subtitles_downloads] No video downloaded for 01

Maybe it is urlretrieve that is causing this, because the code below does immediately retrieve a file, but the contents of test.mp4 are an HTML "Access Denied" page:

import six

six.moves.urllib.request.urlretrieve('https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4', "test.mp4")

<title>Access denied | edx-video.net used Cloudflare to restrict access</title>
....
<span class="cf-footer-item"><span data-translate="your_ip">Your IP</span>: 255.255.255.255</span>

Also, urlretrieve does not seem like a good method for other reasons, such as it being difficult to set an SSL context (?).
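For comparison, here is a minimal sketch of a download that sends explicit headers, using only the standard library. By default urllib identifies itself with a `Python-urllib/x.y` User-Agent. The names `make_request` and `download` and the `USER_AGENT` value are assumptions for illustration, not edx-dl code, and nothing in this thread confirms that a different User-Agent alone gets past the block.

```python
import shutil
import urllib.request

# Assumed browser-like value; whether it satisfies the CDN is untested.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64)"


def make_request(url, extra_headers=None):
    """Build a request with an explicit User-Agent instead of urllib's default."""
    headers = {"User-Agent": USER_AGENT}
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def download(url, filename, extra_headers=None):
    """Stream the response body to filename.

    With the default opener, urlopen raises urllib.error.HTTPError
    for 4xx/5xx responses such as 403 Forbidden.
    """
    with urllib.request.urlopen(make_request(url, extra_headers)) as response, \
            open(filename, "wb") as out:
        shutil.copyfileobj(response, out)
```

`urllib.request.urlopen` also accepts a `context` argument taking an `ssl.SSLContext`, which addresses the SSL context concern.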

jcline-ieee commented 7 years ago

I fixed this by not using urlretrieve. I verified my fix works vs. being previously broken on the course https://courses.edx.org/courses/course-v1:JuilliardX+JX004x+2T2017

  1. coursera-dl uses external downloaders (like wget), including parallel or sequential downloading, whereas,
  2. edx-dl uses an internal downloader (causing this bug). edx-dl uses youtube-dl for YouTube links but otherwise uses urlretrieve (which can also have SSL problems with Python on macOS, for example due to certificates missing from the Python install).
  3. the problem must be the missing HTTP headers because of using urlretrieve, not a problem with a server implementing IP blocking or UA blocking. So probably a better internal downloader would fix this download problem.
  4. But why reinvent the wheel, just call wget.

Here is the hacked patch.

Patch

edx_dl.py

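def download_url(url, filename, headers, args):
    """
    Downloads the given url in filename.
    """

    if is_youtube_url(url):
        download_youtube_url(url, filename, headers, args)
    else:
        # jcline: hand non-YouTube URLs to an external wget instead of urlretrieve.
        # -c resumes partial downloads and -O sets the output file.
        # Note: --no-check-certificate turns off TLS certificate verification.
        bin = 'wget'
        cmd = [bin, url, '-c', '-O', filename, '--no-cookies', '--no-check-certificate']
        execute_command(cmd, args)

cortinah commented 7 years ago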

I applied above patch from @jcline-ieee and can confirm the .pdf files downloaded correctly. Thank you very much @jcline-ieee.

elalla commented 6 years ago

In my case it is not working. Maybe you changed something else?

I used the latest version and also made the modification, but I am still getting corrupted pdfs.

mmoglia commented 6 years ago

I also confirm that with @jcline-ieee fix the problem was resolved! Thanks!

elalla commented 6 years ago

@mmoglia can you share the changed file?

jcline-ieee commented 6 years ago

The fix will only work if you have 'wget' installed, because it uses wget to download the files instead of the buggy Python library used previously.

Check that you have wget in the path.
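One way to check this from Python is `shutil.which`, which searches the directories on PATH. The helper name `find_downloader` below is hypothetical and not part of edx-dl.

```python
import shutil


def find_downloader(name="wget"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(name)
```

If `find_downloader()` returns `None`, wget is not reachable from the environment the script runs in, and the patched `download_url` cannot work.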

elalla commented 6 years ago

Many thanks, @jcline-ieee. I have installed wget and edx-dl is working but still have the problem. Just in case, I replaced the previous download_url() in edx_dl.py completely with the code provided by you:

def download_url(url, filename, headers, args):
    """
    Downloads the given url in filename.
    """
    if is_youtube_url(url):
        download_youtube_url(url, filename, headers, args)
    else:
        # jcline
        bin = 'wget'
        cmd = [bin, url, '-c', '-O', filename, '--no-cookies', '--no-check-certificate']
        execute_command(cmd, args)

Is that ok?

jcline-ieee commented 6 years ago

Yes, that's right. The fix is just a few lines. I am using an older version of edx-dl, so I would have to update to the latest to verify why it might not be working for you. If it is not working for you, then:

  1. Post the URL of the course that is failing for you.
  2. Open the corrupt pdf file in a text editor to see what is inside; perhaps the contents of the file are an error message.
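The second check can also be scripted. A real PDF starts with the bytes `%PDF-`, while the broken downloads shown earlier in this thread start with an HTML page. The sketch below is only a rough heuristic, and `classify_download` is a hypothetical helper name.

```python
def classify_download(path):
    """Return 'pdf', 'html', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as handle:
        # Ignore leading whitespace and compare case-insensitively.
        head = handle.read(1024).lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "html"
    return "unknown"
```

A result of `'html'` for a file named `.pdf` suggests the server returned an error page rather than the document.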

elalla commented 6 years ago

@jcline-ieee thanks for the answer. As you mentioned I used the text editor to see what's inside the pdf. Please find it below:


    <div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
    <div id="cf-error-details" class="cf-error-details-wrapper">
      <div class="cf-wrapper cf-header cf-error-overview">
        <h1>
          <span class="cf-error-type" data-translate="error">Error</span>
          <span class="cf-error-code">1010</span>
          <small class="heading-ray-id">Ray ID: 4176bfbcbdcb1acf &bull; 2018-05-07 21:23:39 UTC</small>
        </h1>
        <h2 class="cf-subheadline">Access denied</h2>
      </div><!-- /.header -->

      <section></section><!-- spacer -->

      <div class="cf-section cf-wrapper">
        <div class="cf-columns two">
          <div class="cf-column">
            <h2 data-translate="what_happened">What happened?</h2>
            <p>The owner of this website (prod-edxapp.edx-cdn.org) has banned your access based on your browser's signature (4176bfbcbdcb1acf-ua48).</p>
          </div>

jcline-ieee commented 6 years ago

well, you can see what the error is: "Please enable cookies." So, modify the code to use cookies. Change the '--no-cookies' to '--keep-session-cookies'
Then try again and report back.

elalla commented 6 years ago

@jcline-ieee Thanks (and sorry for the delay). Still not working (the same problem with the 'Please enable cookies' message).

I changed edx-dl as you mentioned, as follows:

cmd = [bin, url, '-c', '-O', filename, '--keep-session-cookies', '--no-check-certificate']

Am I doing something wrong?

jcline-ieee commented 6 years ago

What is the URL of the pdf? (You should be able to see the URL in the edx-dl output.) Can you download it manually with wget from the command line? That is:

wget --keep-session-cookies --no-check-certificate "URL"

Does that also fail?
