Open cortinah opened 7 years ago
I'm encountering the same issue. Looks like edx has blocked the access via edx-dl:
"The owner of this website (prod-edxapp.edx-cdn.org) has banned your access based on your browser's signature"
@Mela Could you please show a screenshot of that message?
It is sad that edx resorts to such measures (I've never noticed anything like that from Coursera, for example). We've had several contacts with edx employees (#392, #377), but the discussions eventually led nowhere. Unfortunately, it looks as if this tool is written in such a way that it puts undue load on edx servers. It probably should be revamped and overhauled, but that can't be done instantly, and it looks like the primary contributors don't have the opportunity to do that right now.
From the practical side, I think you could:
What are your thoughts, guys? @rbrito @iemejia
Here, @balta2ar
I don't really know how this works exactly, but isn't it odd that the .mp4 files (very large) work fine and the .pdfs (very small) are the ones with this issue? Could this just be an encoding problem, such as when pdf email attachments used to get corrupted? Please let me know if i can help with any testing or in any other way.
@cortinah The mp4-Files are, as far as I know, stored at and downloaded from Youtube. Downloading them does not put load on the edX-Servers. Downloading the course pages and other files does.
@cortinah Could you please attach your pdfs or check yourself whether they have similar structure as @Mela has demonstrated?
It looks like the content of these ".pdf" files is the same as @Mela 's: Here is one:
<!DOCTYPE html>
The owner of this website (prod-edxapp.edx-cdn.org) has banned your access based on your browser's signature (3c420253b99b4716-ua48).
@balta2ar could you point me to the place where I should hack the script to change the user agent?
@Mela
https://github.com/coursera-dl/edx-dl/blob/master/edx_dl/edx_dl.py#L420 but this part is only used for login, AFAIK.
https://github.com/coursera-dl/edx-dl/blob/master/edx_dl/utils.py#L53 -- this function may need patching as well, depending on what comes in headers arguments. I'm a little fuzzy on the details of the codebase, you may need to check all GET requests and print/debug.
Of course, this is not guaranteed to work, as I don't know what else could constitute a "browser's signature".
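For anyone who wants to experiment with the user agent, here is a minimal sketch of building a request with a browser-like User-Agent using only the standard library. The names `build_request` and `BROWSER_USER_AGENT` are hypothetical and are not part of edx-dl, and the agent string is just an example value. As noted above, there is no guarantee that this alone gets past the block.

```python
import urllib.request

# Hypothetical example value; any real browser's User-Agent string could be used.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0"
)


def build_request(url, headers=None, user_agent=BROWSER_USER_AGENT):
    """Return a Request carrying the caller's headers plus a User-Agent."""
    merged = dict(headers or {})
    # Set the User-Agent last so it overrides any value passed in by the caller.
    merged["User-Agent"] = user_agent
    return urllib.request.Request(url, headers=merged)
```

The resulting object can be passed to `urllib.request.urlopen` in place of a bare URL.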
Thank you, @balta2ar. I will look into it when I find the time.
Ouch, this makes working with edX a bit hard. 😞
They are a very smart group of people and one change that we make in the user-agent will certainly get caught by other strategies that they can employ, with other fingerprinting techniques.
Anyway, if it were only down to me, I would disable the parallelism by default and change the user-agent. The first point is to be gentle on their servers, while the second is to get through the issue of the program having stopped working.
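To illustrate the "be gentle" part, here is a rough sketch of fetching files one at a time with a pause between requests. This is not how edx-dl is currently structured: `download_sequentially` and its `fetch` callback are hypothetical names, and the one-second default delay is an arbitrary assumption.

```python
import time


def download_sequentially(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL in order, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        # Sleep before every request except the first one.
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

Any function that downloads a single URL can be passed as `fetch`, so the throttling stays separate from the download logic.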
Thanks for all the action, BTW!
Another example of failure. Independent of changing user agent.
this works fine: $ youtube-dl https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4
this works fine: $ /usr/local/bin/wget https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4
this works fine: $ curl https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4
this works fine: Open that link in browser
But edx_dl does not work: ( I thought it was an SSL problem - which edx_dl has caused for me before, but maybe SSL is not the problem )
root[skip_or_download] [download] https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4 => edx/Introduction_to_Performance_Psychology/01-Course_Introduction/01-JLDIPPXX2017-V001200_DTH.mp4
root[download_url] Got SSL/Connection error: HTTP Error 403: Forbidden [https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4]
root[download_url] SSL/Connection error ignored: HTTP Error 403: Forbidden
root[_build_subtitles_downloads] No video downloaded for 01
root[skip_or_download] [download] https://edx-video.net/JLDIPPXX2017-V001100_DTH.mp4 => edx/Introduction_to_Performance_Psychology/02-Module_1-_Practice_that_Sticks/01-JLDIPPXX2017-V001100_DTH.mp4
root[download_url] Got SSL/Connection error: HTTP Error 403: Forbidden [https://edx-video.net/JLDIPPXX2017-V001100_DTH.mp4]
root[download_url] SSL/Connection error ignored: HTTP Error 403: Forbidden
root[_build_subtitles_downloads] No video downloaded for 01
Maybe urlretrieve is what is causing this: the code below immediately retrieves a file, but the contents of test.mp4 are an HTML "Access denied" page:
import six
six.moves.urllib.request.urlretrieve('https://edx-video.net/JLDIPPXX2017-V001200_DTH.mp4', "test.mp4")
<title>Access denied | edx-video.net used Cloudflare to restrict access</title>
....
<span class="cf-footer-item"><span data-translate="your_ip">Your IP</span>: 255.255.255.255</span>
Also, urlretrieve does not seem like a good method for other reasons, such as the difficulty of setting an SSL context (?).
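As a possible alternative to urlretrieve, here is a minimal sketch that uses `urllib.request.urlopen` with explicit headers and an explicit SSL context. The function name `fetch_to_file` and the 30-second timeout are assumptions for illustration; this is not the code edx-dl uses. In Python 3, `urlopen` raises `urllib.error.HTTPError` on a 403 response, so a blocked request should surface as an exception instead of an HTML page saved under the target filename.

```python
import shutil
import ssl
import urllib.request


def fetch_to_file(url, filename, headers=None, context=None, timeout=30):
    """Stream url into filename, sending the given headers."""
    request = urllib.request.Request(url, headers=headers or {})
    if context is None:
        # Default context: certificate verification stays enabled.
        context = ssl.create_default_context()
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        with open(filename, "wb") as out:
            # Copy in chunks so large videos are not held in memory.
            shutil.copyfileobj(response, out)
    return filename
```

Passing the context explicitly makes it easy to swap in a differently configured `ssl.SSLContext` while debugging SSL errors.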
I fixed this by not using urlretrieve. I verified my fix works vs. being previously broken on the course https://courses.edx.org/courses/course-v1:JuilliardX+JX004x+2T2017
Here is the hacked patch.
edx_dl.py
def download_url(url, filename, headers, args):
    """
    Downloads the given url in filename.
    """
    if is_youtube_url(url):
        download_youtube_url(url, filename, headers, args)
    else:
        # jcline
        bin = 'wget'
        cmd = [bin, url, '-c', '-O', filename, '--no-cookies', '--no-check-certificate']
        execute_command(cmd, args)
I applied above patch from @jcline-ieee and can confirm the .pdf files downloaded correctly. Thank you very much @jcline-ieee.
In my case it is not working; maybe you changed something else?
I used the latest version and also made the modification, but I am still getting corrupted pdfs.
I also confirm that with @jcline-ieee fix the problem was resolved! Thanks!
@mmoglia can you share the changed file?
The fix will only work if you have 'wget' installed, because it uses wget to download the files instead of the buggy python library that was used previously.
check to see you have wget in the path.
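One quick way to check this from Python is `shutil.which`, which returns the full path of an executable found on the PATH, or None if there is none. The wrapper name `find_downloader` is hypothetical.

```python
import shutil


def find_downloader(name="wget"):
    """Return the full path of the named executable, or None if it is not on PATH."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_downloader()
    if path is None:
        print("wget was not found on PATH; the patched download_url will fail")
    else:
        print("wget found at", path)
```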
Many thanks, @jcline-ieee. I have installed wget and edx-dl is working but still have the problem. Just in case, I replaced the previous download_url() in edx_dl.py completely with the code provided by you:
def download_url(url, filename, headers, args):
    """
    Downloads the given url in filename.
    """
    if is_youtube_url(url):
        download_youtube_url(url, filename, headers, args)
    else:
        # jcline
        bin = 'wget'
        cmd = [bin, url, '-c', '-O', filename, '--no-cookies', '--no-check-certificate']
        execute_command(cmd, args)
Is that ok?
Yes, that's right. The fix is just a few lines. I am using an older version of edx-dl, so I would have to update to the latest to verify why it might not be working for you. If it is not working for you, then:
1. Post the URL of the course which is failing for you.
2. Open the corrupt pdf file in a text editor to see what is inside; perhaps the contents of the file are an error message.
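The same check can be scripted. A real PDF begins with the bytes `%PDF-`, whereas the broken files in this thread begin with an HTML page. The sketch below looks at the start of a file and reports which of the two it resembles; `classify_download` is a hypothetical helper, not part of edx-dl.

```python
def classify_download(path, probe_size=1024):
    """Return 'pdf', 'html', or 'unknown' based on the first bytes of the file."""
    with open(path, "rb") as handle:
        head = handle.read(probe_size)
    if head.startswith(b"%PDF-"):
        return "pdf"
    # Error pages may have leading whitespace and mixed-case tags.
    stripped = head.lstrip().lower()
    if stripped.startswith(b"<!doctype html") or stripped.startswith(b"<html"):
        return "html"
    return "unknown"
```

Running it over a download directory would show at a glance which ".pdf" files are actually saved error pages.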
@jcline-ieee thanks for the answer. As you mentioned I used the text editor to see what's inside the pdf. Please find it below:
<div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please enable cookies.</div>
<div id="cf-error-details" class="cf-error-details-wrapper">
<div class="cf-wrapper cf-header cf-error-overview">
<h1>
<span class="cf-error-type" data-translate="error">Error</span>
<span class="cf-error-code">1010</span>
<small class="heading-ray-id">Ray ID: 4176bfbcbdcb1acf • 2018-05-07 21:23:39 UTC</small>
</h1>
<h2 class="cf-subheadline">Access denied</h2>
</div><!-- /.header -->
<section></section><!-- spacer -->
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="what_happened">What happened?</h2>
<p>The owner of this website (prod-edxapp.edx-cdn.org) has banned your access based on your browser's signature (4176bfbcbdcb1acf-ua48).</p>
</div>
well, you can see what the error is: "Please enable cookies."
So, modify the code to use cookies.
Change the '--no-cookies' to '--keep-session-cookies'
Then try again and report back.
@jcline-ieee Thanks (and sorry for the delay). Still not working (the same problem with 'Please enable cookies').
I changed edx-dl as you mentioned, as follows:
cmd = [bin, url, '-c', '-O', filename, '--keep-session-cookies', '--no-check-certificate']
Am I doing something wrong?
What is the URL of the pdf? (You should be able to see the URL in the edx output.) Can you download it manually with wget from your command line? i.e. wget --keep-session-cookies --no-check-certificate "URL"
Does that also fail?
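A note for anyone trying this: as far as I know, wget's --keep-session-cookies only affects what --save-cookies writes out, so on its own it does not make wget send any cookies. Below is a sketch of building the wget command with --load-cookies and --user-agent instead. It assumes you have exported your logged-in browser cookies to a Netscape-format file. `build_wget_command` and its parameters are hypothetical names, and since the error page mentions the "browser's signature", there is no guarantee this gets past the block.

```python
def build_wget_command(url, filename, cookies_file=None, user_agent=None):
    """Return a wget argument list for downloading url into filename."""
    # -c resumes a partial download, -O sets the output file.
    cmd = ["wget", url, "-c", "-O", filename]
    if cookies_file:
        # Assumed to be a cookies.txt file exported from a logged-in browser.
        cmd += ["--load-cookies", cookies_file]
    if user_agent:
        cmd.append("--user-agent=" + user_agent)
    return cmd
```

The returned list could replace the `cmd` line in the patch above; certificate checking is left on here, unlike in the original patch.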
########################
Downloaded pdfs are corrupt
I am able to download all course materials. the .mp4 videos work fine, but the .pdf files are corrupt
Your environment
Steps to reproduce
Tell us how to reproduce this issue. Please provide us the course URL, and the specific subsection or unit if possible. Course url: https://courses.edx.org/courses/course-v1:CaltechX+CS1156x+3T2017/course/
./edx-dl.py -u xxxx https://courses.edx.org/courses/course-v1:CaltechX+CS1156x+3T2017/course/
Expected behaviour
pdf files should be able to be read.
Actual behaviour
pdf files can not be opened by Preview on Mac or pdf reader on Linux Mint