Closed: lightbrush closed this issue 8 years ago
So I got a possible answer.
Because I use shadowsocks to connect to the internet, the connection is sometimes unstable. As a result, some pages couldn't be fully loaded on the first try (they stay stuck loading forever). Perhaps this caused the download to fail and finally stopped the whole script.
Would you like to write a function that refreshes the page and retries the download several times when urllib.error.ContentTooShortError appears?
Many thanks~
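A retry wrapper along these lines might do what's being asked. This is a minimal sketch, not the script's actual code: `retrieve_with_retries` is a hypothetical name, and the injectable `_retrieve` parameter is my addition so the retry logic can be exercised without a real network connection.

```python
import urllib.error
import urllib.request


def retrieve_with_retries(url, filename, retries=3,
                          _retrieve=urllib.request.urlretrieve):
    """Retry a download that dies with ContentTooShortError.

    `retrieve_with_retries` and `_retrieve` are hypothetical names,
    not part of the original script; `_retrieve` exists only so the
    retry loop can be tested with a fake downloader.
    """
    last_error = None
    for _ in range(retries):
        try:
            return _retrieve(url, filename)
        except urllib.error.ContentTooShortError as e:
            last_error = e  # the partial bytes live in e.content
    # All attempts came up short: re-raise the last failure.
    raise last_error
```

On a flaky connection this simply tries again from scratch; it does not resume the partial download.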
Hi Chillee, now I'm pretty sure the problem is my unstable internet connection. As the documentation explains:
urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.
The Content-Length is treated as a lower bound: if there’s more data to read, urlretrieve() reads more data, but if less data is available, it raises the exception.
You can still retrieve the downloaded data in this case, it is stored in the content attribute of the exception instance.
If no Content-Length header was supplied, urlretrieve() can not check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.
Is it possible to restart the download until the script gets the correct content?
Though I'm not familiar with Python, I think these findings might be useful to you. Hopefully this script can help more people who live in areas with bad internet connections.
Thanks a lot
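For what it's worth, the `content` attribute mentioned in the quoted docs can be used to salvage whatever bytes did arrive before the connection broke. A small sketch (the `save_partial` helper is a hypothetical name, not part of the script):

```python
import urllib.error


def save_partial(exc, path):
    """Dump the truncated data carried by a ContentTooShortError.

    Per the urllib docs quoted above, the bytes received before the
    download was interrupted are stored on the exception's `content`
    attribute; this just writes them to disk for later inspection.
    """
    with open(path, "wb") as f:
        f.write(exc.content)
    return len(exc.content)
```

Whether a partial file is worth keeping depends on the file type; a truncated PDF usually isn't.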
Is it able to download any files, or does it always fail on the first file?
Can you try replacing lines 120-127
    try:
        if sys.version_info >= (3, 0):
            urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
        else:
            urllib.urlretrieve(url, path+url[url.rfind('/'):])
    except urllib.error.HTTPError:
        print("Failed to download "+url)
        continue
with
    r = requests.get(url)
    with open(path+url[url.rfind('/'):], 'wb') as f:
        f.write(r.content)
You'll probably need to fix formatting.
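If the plain `requests.get` swap works, a slightly hardened variant might fail faster and louder on a flaky connection. A sketch, not the script's actual code: `dest_path` just reproduces the script's `path+url[url.rfind('/'):]` naming, and the 30-second timeout is an arbitrary assumption.

```python
def dest_path(path, url):
    """Reproduce the script's naming: directory + last URL segment."""
    return path + url[url.rfind('/'):]


def download(url, path, timeout=30):
    """Like the snippet above, but with a timeout and a status check,
    so a hung connection or a 404 raises an exception instead of
    stalling forever or silently writing an HTML error page to disk."""
    import requests  # deferred so dest_path works without requests installed
    r = requests.get(url, timeout=timeout)
    r.raise_for_status()
    with open(dest_path(path, url), 'wb') as f:
        f.write(r.content)
```

Note that `timeout` bounds each read on the socket, not the total download time.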
Hi Chillee,
This problem appears randomly. The download worked at first but always stopped at some point.
For example, when the script tries to download the "Lecture Slides" section from https://class.coursera.org/pgm-003/, it stops on a random pdf file (I find it downloads from top to bottom, but the stopping point is unpredictable) and returns the ContentTooShortError. Because of this, I think the cause might be my unstable internet connection.
Now I'm trying your new code. Most of the time it works well~ The script downloaded all the materials I need for several courses except Quiz and Assignment (because I didn't pass -q -a). COOL!!! Thanks very much! But occasionally it returned this error:
Traceback (most recent call last):
  File "dl_all.py", line 288, in <module>
    download_sidebar_pages(session)
  File "dl_all.py", line 227, in download_sidebar_pages
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 121, in download_all_zips_on_page
    r = requests.get(url)
  File "D:\Anaconda3\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 468, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 570, in send
    adapter = self.get_adapter(url=request.url)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 644, in get_adapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'javascript:window.location.reload(false);'
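As background for this traceback: requests only ships connection adapters for http:// and https://, so a scraped javascript: link raises exactly this InvalidSchema. One way to guard against it (the `is_downloadable` helper is my name for it, not part of the script) is to check the scheme before calling `requests.get`:

```python
from urllib.parse import urlparse


def is_downloadable(url):
    """Reject javascript:, mailto:, and other schemes requests can't fetch."""
    return urlparse(url).scheme in ("http", "https")
```

The scraping loop could then simply `continue` past any link that fails this check.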
And I find another potential problem: sometimes Firefox will get stuck in this state, which makes the script pause.
One more suggestion: for the download suffixes, you may add '.r' (R script) and '.xls', '.xlsx', '.csv' (Excel files).
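Suffix matching could also be done case-insensitively so files like DATA.CSV are caught too. A sketch, where the suffix tuple itself is an assumption (the script's real list isn't shown in this thread):

```python
# Suffixes worth downloading; '.r', '.xls', '.xlsx', '.csv' are the
# suggested additions, the rest are assumed from context.
DOWNLOAD_SUFFIXES = ('.pdf', '.pptx', '.zip', '.r', '.xls', '.xlsx', '.csv')


def wanted(url):
    """Case-insensitive suffix check against the download list."""
    return url.lower().endswith(DOWNLOAD_SUFFIXES)
```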
Update:
(same InvalidSchema traceback as above)
This error is not unpredictable: it only happens on this page, https://class.coursera.org/compfinance-009/questions
Update:
I find this error is caused by the pop of the suffixes '.r', '.xls', '.xlsx', '.csv', yet I don't know why either....
I'll add .r to the download list, but I'm not really sure what to do about the connection issue. I had inconsistency issues with downloading as well when I was using a VPN.
Thanks Chillee~ At least most of the problems I met have been solved :) Thanks for your great script~
It is quick to download with wget, which has a recursive download feature to fetch everything under the specified directory:
$ wget -r [url]
Can you try replacing lines 120-127
    try:
        if sys.version_info >= (3, 0):
            urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
        else:
            urllib.urlretrieve(url, path+url[url.rfind('/'):])
    except urllib.error.HTTPError:
        print("Failed to download "+url)
        continue
with
    r = requests.get(url)
    with open(path+url[url.rfind('/'):], 'wb') as f:
        f.write(r.content)
You'll probably need to fix formatting.
Works for me! Thank you!
Hi dude! After I downloaded the videos, the script always returns this error for every course I selected. (I tried different courses to check the bug; the common part is urllib.error.ContentTooShortError. I don't know what it is.)
My environment: