Chillee / coursera-dl-all

MIT License
190 stars 54 forks

ContentTooShortError #7

Closed. lightbrush closed this issue 8 years ago

lightbrush commented 8 years ago

Hi dude! After downloading the videos, the script always returns errors like the one below for every course I selected. (I tried different courses to check the bug; the common part is urllib.error.ContentTooShortError. I don't know what it is.)

My environment:

Logging In....
Logged in!
[('https://class.coursera.org/pgm-003/wiki/view?page=CourseSchedule', 'CourseSchedule'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=CourseInformation', 'CourseInformation'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=CourseStaff', 'OurTeam'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=CourseLogistics', 'CourseLogistics'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=OctaveInstallation', 'OctaveInstallation'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=LectureSlides', 'LectureSlides'),
 ('https://class.coursera.org/pgm-003/questions', 'QuickQuestions15'),
 ('https://class.coursera.org/pgm-003/class/index', 'Home'),
 ('https://class.coursera.org/pgm-003/assignment/index', 'ProgrammingAssignments'),
 ('https://class.coursera.org/pgm-003/forum/index', 'DiscussionForums'),
 ('https://class.coursera.org/pgm-003/wiki/view?page=FAQList', 'FAQ')]
Traceback (most recent call last):
  File "dl_all.py", line 290, in <module>
    download_sidebar_pages(session)
  File "dl_all.py", line 229, in download_sidebar_pages
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 123, in download_all_zips_on_page
    urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
  File "D:\Anaconda3\lib\urllib\request.py", line 228, in urlretrieve
    % (read, size), result)
urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 1430256 out of 1432748 bytes>

lightbrush commented 8 years ago

So I think I have a possible explanation.

Because I use shadowsocks to connect to the internet, the connection is sometimes unstable. As a result, some pages can't be fully loaded on the first try (they get stuck loading forever). This probably causes the download to fail and eventually stops the whole script.

Would you be willing to write a function that refreshes the page and retries the download a few times when urllib.error.ContentTooShortError appears?
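
For instance, a small retry wrapper along these lines might do it (just a sketch; the helper name and retry counts are made up, and nothing like this exists in dl_all.py yet):

import time
import urllib.error
import urllib.request

def retrieve_with_retries(url, dest, attempts=3, delay=5):
    # Retry urlretrieve a few times when the download comes up short.
    for attempt in range(1, attempts + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return True
        except urllib.error.ContentTooShortError:
            print("Incomplete download (attempt %d/%d): %s" % (attempt, attempts, url))
            time.sleep(delay)
    return False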

Many thanks~

lightbrush commented 8 years ago

Hi Chillee, now I'm pretty sure the problem is my unstable internet connection. As the documentation explains:

urlretrieve() will raise ContentTooShortError when it detects that the amount of data available was less than the expected amount (which is the size reported by a Content-Length header). This can occur, for example, when the download is interrupted.

The Content-Length is treated as a lower bound: if there’s more data to read, urlretrieve() reads more data, but if less data is available, it raises the exception.

You can still retrieve the downloaded data in this case, it is stored in the content attribute of the exception instance.

If no Content-Length header was supplied, urlretrieve() can not check the size of the data it has downloaded, and just returns it. In this case you just have to assume that the download was successful.

Is it possible to restart the download until the script gets the correct content?

Though I'm not familiar with Python, I think these findings might be useful to you. Hopefully this script can help more people who live in areas with poor internet connections.

Thanks a lot

Chillee commented 8 years ago

Is it able to download any files, or does it always fail on the first file?

Chillee commented 8 years ago

Can you try replacing lines 120-127

            try:
                if sys.version_info >= (3, 0):
                    urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
                else:
                    urllib.urlretrieve(url, path+url[url.rfind('/'):])
            except urllib.error.HTTPError:
                print("Failed to download "+url)
                continue

with

            r = requests.get(url)
            with open(path+url[url.rfind('/'):], 'wb') as f:
                f.write(r.content)

You'll probably need to fix formatting.
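
For reference, with the indentation fixed it amounts to roughly this as a standalone helper (the function wrapper is only for illustration; in dl_all.py the three lines sit directly inside the download loop):

import requests

def fetch_to_file(url, path):
    # Download url with requests and save it under path, keeping the original file name.
    r = requests.get(url)
    with open(path + url[url.rfind('/'):], 'wb') as f:
        f.write(r.content)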

lightbrush commented 8 years ago

Hi Chillee,

The problem appeared randomly: the download worked at first but always stopped at some point.

For example, when the script tries to download the "Lecture Slides" part from https://class.coursera.org/pgm-003/, it stops on a random PDF file (the downloads run from top to bottom, and the failure point is unpredictable) and returns the ContentTooShortError. Because of this, I think the cause is my unstable internet connection.

Now I'm trying your new code. Most of the time it works well~ The script downloaded all the materials I need for several courses except the quizzes and assignments (because I didn't pass -q -a). COOL!!! Thanks very much! But occasionally it returns this error:

Traceback (most recent call last):
  File "dl_all.py", line 288, in <module>
    download_sidebar_pages(session)
  File "dl_all.py", line 227, in download_sidebar_pages
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 121, in download_all_zips_on_page
    r = requests.get(url)
  File "D:\Anaconda3\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 468, in reque
st
    resp = self.send(prep, **send_kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 570, in send
    adapter = self.get_adapter(url=request.url)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 644, in get_a
dapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'javasc
ript:window.location.reload(false);'

And I found another potential problem: sometimes Firefox gets stuck in the situation shown in the screenshot below, which makes the script pause.

[screenshot: qq 20160621232636]

One more suggestion: for the download suffixes, you could add '.r' (R script), '.xls', '.xlsx', and '.csv' (Excel/data files).


Update:

Traceback (most recent call last):
  File "dl_all.py", line 288, in <module>
    download_sidebar_pages(session)
  File "dl_all.py", line 227, in download_sidebar_pages
    download_all_zips_on_page(session, path)
  File "dl_all.py", line 121, in download_all_zips_on_page
    r = requests.get(url)
  File "D:\Anaconda3\lib\site-packages\requests\api.py", line 69, in get
    return request('get', url, params=params, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\api.py", line 50, in request
    response = session.request(method=method, url=url, **kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 468, in reque
st
    resp = self.send(prep, **send_kwargs)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 570, in send
    adapter = self.get_adapter(url=request.url)
  File "D:\Anaconda3\lib\site-packages\requests\sessions.py", line 644, in get_a
dapter
    raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'javasc
ript:window.location.reload(false);'

This error is actually predictable: it only happens on this page: https://class.coursera.org/compfinance-009/questions


Update:

I find this error is caused by the '.r', '.xls', '.xlsx', '.csv' suffixes, yet I don't know why either....
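
In case it helps, one way to avoid that InvalidSchema crash would be to skip links that aren't real http/https URLs before calling requests.get (just a sketch; no such check exists in dl_all.py):

from urllib.parse import urlparse

def is_downloadable(url):
    # Reject javascript: pseudo-links and other non-HTTP(S) hrefs.
    return urlparse(url).scheme in ('http', 'https')

# e.g. is_downloadable('javascript:window.location.reload(false);') -> False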

Chillee commented 8 years ago

I'll add .r to the download list, but I'm not really sure what to do about the connection issue. I had inconsistency issues with downloading as well when I was using a VPN.
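
For reference, that presumably just means extending whatever suffix filter the script uses, roughly like this (the variable name and the pre-existing entries are guesses; only the newly suggested extensions come from this thread):

# Hypothetical suffix filter -- dl_all.py's real list may look different.
DOWNLOAD_SUFFIXES = ('.pdf', '.pptx', '.zip',        # presumably already handled
                     '.r', '.xls', '.xlsx', '.csv')  # additions suggested above

def wants_download(url):
    # True if the link points at a file type worth saving.
    return url.lower().endswith(DOWNLOAD_SUFFIXES)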

lightbrush commented 8 years ago

Thanks Chillee~ At least most of the problems I ran into have been solved :) Thanks for your great script~

mikechen66 commented 3 years ago

It is quick to download files with wget, which has a recursive option to download everything under a specified directory.

$ wget -r [url]

ZERO2ER0 commented 2 years ago

Can you try replacing lines 120-127

            try:
                if sys.version_info >= (3, 0):
                    urllib.request.urlretrieve(url, path+url[url.rfind('/'):])
                else:
                    urllib.urlretrieve(url, path+url[url.rfind('/'):])
            except urllib.error.HTTPError:
                print("Failed to download "+url)
                continue

with

            r = requests.get(url)
            with open(path+url[url.rfind('/'):], 'wb') as f:
                f.write(r.content)

You'll probably need to fix formatting.

Works for me! Thank you!