Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

Issue downloading large video/gif files #142

Closed: Tomaster134 closed this issue 3 years ago

Tomaster134 commented 4 years ago

I'm trying to download some mp4 files that are anywhere between 30 and 90 megabytes, and some of them seem to fail, leaving only a temporary file in the save location.

From the CONSOLE_LOG file:

    ContentTooShortError: <urlopen error retrieval incomplete: got only 3964178 out of 59719970 bytes>
    See CONSOLE_LOG.txt for more information

    ERROR:root:ContentTooShortError
    Traceback (most recent call last):
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\site-packages\cx_Freeze\initscripts\__startup__.py", line 40, in run
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\site-packages\cx_Freeze\initscripts\Console.py", line 37, in run
      File "script.py", line 351, in <module>
      File "script.py", line 337, in main
      File "script.py", line 155, in download
      File "script.py", line 95, in downloadPost
      File "D:\projects\bulk-downloader-for-reddit\src\downloaders\redgifs.py", line 26, in __init__
      File "D:\projects\bulk-downloader-for-reddit\src\downloaders\downloaderUtils.py", line 85, in getFile
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 286, in urlretrieve
    urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 22084883 out of 90531622 bytes>

    ERROR:root:ContentTooShortError
    Traceback (most recent call last):
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\site-packages\cx_Freeze\initscripts\__startup__.py", line 40, in run
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\site-packages\cx_Freeze\initscripts\Console.py", line 37, in run
      File "script.py", line 351, in <module>
      File "script.py", line 337, in main
      File "script.py", line 155, in download
      File "script.py", line 95, in downloadPost
      File "D:\projects\bulk-downloader-for-reddit\src\downloaders\redgifs.py", line 26, in __init__
      File "D:\projects\bulk-downloader-for-reddit\src\downloaders\downloaderUtils.py", line 85, in getFile
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 286, in urlretrieve
    urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 22445339 out of 76206839 bytes>

    ERROR:root:ContentTooShortError
    Traceback (most recent call last):
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\site-packages\cx_Freeze\initscripts\__startup__.py", line 40, in run
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\site-packages\cx_Freeze\initscripts\Console.py", line 37, in run
      File "script.py", line 351, in <module>
      File "script.py", line 337, in main
      File "script.py", line 155, in download
      File "script.py", line 95, in downloadPost
      File "D:\projects\bulk-downloader-for-reddit\src\downloaders\redgifs.py", line 26, in __init__
      File "D:\projects\bulk-downloader-for-reddit\src\downloaders\downloaderUtils.py", line 85, in getFile
      File "C:\Users\Ali\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 286, in urlretrieve
    urllib.error.ContentTooShortError: <urlopen error retrieval incomplete: got only 3964178 out of 59719970 bytes>
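The call path in the traceback ends in urllib.request.urlretrieve, which raises ContentTooShortError whenever it reads fewer bytes than the Content-Length header advertised. A minimal sketch of that failure mode follows; the URL and file name are placeholders, not BDFR's actual code:

    import urllib.error
    import urllib.request

    # Placeholder values; any large file on a host that drops the connection
    # partway through the transfer will reproduce the error.
    url = "https://example.com/large-video.mp4"
    destination = "large-video.mp4.tmp"

    try:
        # urlretrieve compares the bytes actually read against the
        # Content-Length header and raises when the transfer stops early.
        urllib.request.urlretrieve(url, destination)
    except urllib.error.ContentTooShortError as error:
        # The partially written file stays on disk, which matches the stray
        # temporary files described in the report above.
        print(f"Incomplete download, partial file left at {destination}: {error}")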

DennisPing commented 4 years ago

I have also noted this issue here: https://github.com/aliparlakci/bulk-downloader-for-reddit/issues/130

The bug comes from the host server (gfycat), which periodically closes the connection partway through the download. The author thinks gfycat does this on purpose, so unfortunately nothing can be done about it on this end. I personally use Motrix to grab the skipped videos/gifs.
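Retrying inside the script only goes so far, because urlretrieve has no resume support and every attempt starts over from byte zero. A rough sketch of such a retry wrapper; the function name and retry count are illustrative, not part of BDFR:

    import urllib.error
    import urllib.request

    def retrieve_with_retries(url, destination, attempts=3):
        # Each attempt re-downloads from byte zero; urlretrieve cannot resume.
        for attempt in range(1, attempts + 1):
            try:
                urllib.request.urlretrieve(url, destination)
                return True
            except urllib.error.ContentTooShortError as error:
                print(f"Attempt {attempt} was cut short: {error}")
        # Past this point the partial file is still on disk; an external
        # downloader with resume support (e.g. Motrix) is the practical fallback.
        return False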

Symbiomatrix commented 3 years ago

The downloader uses urllib (basically the most barebones package Python has to offer for web access), so this isn't too surprising. requests with stream=True handles large file downloads better, and youtube-dl (which supports gfycat, among others) has the further advantage of being able to resume from the last point of failure. Without those, failed video downloads are a waste of bandwidth. Besides that, IIRC there are (or were) several redundant page loads for metadata scattered throughout the code, which would exacerbate the problem.
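To make that suggestion concrete, here is a rough sketch (not BDFR's actual code) of a streaming download with requests and an HTTP Range header, assuming the host honours range requests:

    import os
    import requests

    def resumable_download(url, destination, chunk_size=1024 * 1024):
        # Resume from however many bytes are already on disk, if any.
        already_have = os.path.getsize(destination) if os.path.exists(destination) else 0
        headers = {"Range": f"bytes={already_have}-"} if already_have else {}

        with requests.get(url, headers=headers, stream=True, timeout=30) as response:
            if response.status_code == 206:
                mode = "ab"  # server honoured the Range header, append to the partial file
            else:
                response.raise_for_status()
                mode = "wb"  # full response, start the file over
            with open(destination, mode) as handle:
                # stream=True keeps the body out of memory; write it in chunks.
                for chunk in response.iter_content(chunk_size=chunk_size):
                    handle.write(chunk)

If the connection drops partway through, calling the function again picks up from the bytes already written, which is the youtube-dl-style resume behaviour mentioned above.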