Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

[BUG] Imgur 404 error but link works in browser #869

Open luckybear992 opened 1 year ago

luckybear992 commented 1 year ago

Description

Imgur links keep returning a 404 error in BDFR even though they work in my browser. An Imgur link such as https://i.imgur.com/xxxxxx.gifv opens in my browser, while https://i.imgur.com/xxxxxx WITHOUT the .gifv extension loads a 404 page. The two links that fail with a 404 in the log below work fine in my browser when I use the i.imgur.com link that ends with the .gifv extension.

Command

python3 -m bdfr download L:\bdfr --subreddit thatsthespot --no-dupes

Environment

Logs

[2023-05-31 09:50:26,348 - bdfr.connector - DEBUG] - Setting maximum download wait time to 120 seconds
[2023-05-31 09:50:26,348 - bdfr.connector - DEBUG] - Setting datetime format string to ISO
[2023-05-31 09:50:26,349 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2023-05-31 09:50:26,349 - bdfr.connector - Level 9] - Created download filter
[2023-05-31 09:50:26,349 - bdfr.connector - Level 9] - Created time filter
[2023-05-31 09:50:26,349 - bdfr.connector - Level 9] - Created sort filter
[2023-05-31 09:50:26,350 - bdfr.connector - Level 9] - Create file name formatter
[2023-05-31 09:50:26,350 - bdfr.connector - DEBUG] - Using unauthenticated Reddit instance
[2023-05-31 09:50:26,351 - bdfr.connector - Level 9] - Created site authenticator
[2023-05-31 09:50:26,802 - bdfr.connector - DEBUG] - Added submissions from subreddit thatsthespot
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved subreddits
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved multireddits
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved user data
[2023-05-31 09:50:26,803 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2023-05-31 09:50:38,557 - bdfr.downloader - DEBUG] - Attempting to download submission 13w4i73
[2023-05-31 09:50:38,558 - bdfr.downloader - DEBUG] - Using Imgur with url https://i.imgur.com/DnZYrnB.gifv
[2023-05-31 09:50:38,750 - bdfr.downloader - ERROR] - Site Imgur failed to download submission 13w4i73: Server responded with 404 to https://imgur.com/DnZYrnB
[2023-05-31 09:50:38,751 - bdfr.downloader - DEBUG] - Attempting to download submission 13vrada
[2023-05-31 09:50:38,751 - bdfr.downloader - DEBUG] - Using Redgifs with url https://redgifs.com/watch/parchedvalidhog
[2023-05-31 09:50:38,939 - bdfr.downloader - DEBUG] - File L:\bdfr\thatsthespot\twitchrule_She is really really cute when she want that cum_13vrada.mp4 from submission 13vrada already exists, continuing
[2023-05-31 09:50:38,939 - bdfr.downloader - INFO] - Downloaded submission 13vrada from thatsthespot
[2023-05-31 09:50:38,940 - bdfr.downloader - DEBUG] - Attempting to download submission 13vc2l6
[2023-05-31 09:50:38,940 - bdfr.downloader - DEBUG] - Using Redgifs with url https://www.redgifs.com/watch/pointlesscanineibizanhound#rel=user%3Aariacolexo;order=new
[2023-05-31 09:50:39,105 - bdfr.downloader - DEBUG] - File L:\bdfr\thatsthespot\ariacole___I am an expert at finding just the right spot.._13vc2l6.mp4 from submission 13vc2l6 already exists, continuing
[2023-05-31 09:50:39,106 - bdfr.downloader - INFO] - Downloaded submission 13vc2l6 from thatsthespot
[2023-05-31 09:50:39,106 - bdfr.downloader - DEBUG] - Attempting to download submission 13vnb2c
[2023-05-31 09:50:39,106 - bdfr.downloader - DEBUG] - Using Imgur with url https://i.imgur.com/SScXYtM.gifv
[2023-05-31 09:50:39,306 - bdfr.downloader - ERROR] - Site Imgur failed to download submission 13vnb2c: Server responded with 404 to https://imgur.com/SScXYtM

michaeljaeger95 commented 1 year ago

I also cannot download anything from Imgur, regardless of file type, even though the links work as intended in the browser.

For me, however, the error (for every download) is:

Site Imgur failed to download submission xxxxxx: server responded with 404 to https://api.imgur.com/3/image/yyyyyyy

ElleEllie commented 1 year ago

I can confirm this as well: I can't download anything from Imgur either.

Barborica-Alexandru commented 1 year ago

Yes, and navigating to the link in a browser reveals the reason: error | "Authentication required". It would seem the API has been gated. BDFR reports a 404 even though the actual error is a 401. That might be a separate bug in the code, unrelated to this issue.

OMEGARAZER commented 1 year ago

The reason you're getting a 401 from that link is the same one I mention in #828: you're missing the auth headers needed to access that API link.

As for the rest of the issue at hand: a lot of content is being removed from Imgur right now. It seems to be removed from the API first, and the direct file links will sometimes keep working for a while afterwards. You can work around this for direct links with an edit to download_factory, but I would not advise it long term: any dead link will just pick up the "removed" image and treat it as a successful download. Also, any malformed link provided by the Reddit API can end up downloading the HTML of the 404 page, because the downloader will not see the redirect and will think it is getting the right file. That is the main reason the change to use the API was made in the first place.

If you are willing to run with those caveats, or are willing to double-check them all, here is the patch:
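One way to double-check for the second caveat (an HTML 404 page saved under a media file name) is to look at the first bytes of each downloaded file. This is only a minimal sketch: `looks_like_html` is a hypothetical helper, not part of BDFR, and the check is a heuristic that assumes the error page starts with a doctype or an `<html>` tag.

```python
def looks_like_html(data: bytes) -> bool:
    """Heuristic: True if the payload starts like an HTML page rather than media.

    Hypothetical helper for illustration; BDFR does not ship this function.
    """
    # Only inspect the start of the payload, ignoring leading whitespace and case.
    head = data[:512].lstrip().lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

Running it over the first few hundred bytes of each saved file should flag pages that were stored as if they were images or videos.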

change this:

        if re.match(r"(i\.|m\.|o\.)?imgur", sanitised_url):
            return Imgur
        elif re.match(r"(i\.|thumbs\d{1,2}\.|v\d\.)?(redgifs|gifdeliverynetwork)", sanitised_url):
            return Redgifs
        elif re.match(r"(thumbs\.|giant\.)?gfycat\.", sanitised_url):
            return Gfycat
        elif re.match(r".*/.*\.[a-zA-Z34]{3,4}(\?[\w;&=]*)?$", sanitised_url) and not DownloadFactory.is_web_resource(
            sanitised_url
        ):
            return Direct

to this:

        if re.match(r"(i\.|thumbs\d{1,2}\.|v\d\.)?(redgifs|gifdeliverynetwork)", sanitised_url):
            return Redgifs
        elif re.match(r"(thumbs\.|giant\.)?gfycat\.", sanitised_url):
            return Gfycat
        elif re.match(r".*/.*\.[a-zA-Z34]{3,4}(\?[\w;&=]*)?$", sanitised_url) and not DownloadFactory.is_web_resource(
            sanitised_url
        ):
            return Direct
        elif re.match(r"(i\.|m\.|o\.)?imgur", sanitised_url):
            return Imgur
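To see why the order of the branches matters, here is a simplified, standalone sketch of first-match dispatch using the two relevant patterns from the patch. It assumes the URL has already had its scheme stripped, and it leaves out the `is_web_resource` check; `pick`, `ORIGINAL` and `PATCHED` are hypothetical names for illustration, not BDFR code.

```python
import re

# Patterns copied from the patch above.
IMGUR = r"(i\.|m\.|o\.)?imgur"
DIRECT = r".*/.*\.[a-zA-Z34]{3,4}(\?[\w;&=]*)?$"


def pick(url, rules):
    """Return the name of the first rule whose pattern matches the URL."""
    for name, pattern in rules:
        if re.match(pattern, url):
            return name
    return None


# Hypothetical, reduced rule lists: only the two branches that changed places.
ORIGINAL = [("Imgur", IMGUR), ("Direct", DIRECT)]
PATCHED = [("Direct", DIRECT), ("Imgur", IMGUR)]

print(pick("i.imgur.com/DnZYrnB.gifv", ORIGINAL))  # Imgur
print(pick("i.imgur.com/DnZYrnB.gifv", PATCHED))   # Direct
print(pick("imgur.com/DnZYrnB", PATCHED))          # Imgur
```

With the patched order, links that end in a file extension are handled as direct downloads, while extension-less Imgur links still fall through to the Imgur branch and therefore still depend on the API.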

Any gifv links will download as .gifv files with that change. If you would like them downloaded as mp4, you can insert the two new lines (the if statement and its body) into downloader at line 96:

        try:
            if submission.url.endswith(".gifv"):
                submission.url = submission.url.replace(".gifv", ".mp4")
            downloader_class = DownloadFactory.pull_lever(submission.url)
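Note that `str.replace` rewrites every occurrence of ".gifv" in the URL, not just the extension at the end. If that matters for your links, a stricter variant could touch only the trailing suffix. This is a sketch; `gifv_to_mp4` is a hypothetical helper name, not something that exists in BDFR.

```python
def gifv_to_mp4(url: str) -> str:
    """Swap a trailing .gifv extension for .mp4 and leave other URLs untouched."""
    suffix = ".gifv"
    if url.endswith(suffix):
        # Cut off only the final ".gifv" and append the new extension.
        return url[: -len(suffix)] + ".mp4"
    return url
```

For the typical i.imgur.com links in the log above, both approaches give the same result.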

These edits are provided as-is and I won't be providing additional support for them.

Barborica-Alexandru commented 1 year ago

Oh, I understand now. Some of the submissions were very recent, so I hadn't considered that they could already have been removed.

AlexTu2 commented 1 year ago

@OMEGARAZER

Regarding "or are willing to double-check them all": is there a way to figure out which files need to be double-checked? And then a way to save the corresponding file to the right location, with the right name and all?

miguel7501 commented 1 year ago

@AlexTu2

bdfr has the --no-dupes option that promises to avoid downloading the same image/video twice by comparing hashes. Since the 'removed' image is the same every time, that option catches it. You'll get just one copy of it, and bdfr will skip all other posts whose content was removed by Imgur.

I'm currently re-downloading my saved posts with this fix and the --no-dupes option. The log displays "Resource hash d835884373f4d6c8f24742ceabe74946 from submission downloaded elsewhere" messages every now and then, so I'm confident it's working.
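For files that are already on disk, the same idea can be applied after the fact by hashing every file and grouping identical ones. A rough sketch, assuming MD5 (which matches the 32-character hash in the log message above); `group_by_md5` is a hypothetical helper, not a BDFR function.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def group_by_md5(root):
    """Map each MD5 digest to the files sharing it, keeping only duplicated content."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            # Reads the whole file into memory; fine for a sketch, not for huge files.
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            groups[digest].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

A large group of files sharing one digest is a likely sign of the 'removed' image, though genuine reposts will show up as duplicates too.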

Serene-Arc commented 1 year ago

Plus, the images are all exactly the same (absurdly low) size. It's easy to use a tool like find to get them all.
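If you would rather stay in Python than use find, a size filter could be sketched like this. The size of the 'removed' image is not stated in this thread, so it is left as a parameter you would have to look up from one known example; `files_with_size` is a hypothetical helper name.

```python
from pathlib import Path


def files_with_size(root, size_bytes):
    """Return every file under root whose size is exactly size_bytes."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.stat().st_size == size_bytes
    )
```

Checking one known 'removed' file for its byte size and passing that value in should list the rest.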