Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

Resource hash from submission downloaded elsewhere [BUG?] #957

Closed. KayJay95 closed this issue 4 months ago

KayJay95 commented 6 months ago

It's not really a bug; however, some files seem to download to an unknown location.

I just wanted to ask where these end up.

Description

When I run the download for Reddit, I sometimes get this message:

[2024-05-06 16:18:35,535 - bdfr.downloader - INFO] - Resource hash 3212b939476797933c6480962e92413f from submission 1be2qd3 downloaded elsewhere

This is just one example of the messages I get; the hash and the submission ID change each time.

Command

python3 -m bdfr download F:\Bulk-Download\Reddit\images --opts opts_image.yaml --search-existing --no-dupes

.yaml file

skip: [mp4, avi, mov, gif]
time: all
upvoted: true
authenticate: true
user: [me]

Environment

Logs

This is long as I downloaded about 100 images, sorry in advance 😅 (I've deleted a section in the middle so it's not too long)

[2024-05-06 15:58:43,036 - bdfr.connector - DEBUG] - Disabling the following modules: 
[2024-05-06 15:58:43,036 - bdfr.connector - Level 9] - Created download filter
[2024-05-06 15:58:43,036 - bdfr.connector - Level 9] - Created time filter
[2024-05-06 15:58:43,037 - bdfr.connector - Level 9] - Created sort filter
[2024-05-06 15:58:43,065 - bdfr.connector - Level 9] - Create file name formatter
[2024-05-06 15:58:43,066 - bdfr.connector - DEBUG] - Using authenticated Reddit instance
[2024-05-06 15:58:43,516 - bdfr.oauth2 - Level 9] - Loaded OAuth2 token for authoriser
[2024-05-06 15:58:43,964 - bdfr.oauth2 - Level 9] - Written OAuth2 token from authoriser to C:\Users\Admin\AppData\Local\BDFR\bdfr\default_config.cfg
[2024-05-06 15:58:44,485 - bdfr.connector - Level 9] - Resolved user to DonOwU
[2024-05-06 15:58:44,485 - bdfr.connector - Level 9] - Created site authenticator
[2024-05-06 15:58:44,485 - bdfr.connector - Level 9] - Retrieved subreddits
[2024-05-06 15:58:44,486 - bdfr.connector - Level 9] - Retrieved multireddits
[2024-05-06 15:58:44,688 - bdfr.connector - DEBUG] - Retrieving upvoted posts of user DonOwU
[2024-05-06 15:58:44,689 - bdfr.connector - Level 9] - Retrieved user data
[2024-05-06 15:58:44,689 - bdfr.connector - Level 9] - Retrieved submissions for given links
[2024-05-06 15:58:44,816 - bdfr.downloader - INFO] - Calculating hashes for 2498 files
[2024-05-06 16:00:42,353 - bdfr.downloader - DEBUG] - Attempting to download submission 1cksdle
[2024-05-06 16:00:42,354 - bdfr.downloader - DEBUG] - Using Gallery with url https://www.reddit.com/gallery/1cksdle
[2024-05-06 16:00:48,196 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 1
[2024-05-06 16:00:48,200 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 2
[2024-05-06 16:00:48,204 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 3
[2024-05-06 16:00:48,207 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 4
[2024-05-06 16:00:48,210 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 5
[2024-05-06 16:00:48,213 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 6
[2024-05-06 16:00:48,216 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 7
[2024-05-06 16:00:48,219 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 8
[2024-05-06 16:00:48,223 - bdfr.file_name_formatter - Level 9] - Formatting filename with index 9
[2024-05-06 16:00:49,096 - bdfr.downloader - DEBUG] - Written file to F:\Bulk-Download\Reddit\images\yiff\charai1126_I love this artist (waspsalad) [fm]_1cksdle_1.png
[2024-05-06 16:00:49,101 - bdfr.downloader - DEBUG] - Hash added to master list: c7a14645582934865c4eaadc8ce221dc
[2024-05-06 16:00:50,518 - bdfr.downloader - DEBUG] - Written file to F:\Bulk-Download\Reddit\images\yiff\charai1126_I love this artist (waspsalad) [fm]_1cksdle_2.png
[2024-05-06 16:00:50,522 - bdfr.downloader - DEBUG] - Hash added to master list: 83c98b1ab5ee81f5ce44e3732740ab28
[2024-05-06 16:00:51,749 - bdfr.downloader - DEBUG] - Written file to F:\Bulk-Download\Reddit\images\yiff\charai1126_I love this artist (waspsalad) [fm]_1cksdle_3.png
[2024-05-06 16:00:51,753 - bdfr.downloader - DEBUG] - Hash added to master list: 6c488da5e32f5e3214813a2189cdde31
[2024-05-06 16:00:52,156 - bdfr.downloader - DEBUG] - Written file to F:\Bulk-Download\Reddit\images\yiff\charai1126_I love this artist (waspsalad) [fm]_1cksdle_4.jpg

(This section was deleted due to length)

[2024-05-06 16:19:02,743 - bdfr.downloader - DEBUG] - Submission 1bdo83f filtered due to URL https://i.redd.it/cr3qfwu2x2oc1.gif
[2024-05-06 16:19:02,743 - bdfr.downloader - DEBUG] - Attempting to download submission 1bdrkfe
[2024-05-06 16:19:02,744 - bdfr.downloader - DEBUG] - Using Direct with url https://i.redd.it/b0opg9hys3oc1.jpeg
[2024-05-06 16:19:03,637 - bdfr.downloader - INFO] - Resource hash 9e702ad4cd9241cfffca0938612f63be from submission 1bdrkfe downloaded elsewhere
[2024-05-06 16:19:03,638 - bdfr.downloader - DEBUG] - Attempting to download submission 1bd5t9z
[2024-05-06 16:19:03,638 - bdfr.downloader - DEBUG] - Using Direct with url https://i.redd.it/qv555jpndync1.jpeg
[2024-05-06 16:19:04,182 - bdfr.downloader - INFO] - Resource hash 7da50bc88c738a422fb451277cdd05c5 from submission 1bd5t9z downloaded elsewhere
[2024-05-06 16:19:04,183 - bdfr.downloader - DEBUG] - Attempting to download submission 1bd4mgc
[2024-05-06 16:19:04,183 - bdfr.downloader - DEBUG] - Using Direct with url https://i.redd.it/4ndjxdnj4ync1.jpeg
[2024-05-06 16:19:04,616 - bdfr.downloader - INFO] - Resource hash 0f0a6d4729cde5648a29dbebf2844471 from submission 1bd4mgc downloaded elsewhere
[2024-05-06 16:19:04,616 - bdfr.downloader - DEBUG] - Attempting to download submission 1bdzgsb
[2024-05-06 16:19:04,617 - bdfr.downloader - DEBUG] - Using Direct with url https://i.redd.it/6jc8qkyfe5oc1.jpeg
[2024-05-06 16:19:05,141 - bdfr.downloader - INFO] - Resource hash cc1a4c305888be00d7414c8da2cf5add from submission 1bdzgsb downloaded elsewhere
[2024-05-06 16:19:05,141 - bdfr.download_filter - Level 9] - Url "https://i.redd.it/flk54dkwi0oc1.gif" matched with "re.compile('.*(mp4|avi|mov|gif)$')"
[2024-05-06 16:19:05,142 - bdfr.downloader - DEBUG] - Submission 1bdg6w8 filtered due to URL https://i.redd.it/flk54dkwi0oc1.gif
[2024-05-06 16:19:05,142 - bdfr.downloader - DEBUG] - Attempting to download submission 1bdonh8
[2024-05-06 16:19:05,142 - bdfr.downloader - DEBUG] - Using Direct with url https://i.redd.it/btdl9b2j03oc1.png
[2024-05-06 16:19:05,732 - bdfr.downloader - INFO] - Resource hash 3d57a89c70593091edfda274b2bd33e0 from submission 1bdonh8 downloaded elsewhere
[2024-05-06 16:19:05,733 - root - INFO] - Program complete

Saortica commented 6 months ago

Are you sure this isn't the de-dupe in action, i.e. it's saying the file has already been downloaded to that folder and is therefore skipped? I can't test right now, but you should be able to verify by loading the URL of a skipped file and cross-checking it against what came down. If there are too many files/folders to check visually (and assuming the existing copy has a different filename), you could download the file manually and use a duplicate-file utility like dupeGuru.
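If a GUI tool is overkill, the same cross-check can be scripted. The following is only a rough sketch, and it assumes the 32-character hashes in the log are MD5 digests of the file contents, which this thread does not confirm. `md5_of_file` and `find_by_hash` are made-up helper names, not part of BDFR.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_by_hash(root: Path, target_hash: str) -> list[Path]:
    """List every file under root whose MD5 digest equals target_hash."""
    target = target_hash.lower()
    return sorted(
        p for p in root.rglob("*") if p.is_file() and md5_of_file(p) == target
    )
```

Calling `find_by_hash(Path(r"F:\Bulk-Download\Reddit\images"), "3212b939476797933c6480962e92413f")` with the folder from the command and a hash from the log should then list the existing copy, provided the MD5 assumption holds.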

Serene-Arc commented 4 months ago

You're using the --search-existing option (don't, it's not good), so it's searching all of your already-downloaded files and then not writing a second file if it's an exact match. They might be files from different posts or subreddits, but are the same file nonetheless. If you don't want this behaviour, don't use --search-existing and --no-dupes.
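To make the behaviour described above concrete, here is a minimal sketch of the idea, not BDFR's actual code. It assumes MD5 over the raw file bytes, and `hash_existing` and `write_unless_duplicate` are hypothetical names chosen for illustration. The first function corresponds to the "Calculating hashes for 2498 files" step in the log; the second to the decision that produces the "downloaded elsewhere" message.

```python
import hashlib
from pathlib import Path


def hash_existing(root: Path) -> dict[str, Path]:
    """Map the MD5 digest of every file already under root to its path."""
    return {
        hashlib.md5(p.read_bytes()).hexdigest(): p
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def write_unless_duplicate(
    content: bytes, destination: Path, seen: dict[str, Path]
) -> bool:
    """Write content unless identical bytes are already known.

    Returns True if a file was written and False if it was skipped.
    """
    digest = hashlib.md5(content).hexdigest()
    if digest in seen:
        # Same bytes already on disk (possibly from another post or
        # subreddit), so no second copy is written.
        print(f"Skipping {destination.name}: same content as {seen[digest]}")
        return False
    destination.parent.mkdir(parents=True, exist_ok=True)
    destination.write_bytes(content)
    seen[digest] = destination
    return True
```

In this sketch the skipped file is never written anywhere new; "elsewhere" simply means a byte-identical copy already exists somewhere under the download folder.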