Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

Duplicates are being fully downloaded before being identified as duplicates. #139

Closed: StuffonGithub closed this issue 3 years ago

StuffonGithub commented 4 years ago

Is there a way to make the CRC check, filename check, or whatever the software uses to check for dupes run before the whole file is actually downloaded?

Because of some host-side problems with gfycat and redgifs, rechecks are taking longer than usual: more often than not, Bulk Downloader gets stuck downloading a file only to then find out that the file already exists.

Thanks.

aliparlakci commented 4 years ago

Usually, it skips a post if there is already a file with a matching filename. If you change the filename template, this check will not work.

However, the program can keep track of downloaded files if you pass a text file path through the --downloaded-posts option. It saves each downloaded post's ID and hash to that file. If it encounters a post whose ID is already in the file, it skips it. This way, already downloaded posts are skipped before being re-downloaded.

But if the posts are different and only the files are the same, there is no way the script can identify the duplicates without downloading them first. If you provide the --no-dupes option, it deletes a file after downloading if its hash happens to be in the file given to --downloaded-posts. If you do not provide --downloaded-posts, --no-dupes only searches among the hashes of the files downloaded by the current run of the script.
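
To make that concrete, here is a rough sketch of the hash-based check in Python. It is not the project's actual code; the function names and the record file format are made up for illustration, and record_path stands in for the file passed to --downloaded-posts.

```python
import hashlib
import os


def file_hash(path):
    """SHA-256 digest of a file on disk (the project may use a different hash)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def handle_downloaded_file(post_id, downloaded_file, record_path):
    """The --no-dupes idea: delete the file if its hash was seen before,
    otherwise remember the post ID and hash for the next run."""
    try:
        with open(record_path) as handle:
            seen = {line.strip() for line in handle if line.strip()}
    except FileNotFoundError:
        seen = set()
    digest = file_hash(downloaded_file)
    if digest in seen:
        os.remove(downloaded_file)  # same bytes as an earlier download
        return False
    with open(record_path, "a") as handle:
        handle.write(f"{post_id}\n{digest}\n")
    return True
```

The key point is that the hash can only be checked once the bytes are already on disk, which is exactly the limitation being discussed.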

linxchaos commented 4 years ago

I've fixed this in my repo. I added the URL of the post to the downloaded-posts lambda function. I need to push my latest changes too.

StuffonGithub commented 4 years ago

@aliparlakci I do have --downloaded-posts and --no-dupes in my commands, so that's probably it. I'm no programmer, obviously, so I don't know the limitations here, but would it not be possible to keep track of the direct links of the download sources (i.e. i.imgur.com/R4nD0msTuFF.jpg) and use those to check for previously downloaded posts, as opposed to having to download the whole file just to get a hash?

I know you already said this is out of your control, but I'm still getting problems with gfycat and redgifs where the downloads just won't continue at all. And when I check the file it's stuck downloading, more often than not it's already a duplicate.
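
Purely as an illustration of the direct-link idea above, assuming a plain text record of URLs (the helper names are made up, nothing here is from the actual project):

```python
def already_downloaded(source_url, record_path):
    """True if this direct link (e.g. i.imgur.com/R4nD0msTuFF.jpg) was recorded before."""
    try:
        with open(record_path) as handle:
            return source_url in {line.strip() for line in handle}
    except FileNotFoundError:
        return False


def remember_url(source_url, record_path):
    """Append the direct link to the record after a successful download."""
    with open(record_path, "a") as handle:
        handle.write(source_url + "\n")
```

The trade-off, as Ali points out further down, is that the content behind a URL can change between runs, so a URL match is not a guarantee the file is identical.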

StuffonGithub commented 4 years ago

@linxchaos I'm not awfully familiar with how stuff works on GitHub, but I just read your commit notes. I'm really optimistic about the first part, fixing the freezing problem. And I feel like I just stole your idea when I replied to Ali, but if at least these two things are addressed, it'd be a huge QoL improvement, since there's really no other alternative out there that I can find.

linxchaos commented 4 years ago

I'm still getting issues with gfycat and redgifs, but it tries three times and, if that doesn't work, it continues on.
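
That retry-then-move-on behaviour is roughly the following pattern; this is a generic sketch with the requests library, not the actual patch in the fork.

```python
import requests


def fetch_with_retries(url, attempts=3, timeout=30):
    """Try a download a few times; return None so the run can continue instead of hanging."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.content
        except requests.RequestException:
            if attempt == attempts:
                return None  # give up on this post and move on
    return None
```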

StuffonGithub commented 4 years ago

@linxchaos Ah, at the very least it will automatically skip it and just not get stuck while you're soundly sleeping. Lol.

I'm wondering about a potential method, though I'm not exactly sure how to go about it. I just read that JDownloader2 is starting to support Reddit, but not to the extent that Bulk Downloader does (full search results, users, and subreddits). What I have in mind is some sort of hybrid process.

So I was thinking of something along the lines of a Reddit link parser that just collects all the links that appear in a search, user profile, subreddit, etc., the way you can with the parameters in Bulk Downloader, including file naming and so on, but then imports all those links into JD2 to manage the downloads, directories, and all the other applicable bits. My main reasoning is that I haven't encountered the freezing problem I've had in Bulk Downloader with JD2 so far.

It's basically just a link parser: collect all the links in a user profile or subreddit. Importing them into JD2 can be done manually, since that's quite easy to set up and do.
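
A rough sketch of that link-parser idea using PRAW, purely illustrative: the credentials, subreddit, and output file are placeholders, and the resulting text file would be imported into JD2 by hand.

```python
import praw

# Placeholder credentials; a real run needs a Reddit app's ID and secret.
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="link-collector sketch",
)

# Collect direct links from a subreddit's newest posts and dump them to a file.
with open("links_for_jd2.txt", "w") as handle:
    for submission in reddit.subreddit("EarthPorn").new(limit=100):
        if not submission.is_self:  # skip self/text posts, keep direct links
            handle.write(submission.url + "\n")
```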

StuffonGithub commented 4 years ago

Hey, @linxchaos. Sorry for pinging you, but I just wanted to ask if there's a way to use Bulk Downloader with your commit? It looks like it will take a long time before the owner adds it to the main release.

linxchaos commented 4 years ago

You can run it from source, which is almost exactly the same as running it from the EXE, just with more files.

You download the whole thing, then run script.py instead of the EXE file. Otherwise, all the same.
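
Assuming the fork keeps the upstream layout (a top-level script.py plus a requirements.txt), running from source looks roughly like this; replace <fork-url> with whichever fork you want to use:

```
git clone <fork-url> bulk-downloader-for-reddit
cd bulk-downloader-for-reddit
python -m pip install -r requirements.txt
python script.py --help
```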

StuffonGithub commented 4 years ago

> You can run it from source, which is almost exactly the same as running it from the EXE, just with more files.
>
> You download the whole thing, then run script.py instead of the EXE file. Otherwise, all the same.

Thanks. I'm pretty dumb at this stuff. Every time I run script.py, it just opens a terminal and then closes it after half a second. I can't see any other way to run it from the context menu, and trying to run it manually through PowerShell isn't working either.

Sorry for basically asking you for support even though this isn't your application. Is there anything I might be doing wrong? I have Python 3.8.5 installed, btw.

linxchaos commented 4 years ago

Sorry! I've only just had time to answer this. You run script.py from the command line with python, passing your arguments. For example:

python .\script.py --directory C:\Downloads\ --user spez --submitted --limit 1000 --sort new --no-dupes --skip video --downloaded-posts C:\RedditDownloaderLogs\spez_POSTS.txt --quit

StuffonGithub commented 4 years ago

> Sorry! I've only just had time to answer this. You run script.py from the command line with python, passing your arguments. For example:
>
> python .\script.py --directory C:\Downloads\ --user spez --submitted --limit 1000 --sort new --no-dupes --skip video --downloaded-posts C:\RedditDownloaderLogs\spez_POSTS.txt --quit

No problem at all. You can answer whenever you can. :)

I downloaded the master from your fork but I'm getting:

Traceback (most recent call last):
  File ".\script.py", line 16, in <module>
    from prawcore.exceptions import InsufficientScope
ModuleNotFoundError: No module named 'prawcore'

whenever I run it.

Sorry, I'm really dumb with these things.

aliparlakci commented 4 years ago

I hate gfycat and its stupid side project redgifs. Redgifs does not even work properly in the browser.

aliparlakci commented 4 years ago

The reason I am not implementing a URL recognition feature is that I cannot be sure the content at that URL has stayed the same since the last download.

StuffonGithub commented 3 years ago

> I hate gfycat and its stupid side project redgifs. Redgifs does not even work properly in the browser.

I hate redgifs too. I'm not sure if you're still updating or looking into this, but I saw that you can get a GIF's direct link from RedGIFs. I'm not sure how yet (I just saw both links shared), but maybe it'll help with getting downloads to work more easily?

Here's the link (NSFW btw): https://www.redgifs.com/watch/deafeningpointeddachshund

And the direct link: https://thcf1.redgifs.com/DeafeningPointedDachshund.mp4
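
One hedged guess at how that direct link could be pulled out automatically: fetch the watch page and look for an .mp4 URL in the HTML. This assumes the page embeds the direct link in its markup, which may not hold if the site renders it client-side or requires its API.

```python
import re

import requests


def find_mp4_link(watch_url):
    """Return the first .mp4 URL found in the watch page's HTML, or None."""
    page = requests.get(watch_url, timeout=30)
    page.raise_for_status()
    match = re.search(r"https?://\S+?\.mp4", page.text)
    return match.group(0) if match else None
```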

ComradeEcho commented 3 years ago

I've made a PR to fix an issue with the program not recording duplicate files as "downloaded" when using --downloaded-posts. This change will prevent them from being re-downloaded on subsequent runs.

I'm not sure if there are any unintended side effects to how I fixed this, but I would love to hear if there are any issues with this change.
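
The gist of the change, as a rough sketch rather than the actual PR code: record the post ID even when the freshly downloaded file turns out to be a duplicate and is deleted, so later runs skip it up front.

```python
import os


def finish_post(post_id, file_path, digest, seen_hashes, record_path):
    """Record the post whether or not its content turned out to be a duplicate."""
    if digest in seen_hashes:
        os.remove(file_path)  # same bytes as an earlier download
    else:
        seen_hashes.add(digest)
    with open(record_path, "a") as handle:
        handle.write(post_id + "\n")  # recorded either way, so it is skipped next run
```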
