Open mindjek07 opened 2 years ago
I can't believe this is not what the author meant by "avoid duplicates". I ended up with tons of duplicated images simply because they have different titles. This makes the program kind of useless for me. I hope you can add this in the future.
Avoid duplicates actually works by storing downloaded URLs and not re-downloading content at a URL that has previously been downloaded. It has nothing to do with the title.
This issue is not as simple as it appears. Most image/video host sites do not make an MD5 hash, or any hash for that matter, available before content is downloaded. So the content must be downloaded, then hashed, then compared to previously downloaded and hashed content, then deleted if it is found to be a duplicate. This is a feature that I plan to implement in future versions, but it is far from the ideal duplicate avoidance that most users would expect to be possible.
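The download-then-hash-then-compare approach described above could be sketched roughly like this (a minimal illustration, not the project's actual code; the `deduplicate` helper and the chunked-read size are assumptions for the example):

```python
import hashlib
from pathlib import Path

def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large images/videos aren't loaded into memory at once."""
    md5 = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

def deduplicate(download_dir: Path) -> list[Path]:
    """Delete files whose content hash matches an earlier file; return the removed paths."""
    seen: dict[str, Path] = {}
    removed: list[Path] = []
    for path in sorted(download_dir.rglob("*")):
        if not path.is_file():
            continue
        digest = file_md5(path)
        if digest in seen:
            path.unlink()  # same content as an earlier file, even if the title/filename differs
            removed.append(path)
        else:
            seen[digest] = path
    return removed
```

Note that this only detects byte-identical files: the same image re-encoded or resized by a different host would still slip through, which is part of why this falls short of the ideal duplicate avoidance users expect.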
I used to use https://github.com/shadowmoose/RedditDownloader and I'm not sure it downloads the images to know whether they are in fact duplicates. Maybe it does...
Edit: Actually it does, you're correct: https://github.com/shadowmoose/RedditDownloader/blob/62a98c658b5759a2acdbbfa7a58cd6e842aaf71f/redditdownloader/processing/post_processing.py#L17
**Is your feature request related to a problem? Please describe.**
Duplicate images

**Describe the solution you'd like**
Store MD5 hash data of every image