bibanon / BASC-Archiver

Python-based Imageboard (4chan) complete thread archiver.
https://pypi.python.org/pypi/BASC-Archiver/

[Suggestion] Make a list of already downloaded files in a thread so as not to download them again #32

Open HASJ opened 8 years ago

HASJ commented 8 years ago

I routinely run a dupe check, which once freed up as much as 9 GB, and it is weird that the archiver can't detect those duplicates itself.

DanielOaks commented 8 years ago

That's odd. My best guess is that these are identical images posted in different threads. 4chan obviously doesn't let you repost the same image; with archiving, however, you can end up with 40 copies of the same image because it was reposted in 40 threads on different days.

This is an interesting thing to think about: whether it's worth looking into something along the lines of hard or symbolic links, or something similar. That would need us to store a list of files and at least one or two hashes of each. Will definitely look into it, thanks for making the issue!
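
For illustration only, here is a minimal sketch of the hard-link idea: hash every file under an archive directory, remember the first copy of each hash, and replace later copies with hard links to it. This is not how BASC-Archiver works today; the function names, the choice of SHA-256, and the `.dedupe-tmp` suffix are all assumptions made for the example, and hard links only work when every copy lives on the same filesystem.

```python
import hashlib
import os
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Replace duplicate files under root with hard links to the first copy.

    Hypothetical helper; returns the number of files that were replaced.
    """
    seen = {}  # (size, digest) -> path of the first copy encountered
    replaced = 0
    # sorted() materialises the listing before any file is modified.
    for path in sorted(Path(root).rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        key = (path.stat().st_size, file_digest(path))
        original = seen.setdefault(key, path)
        if original is path:
            continue  # first time this content is seen
        if os.path.samefile(original, path):
            continue  # already a hard link to the first copy
        # Link under a temporary name, then swap it in atomically.
        tmp = path.with_name(path.name + ".dedupe-tmp")
        os.link(original, tmp)
        os.replace(tmp, path)
        replaced += 1
    return replaced
```

Because each thread directory keeps its own file name, the saved thread HTML would keep working without any re-linking; only the on-disk storage is shared.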

jcook12 commented 8 years ago

@HASJ, I'm interested in that dupe check. Are you using MD5s and looping through each file, deleting matches?

HASJ commented 8 years ago

@jcook12 Exactly, and deleting the oldest matches, using an oldie-but-goodie app, DoubleKiller.
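
A rough sketch of that approach in Python, for anyone who would rather script it: group files by MD5, then delete every copy in a group except the most recently modified one. This is not DoubleKiller's actual algorithm; the function names and the `dry_run` flag are made up for the example, and "oldest" is assumed to mean oldest modification time. MD5 is used here only to spot identical files, not for anything security-related.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def delete_older_duplicates(root, dry_run=True):
    """Keep the newest file of each MD5 group and remove the older ones.

    Hypothetical helper. With dry_run=True nothing is deleted; the list
    of files that would be removed is returned either way.
    """
    groups = defaultdict(list)  # digest -> all paths with that content
    for path in Path(root).rglob("*"):
        if path.is_file() and not path.is_symlink():
            groups[md5_of(path)].append(path)

    removed = []
    for paths in groups.values():
        if len(paths) < 2:
            continue  # unique file, nothing to do
        # Oldest first; the last entry (newest) is the one that is kept.
        paths.sort(key=lambda p: (p.stat().st_mtime, str(p)))
        for old in paths[:-1]:
            if not dry_run:
                old.unlink()
            removed.append(old)
    return removed
```

Running it once with the default `dry_run=True` and checking the returned list before deleting anything is probably wise, since deleted images will leave broken references in any saved thread HTML.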

jcook12 commented 8 years ago

@HASJ Awesome app, thanks! I will integrate this into my board ripper. I am curious whether you have addressed re-linking the deleted thumbs/pics in the relevant thread HTML file (if you keep the markup structure, that is).