Liru / tumblr-downloader

A command-line program that scrapes tumblr blogs, and downloads images and videos from several at once.
MIT License

Make hardlink scan optional #43

Open apoapostolov opened 8 years ago

apoapostolov commented 8 years ago

I have migrated my Tumblr downloads to a new drive, which turned every hardlink into a duplicate file. This seems to be unavoidable when copying content with Windows Explorer. I now use a tool called Duplicate Commander to analyze the copied content, delete duplicates, and restore hardlinks. The whole process is frighteningly slow: 5.2 million files may take multiple days to sift through.

This got me thinking: were the countless hours of waiting (even 1.5.3 takes 30-40 minutes before it starts) and the HDD damage done by checking all files before launching the downloader worth it in the end? Just to save a few hundred gigabytes in my case, maybe, but right now I wish I had the option to turn it off.

So I am asking for hardlink scanning to be made optional: on by default, but heavy users like me, who download 400+ blogs totaling several terabytes of data, could turn it off so the app works fast, and rely on other tools to clean up duplicates on a regular basis.

apoapostolov commented 8 years ago

So 3000 minutes later, my 2 TB Tumblr archive is down to 0.9 GB of unique content; the rest is to be re-hardlinked via Duplicate Commander.

I know you put a lot of thought and work into the hardlinking functionality, but as a power user I must be one of those guys who stretch the limits of the implementation, and I see more disadvantages than advantages to it. Please make it optional; it's desperately needed on my side.

apoapostolov commented 7 years ago

Thank you for considering this.

Off-topic: Duplicate Commander proved to be a very poor choice for cleaning and hardlinking. It nearly destroyed my HDD and created a corrupted Recycle.Bin with 1 million 0-byte files that took a night to clear.

I switched to CloneSpy (http://www.clonespy.com/), which can be set up to match by name/size only to save time. It can also build a checksum database to use for matching on later runs, skip the Recycle.Bin, and be limited to checking only files downloaded in the last X days instead of all files in a directory. Really powerful freeware. Highly recommended along with Tumblr Downloader.
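
For anyone who would rather script this kind of cleanup, here is a rough sketch of the general idea: group files by size first, checksum only the ones whose sizes collide, then replace duplicates with hardlinks. This is not how CloneSpy or Tumblr Downloader work internally. The function names (`file_digest`, `find_duplicates`, `relink`) and the `max_age_days` parameter are made up for illustration, and the age filter is a simplification that skips older files entirely instead of comparing new files against the whole archive.

```python
import hashlib
import os
import time
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root, max_age_days=None):
    """Return sorted groups of paths under root that have identical content.

    Files are bucketed by size first, so a checksum is computed only for
    files that share their size with at least one other file.
    max_age_days (hypothetical option) ignores files modified earlier.
    """
    cutoff = None
    if max_age_days is not None:
        cutoff = time.time() - max_age_days * 86400

    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            if cutoff is not None and st.st_mtime < cutoff:
                continue
            by_size[st.st_size].append(path)

    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # unique size, cannot be a duplicate
        by_hash = defaultdict(list)
        for p in paths:
            by_hash[file_digest(p)].append(p)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return sorted(groups)


def relink(group):
    """Replace every file after the first with a hardlink to the first."""
    keep = group[0]
    for dup in group[1:]:
        if os.path.samefile(keep, dup):
            continue  # already the same file on disk
        tmp = dup + ".tmp-link"
        os.link(keep, tmp)    # create the new link next to the duplicate
        os.replace(tmp, dup)  # then swap it in over the duplicate
```

Typical use would be something like `for g in find_duplicates("D:/tumblr"): relink(g)` (the path is just an example). Hardlinks only work within a single volume, and it is worth trying this on a copy of a small folder before pointing it at a multi-terabyte archive.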