Closed: 9-FS closed this issue 1 month ago
I was told to copy this here:
Hey, I have been using your program since you posted it on Reddit. I hope it's OK to request a feature here. The only issue I have had, once you helped me figure out how to use nHA, is the sheer number of duplicates. I don't program, so I have no idea how hard it would be to have the program figure out which images are dupes before downloading them, but I have a feeling that would be a huge headache to implement.
I have an idea for what I think would be an easier workaround on your end. If you could add a flag (I don't know the correct term) in the .env file, we could choose not to have the images put in a cbz file, and instead have each comic put into its own subfolder along with the ComicInfo.xml file. This would make it easy to use a third-party duplicate file finder to scan the subfolders and delete the dupes.
Sure, you would still have to download the dupes, but it then becomes trivial to find and delete them. And sure, it won't delete the xml files for the dupe folders, but once the pics are gone it is as simple as sorting the folders by size and deleting the tiny ones. Then it would just be a matter of figuring out a way to batch-create cbz files from those subfolders after the de-duping.
This is just a thought. I know nothing about programming, so this might be something that isn't easy to do in your program, or hell, you might not feel like adding this kind of feature. Either way, I want to thank you for releasing it for free for everyone. You didn't have to, but you made a lot of people's lives a lot easier by doing so.
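For reference, here is a minimal sketch of the batch re-packing step described above, assuming one subfolder per comic that contains the images plus a ComicInfo.xml; the library path and the stdlib-only approach are illustrative assumptions, not part of the program:

```python
# repack_cbz.py: zip each comic subfolder under LIBRARY into its own .cbz
# (hypothetical third-party helper; the folder layout is an assumption)
import zipfile
from pathlib import Path

LIBRARY = Path("/path/to/library")  # assumed location of the per-comic subfolders

for folder in sorted(p for p in LIBRARY.iterdir() if p.is_dir()):
    cbz_path = folder.parent / (folder.name + ".cbz")
    if cbz_path.exists():
        continue  # skip comics that were already packed
    with zipfile.ZipFile(cbz_path, "w", compression=zipfile.ZIP_DEFLATED) as cbz:
        # add every file in the folder (images and ComicInfo.xml) at the archive root
        for file in sorted(folder.iterdir()):
            if file.is_file():
                cbz.write(file, arcname=file.name)
    print(f"packed {cbz_path.name}")
```

ZIP_DEFLATED is used here only to keep the resulting files small; ZIP_STORED (no compression) would work just as well and ties into the hashing idea discussed further down.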
I also thought of another solution, but I'm not sure if it is possible, due to not knowing how thumbnails are handled. But if there is a program that can take the cbz files' thumbnails and compare them against each other, then letting you keep the cbz with the largest file size and delete the rest seems possible, in my ignorant, uneducated opinion.
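Along the same lines, here is a rough sketch of how a third-party script could group cbz files by their first page and keep only the largest file in each group. It uses an exact byte hash of the first image rather than real thumbnail or perceptual comparison, so it only catches identical covers; all paths and file names are hypothetical:

```python
# dedupe_by_cover.py: group .cbz files whose first image is byte-identical,
# then keep the largest file per group (hypothetical sketch, exact-hash only)
import hashlib
import zipfile
from collections import defaultdict
from pathlib import Path

LIBRARY = Path("/path/to/library")  # assumed location of the .cbz files
IMAGE_EXTS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}

groups = defaultdict(list)
for cbz in LIBRARY.glob("*.cbz"):
    with zipfile.ZipFile(cbz) as zf:
        pages = sorted(n for n in zf.namelist() if Path(n).suffix.lower() in IMAGE_EXTS)
        if not pages:
            continue
        cover_hash = hashlib.sha256(zf.read(pages[0])).hexdigest()
    groups[cover_hash].append(cbz)

for cover_hash, files in groups.items():
    if len(files) < 2:
        continue
    files.sort(key=lambda p: p.stat().st_size, reverse=True)
    keep, *dupes = files
    print(f"keeping {keep.name}, candidates for deletion: {[d.name for d in dupes]}")
    # uncomment to actually delete the smaller duplicates:
    # for d in dupes:
    #     d.unlink()
```

A perceptual hash (via an image-hashing library) would be needed to catch covers that are merely visually similar, which also runs into the false-positive concern raised below.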
Thank you for the ideas, I really appreciate it. :)
Currently, all images are temporarily downloaded to `{LIBRARY_PATH}/{id}`, copied to the CBZ, and then cleaned up. Implementing something like `CLEANUP_TEMPORARY_IMAGES = false` would indeed be trivial.
Speaking for that setting: someone else has already requested it for compatibility with another program, and it would be easy to implement. Speaking against it: it adds more complexity to `./config/.env` for an edge case with few users, and I also don't know yet whether I want my users to have to rely on third-party software for deduplication.
The problem I see with the hash generation approach is that I don't think it is reliable enough. What if 2 unrelated hentai have the same cover because they're from the same magazine, or maybe they screwed up and both happen to have a blank first page? And how do I implement a redirect from the deleted works to the kept work? I would probably need to implement changing the kept work after it has already been stored in the library, and that sounds like a headache.
These are just some thoughts I am having before I decide which direction to go in. Let me know what you think.
When creating the cbz, do you zip with compression or without? If you zip without, you could hash the entire cbz and discard it if it matches another. This would slow things down a bit, I think, but it would keep duplication down if someone was just downloading the entire site.
This would reliably work for complete duplicates, but not for series like Sweet Guy which is uploaded countless times with a varying number of chapters. And then there's still the redirection issue from deleted work to kept work to solve. I would be hesitant to change any library entries after they have been moved into the library, that's why if I decide to do this, the process should be well thought out.
I think I am doing `bzip2` compression at the moment. Can anybody confirm? This should be the relevant source code line, and this the relevant documentation.
That said, I really appreciate your ideas and active participation @billsargent. Thank you. :)
I was looking at the specs for CBZ and here's the breakdown: CBZ should always be zip only, cb7 is 7zip, cba is ace, cbr is rar, cbt is tar. Your code is creating zip files using deflate. If you could disable deflate and just use no compression, then theoretically they should have the same hash.
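One caveat with hashing whole archives, even uncompressed ones: zip files also store per-entry metadata such as modification timestamps, so two cbz files with identical pages are not guaranteed to be byte-identical. A sketch of a third-party check that sidesteps this by hashing the decompressed entry contents in name order instead of the raw cbz bytes (script name and usage are hypothetical):

```python
# content_hash.py: fingerprint a .cbz by its decompressed entry contents,
# ignoring compression method and archive metadata (hypothetical sketch)
import hashlib
import sys
import zipfile

def cbz_content_hash(path: str) -> str:
    """Return a SHA-256 over entry names and decompressed data, in name order."""
    digest = hashlib.sha256()
    with zipfile.ZipFile(path) as zf:
        for name in sorted(zf.namelist()):
            digest.update(name.encode("utf-8"))
            digest.update(zf.read(name))
    return digest.hexdigest()

if __name__ == "__main__":
    # usage: python content_hash.py a.cbz b.cbz ...
    for path in sys.argv[1:]:
        print(cbz_content_hash(path), path)
```

With this, deflate, bzip2, or stored archives of the same pages all produce the same fingerprint, though it still only catches complete duplicates, as already noted above.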
Yeah, as you said, some have different numbers of chapters, but overall I think it would save space. The ones with varying numbers of chapters could then be left up to the user to fix themselves.
See this for how to create zips without compression. I think you are using zipwriter...
oh I can't post links...?
I think I am repeating myself, but I still don't see a way to properly redirect someone from a deleted hentai to the corresponding kept hentai. I think this whole deduplication topic is a can of worms I don't want to touch myself, because I don't feel I could offer a solution that lives up to my reliability standards.
I'm currently seriously thinking about adding a `CLEANUP_TEMPORARY_IMAGES = false` setting that would make deduplication with third-party software easier.
If you could place the ComicInfo.xml in there as well, it would also help with creating the cbzs locally, since that way the metadata is preserved. This would take the burden off you, and others who have shell or Python scripting capabilities could do the deduplication themselves.
Good point. Then how about calling it `CLEANUP_TEMPORARY_FILES` instead?
I agree. That sounds perfect.
Deduplication has been decided to be out of scope for this project for now; the `CLEANUP_TEMPORARY_FILES` setting has been added to facilitate easier deduplication via third-party tools.
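For illustration, a hypothetical excerpt of `./config/.env` with the new setting; the key-value style mirrors the settings quoted above, but the `LIBRARY_PATH` value shown is an assumption, not the program's actual default:

```
# hypothetical excerpt: keep the per-comic image folders and ComicInfo.xml
# instead of cleaning them up, so third-party tools can deduplicate them
LIBRARY_PATH = ./library
CLEANUP_TEMPORARY_FILES = false
```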
https://www.reddit.com/r/DataHoarder/comments/1fg5yzy/comment/ln2efs3/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button