9-FS / nhentai_archivist

downloads hentai from nhentai.net and converts to CBZ
MIT License

[Idea] Use web.archive.org as a possible final fallback for images #39

Open shinji257 opened 3 days ago

shinji257 commented 3 days ago

In the event that the API reports that a manga exists but you still get 404s for its images, try using web.archive.org as well to get those images. This can happen if the item in question still existed while the API data was being retrieved but was subsequently deleted, so now it only exists in the db. I had recently found (while the service was up) that such deleted mangas can still exist on web.archive.org. You already have the metadata by this point (usually), so I'm thinking that hitting up the site for the images may be viable once they stabilize things.

I think (from what I understand) you can prefix the full URL with https://web.archive.org/web, like https://web.archive.org/web/https://i.nhentai.net/galleries/819208/4.jpg, and it will grab the most recent copy that the service has, but I won't be able to test implementation and viability until the service is back up and running again. Apparently it went back down again today.

The actual image URL is something like https://web.archive.org/web/{datecode}if_/https://i.nhentai.net/galleries/819208/4.jpg. It seems that requesting this URL with an arbitrary datecode will redirect to the same place, replacing the placeholder with the correct value:

https://web.archive.org/web/00000000000000if_/https://i.nhentai.net/galleries/819208/4.jpg
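A minimal sketch of what that prefixing could look like, assuming the reqwest crate (with the `blocking` feature); the function name and the use of a blocking client are illustrative and not taken from the actual codebase:

```rust
use reqwest::blocking::Client;

/// Prefix the original image URL with the web.archive.org scheme described above.
/// The all-zero datecode is a placeholder; the archive redirects to its newest snapshot.
fn wayback_image_url(original: &str) -> String {
    format!("https://web.archive.org/web/00000000000000if_/{original}")
}

fn main() -> Result<(), reqwest::Error> {
    let client = Client::new(); // follows redirects by default, so the placeholder datecode resolves
    let url = wayback_image_url("https://i.nhentai.net/galleries/819208/4.jpg");
    let response = client.get(&url).send()?;
    println!("{} -> {}", url, response.status());
    Ok(())
}
```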

shinji257 commented 3 days ago

Testing my implementation using ID 135474. This ID was deleted from the website but still remains in my db from the last pass.
Result: success
Log: https://gist.github.com/shinji257/8a7ad40ad18f196edd85ebd3fbf6bf72

9-FS commented 3 days ago

This is a pretty cool idea. In my experience though, the actual images don't get deleted from the media servers. It's only the metadata / gallery information that gets purged, so only that would need to be rerouted to the web archive. This should also greatly reduce the strain on the web archive.
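For illustration, rerouting only the metadata would just mean wrapping the gallery API request the same way; the endpoint below is an assumption about nhentai's API, not a quote from the project's code:

```rust
// Illustrative only: if only the gallery metadata gets purged, the same prefix trick
// could be applied to the API URL instead of the image URLs.
fn wayback_gallery_api_url(gallery_id: u32) -> String {
    format!("https://web.archive.org/web/00000000000000/https://nhentai.net/api/gallery/{gallery_id}")
}
```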

I will put this on the pile of features I want to implement in the future.

shinji257 commented 3 days ago

The code I submitted as a possible PR only uses the web archive in the event that all media servers fail. At least I think so; it appears to be that way based on the log output above, as I can see it cycling through the media servers before it goes there. Anyways, thanks for your reply. ;)
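A rough sketch of that ordering, assuming the reqwest crate; the server hostnames and helper name are assumptions for illustration, not the actual PR code:

```rust
use reqwest::blocking::Client;

// assumed media server hostnames
const MEDIA_SERVERS: &[&str] = &["i1.nhentai.net", "i2.nhentai.net", "i3.nhentai.net"];

/// Try every regular media server first; only fall back to web.archive.org if all of them fail.
fn download_image(client: &Client, media_id: u32, filename: &str) -> Option<Vec<u8>> {
    // cycle through the regular media servers first
    for server in MEDIA_SERVERS {
        let url = format!("https://{server}/galleries/{media_id}/{filename}");
        if let Ok(resp) = client.get(&url).send() {
            if resp.status().is_success() {
                return resp.bytes().ok().map(|b| b.to_vec());
            }
        }
    }

    // last resort: ask the web archive for its most recent snapshot
    let url = format!("https://web.archive.org/web/00000000000000if_/https://i.nhentai.net/galleries/{media_id}/{filename}");
    let resp = client.get(&url).send().ok()?;
    if resp.status().is_success() { resp.bytes().ok().map(|b| b.to_vec()) } else { None }
}
```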

9-FS commented 3 days ago

Btw, I would like to express my gratitude for your willingness to contribute and for being so active on this project. I just have a lot going on career-wise at the moment, but I will come back to this project eventually for all the enhancements that have piled up!

shinji257 commented 2 days ago

So this can work, but it doesn't always work. In fact, I'm getting a pretty low success rate, and some failures are partial pulls. If this gets added, it should fail fast by stopping after X number of errors so it doesn't keep spamming IA when it can't get the whole gallery from there anyway.
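A sketch of that fast-fail threshold; `MAX_ARCHIVE_ERRORS` and `fetch_from_archive` are hypothetical names standing in for whatever the real implementation would use:

```rust
const MAX_ARCHIVE_ERRORS: u32 = 5; // assumed threshold, the "X" mentioned above

/// Abort the web-archive pass for a gallery after a fixed number of failed pages,
/// so incomplete snapshots don't keep hammering IA.
fn download_gallery_from_archive(page_urls: &[String]) -> Vec<Option<Vec<u8>>> {
    let mut errors = 0;
    let mut pages = Vec::with_capacity(page_urls.len());

    for url in page_urls {
        match fetch_from_archive(url) {
            Some(bytes) => pages.push(Some(bytes)),
            None => {
                errors += 1;
                pages.push(None);
                if errors >= MAX_ARCHIVE_ERRORS {
                    // fast fail: the archive evidently doesn't have a complete copy
                    break;
                }
            }
        }
    }
    pages
}

// hypothetical helper that would hit web.archive.org for a single page
fn fetch_from_archive(_url: &str) -> Option<Vec<u8>> {
    None // placeholder; a real implementation would issue the HTTP request
}
```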