RicterZ / nhentai

nhentai doujinshi downloader
http://nhentai.net
MIT License
843 stars 120 forks

Galleries with broken thumbnail urls are generating .webp files with html content. #349

Open maltbeverage opened 1 week ago

maltbeverage commented 1 week ago

After webp images started showing up, I noticed a few galleries were pulling in broken webp images. On closer inspection, the downloaded files contain HTML that shows a 404 error.

For example, in 538028 the first thumbnail references an invalid url:

/galleries/3115455/1t.jpg.webp

Looks like an issue with nhentai. I can remove the .webp extension from the thumbnail url

/galleries/3115455/1t.jpg

and the thumbnail image will load in.

The broken thumbnail links to page 1 of the doujin, which does indeed have a working image:

/galleries/3115455/1.jpg

So this broken thumbnail might be messing up the parsing somehow. When coming across these broken thumbnails, I think it attempts to download

/galleries/3115455/1.webp

which does not exist, the actual file is

/galleries/3115455/1.jpg

and this successfully saves as a .webp file, but the contents are the html of the 404 error.

<html>
<head><title>404 Not Found</title></head>
<body>
<center><h1>404 Not Found</h1></center>
<hr><center>nginx</center>
</body>
</html>
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->
<!-- a padding to disable MSIE and Chrome friendly error page -->

I'm thinking this might be a parsing logic issue if the thumbnail url is somehow used to determine the file extension of the downloaded image file.

This only affects around 5 galleries at the moment.
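
For anyone wanting to check an existing download folder for files like this, here is a rough sketch. It is not part of nhentai; the function name and the list of suffixes are made up for illustration. It just looks for image-named files whose content starts like an HTML page.

```python
from pathlib import Path


def find_html_disguised_as_images(root, suffixes=(".webp", ".jpg", ".png", ".gif")):
    """Return image-named files under root whose content starts like HTML.

    Hypothetical helper, not part of nhentai.
    """
    bad = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        with path.open("rb") as fh:
            # Only the first few bytes are needed to spot an error page.
            head = fh.read(64).lstrip().lower()
        if head.startswith((b"<html", b"<!doctype html")):
            bad.append(path)
    return bad
```

Calling `find_html_disguised_as_images("path/to/downloads")` returns the suspicious files as a list of `Path` objects, which can then be deleted or re-downloaded.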

maltbeverage commented 1 week ago

Forgot to add, I'm on the latest commit, f30ff59.

maltbeverage commented 1 week ago

Found it I think: https://github.com/RicterZ/nhentai/blob/f30ff59b2ba62338bcd3281cd957d1f9b89c705f/nhentai/parser.py#L156

If I'm reading it right, this explains the parsing of the invalid webp url. In this case it's a broken thumbnail url, but if nhentai ever used a different extension for the thumbnail than for the actual doujin image, that might bork the downloader.

I suppose the alternative would be to follow each url on the gallery page and extract the image url one at a time, but that sounds a bit expensive. Maybe error checking for a 404 code when downloading and failing with an error might be a good way to go. I'd rather have a failure on rare edge cases than an archive with missing images.
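
A minimal sketch of that kind of check, assuming the downloader has the HTTP status code and the response body in hand before writing to disk. `DownloadError`, `looks_like_image` and `check_image_response` are hypothetical names, not nhentai's actual code.

```python
class DownloadError(Exception):
    """Raised when a response should not be saved as an image."""


def looks_like_image(data: bytes) -> bool:
    """Check the leading bytes against a few common image signatures."""
    if data.startswith(b"\xff\xd8\xff"):            # JPEG
        return True
    if data.startswith(b"\x89PNG\r\n\x1a\n"):       # PNG
        return True
    if data.startswith((b"GIF87a", b"GIF89a")):     # GIF
        return True
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    return data[:4] == b"RIFF" and data[8:12] == b"WEBP"


def check_image_response(status_code: int, body: bytes, url: str) -> bytes:
    """Return body unchanged if it is safe to save, else raise DownloadError."""
    if status_code != 200:
        raise DownloadError(f"{url} returned HTTP {status_code}")
    if not looks_like_image(body):
        raise DownloadError(f"{url} did not return image data")
    return body
```

The second check is there in case an error page ever comes back with a 200 status; with only the status check, the 404 page above would already be rejected.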

RicterZ commented 1 week ago

I'll check it out

RicterZ commented 1 week ago

Need more sample doujinshi, https://t5.nhentai.net/galleries/3115455/1t.jpg.webp returns 403

RicterZ commented 1 week ago

After some investigations, I found that:

Need to determine whether it is an isolated case or the norm.

maltbeverage commented 1 week ago

Here are all of the codes with bad thumbnail image urls. I've scraped everything released recently to check.

538005 538006 538020 538028 538045

Thanks for taking a look.

maltbeverage commented 1 week ago

I noticed some more galleries with issues:

538053 538058 538063 538087 538088 538090 538098 538148 538159

Looks like this'll be an issue until it's fixed on the nhentai side.

Here is a quick and dirty workaround if anyone needs to get this working:

https://github.com/maltbeverage/nhentai/commit/ea52cff2ad5eaad6de8aa71acdb52317dc78cd02

This should split the filename on the two extensions and then use the first one. It's probably introducing new edge cases, but it works for now.
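
The commit itself is not reproduced here. As a rough illustration of the idea only, with a made-up function name and no claim to match the actual change, taking the first extension from a thumbnail url could look like this:

```python
from posixpath import basename
from urllib.parse import urlparse


def image_ext_from_thumb(thumb_url: str) -> str:
    """Return the first extension in a thumbnail filename.

    Hypothetical helper: "1t.jpg.webp" gives "jpg", "1t.webp" gives "webp".
    """
    name = basename(urlparse(thumb_url).path)  # e.g. "1t.jpg.webp"
    parts = name.split(".")
    if len(parts) < 2:
        raise ValueError(f"no extension in thumbnail url: {thumb_url}")
    # parts[0] is the stem ("1t"); parts[1] is the first extension.
    return parts[1]
```

A filename with an extra dot in the stem would be one of the edge cases this approach gets wrong.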

DeadlyShadow71 commented 1 week ago

Not sure if my problem is related; if not, please ignore and I will open a new issue. Since the nhentai downtime earlier this week I can't download certain doujinshi. I only updated the script earlier today to try to fix it, but it keeps failing in the same way.

[11:56:00] doujinshi_parser: Fetching doujinshi information of id 538003
[11:56:01] doujinshi_parser: Tried yo get image id failed

It stays there for some time and then just dies. I use the favorites method, but even when trying to download just that one, I get the same result. I'm not knowledgeable enough in either Python or the script's interaction with the site to know what might cause it, so I'm not sure if the same workaround would work for me.

maltbeverage commented 1 week ago

Not the same issue as this one. nhentai started using webp images, which the parser did not support until https://github.com/RicterZ/nhentai/commit/f30ff59b2ba62338bcd3281cd957d1f9b89c705f was committed a couple of days ago.

If you git clone and install from source, it should work. I'm not sure if this fix has been pushed out to any other install methods.

DeadlyShadow71 commented 1 week ago

I'm using the nhentaiGUI, so I just edited the couple of lines in the files I have, works now, thanks for the info and fix. Solved on my part.

poohzaza166 commented 3 days ago

I have the same problem. It seems to happen with any doujin uploaded recently, e.g. 538703, but when running the parser module as a standalone Python file the code seems to run normally:

print(doujinshi_parser("538703"))