9-FS / nhentai_archivist

downloads hentai from nhentai.net and converts to CBZ
MIT License

[Enhancement] Wait and retry on 404 while getting gallery information from api #3

Closed shinji257 closed 1 month ago

shinji257 commented 1 month ago

It seems that the 404 is the result of nginx throttling connections (429 Too Many Requests) rather than the data just not being there. I tried some of the URLs and was able to get data after waiting a minute. During the period when I couldn't get data from those URLs, I also could not bring up the login screen.
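A minimal sketch of the wait-and-retry behaviour requested here, not the tool's actual code. It assumes a hypothetical `fetch(url)` callable that returns a `(status_code, body)` pair, and it treats 404 as retryable only because of the throttling suspicion described above. The 60-second wait mirrors the "after waiting a minute" observation; `max_attempts` is an arbitrary choice.

```python
import time
from typing import Callable, Optional, Tuple

# Assumption: a 404 from the API may be a disguised rate limit, so it is
# retried just like a real 429.
RETRYABLE = {404, 429}


def fetch_with_retry(
    fetch: Callable[[str], Tuple[int, Optional[str]]],  # hypothetical fetcher
    url: str,
    max_attempts: int = 5,
    wait_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch(url) until it succeeds, fails hard, or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        status, body = fetch(url)
        if status == 200:
            return body
        if status not in RETRYABLE:
            return None  # a different error; waiting is unlikely to help
        if attempt < max_attempts:
            sleep(wait_seconds)  # give the server time to lift the throttle
    return None  # still failing after max_attempts tries
```

Passing `sleep` as a parameter keeps the function easy to try out without actually waiting.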

9-FS commented 1 month ago

Is this a duplicate of issue #2 that you've just opened?

During metadata download I, for example, consistently get error 404 here; it doesn't matter whether I use the tool or open the page manually in the browser. I assume some pages just statically return error 404, and the missed hentai IDs can only be retrieved after waiting some time, once they have shifted to other pages.

shinji257 commented 1 month ago

I click that link and get a bunch of metadata. That's my confusion here as the tool will consistently get a 404 but my browser will work fine. Same cookies provided.

9-FS commented 1 month ago

Weird, I still receive an error 404 both on my desktop via wifi and my phone via mobile internet.

The only thing I can currently recommend to mitigate the symptoms is to set a SLEEP_INTERVAL of at least 50.000 and then do at least 3 rounds of searches on different days. I have no idea how to solve the root issue and will mark this issue closed until someone reopens it with an idea.

wakiwakidon commented 1 month ago

When using the API fails (the API response might be cached on the server side), I think the tool should fall back to scraping the HTML.
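A rough sketch of that fallback idea, under stated assumptions: `api_search` and `html_search` are hypothetical callables that take a page number and return a list of gallery IDs, or `None` when the page could not be retrieved. Whether the HTML search pages actually expose the entries that the API page hides is not confirmed anywhere in this thread.

```python
from typing import Callable, Iterable, List, Optional

# Hypothetical signature: page number in, gallery IDs out, None on failure.
PageSearch = Callable[[int], Optional[List[int]]]


def search_ids_with_fallback(
    api_search: PageSearch,
    html_search: PageSearch,
    pages: Iterable[int],
) -> List[int]:
    """Collect gallery IDs page by page, trying the API first and HTML second."""
    ids: List[int] = []
    for page in pages:
        result = api_search(page)
        if result is None:  # API page failed, e.g. with a 404
            result = html_search(page)
        if result is not None:  # skip the page only if both sources failed
            ids.extend(result)
    return ids
```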

Wolf0006 commented 1 month ago

Hello. I've got an idea for a possible fix, but I need some input from you as well. I'm someone who gets a LOT of these errors:

WARN Downloading hentai metadata page 4 / 7 from "https://nhentai.net/api/galleries/search?query=tag:%22anthology%22+language:%22english%22&page=4" failed with status code 404 Not Found.

I assume I get them more than usual, because I don't see many people complaining about this, but I miss about 10% of downloads because of it. I waited 4 days for all the tags that I got errors on and I still get the same errors on the same pages. Waiting doesn't help.

In this case, I wanted all English anthologies but got the error above. I decided to experiment and also downloaded all Japanese anthologies (as far as the errors allowed). Then I combined English and Japanese anthologies into this: ['tag:"anthology"', '-language:"chinese"']. And I added individual tags of the anthologies: ['tag:"anthology"', 'tag:"sister"', '-language:"chinese"'].

By searching individual tags of anthologies, which previously gave me an error on 13% of works, I was able to download most of those 13%. I'm guessing I changed the order of works in the pages, which allowed me to download them by searching different tags of the same larger group? I assume works get sorted by release date, so when I changed up the tags and added Japanese anthologies, I changed the order. Otherwise, I don't get how changing up tags of the same larger group led to me downloading English anthologies that I couldn't get before. I originally assumed there was one big metadata page group that only added newly released works, ordered them by release date and didn't shift them when you changed tags. Something needs to be changing, because I did download most of the missing English anthologies just by changing tags.
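The workaround above boils down to running several overlapping searches and taking the union of whatever IDs each one manages to return. A minimal sketch, assuming hypothetical helpers `page_count(query)` and `fetch_page(query, page)`, where `fetch_page` returns `None` for a page that fails with 404:

```python
from typing import Callable, Iterable, List, Optional, Set


def collect_ids(
    queries: Iterable[str],
    page_count: Callable[[str], int],  # hypothetical: pages per query
    fetch_page: Callable[[str, int], Optional[List[int]]],  # hypothetical
) -> Set[int]:
    """Union the gallery IDs found by several overlapping search queries."""
    found: Set[int] = set()
    for query in queries:
        for page in range(1, page_count(query) + 1):
            ids = fetch_page(query, page)
            if ids is None:  # this page failed; another query may cover it
                continue
            found.update(ids)
    return found
```

Because broader queries also return works outside the group that was originally wanted, the resulting set would still need to be filtered, for example by language, before downloading.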

PS, could you post or send me your metadata db so I could download all english works? Even if my method works, I am not able to get all of them, just reduce errors. Even if you also don't have all of them in the db due to errors, maybe you have those I don't.

9-FS commented 1 month ago

Yes, this is pretty similar to what I had also put together.

It is true that a particular page, in your case page 4, does not seem to work no matter how long you wait. But when you wait a day or more, new entries cause the entry list to shift, making previously unavailable hentai IDs available again. This is why I recommend just setting SLEEP_INTERVAL to 50.000 or more and letting it run a couple of times. Unless anybody finds out how to fix the root problem, there is not much I can do.

Unfortunately, even my compressed database is too large for a GitHub attachment, and I don't feel comfortable publicly sending links to my personal Nextcloud. As a compromise, I have just generated a downloadme.txt for you from my database.

I used this command:

SELECT Hentai.id
FROM Hentai
JOIN (SELECT hentai_id FROM Hentai_Tag WHERE tag_id = 12227) AS hentais_with_tag_desired
ON Hentai.id = hentais_with_tag_desired.hentai_id
ORDER BY Hentai.id;

downloadme.txt
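For anyone who wants to produce such a list from their own database, here is a hedged sketch that runs the same query from Python. It assumes the database is SQLite and that the list format is one ID per line; neither is stated in this thread. Only the table and column names that appear in the query above are used, and the tag ID is passed as a parameter instead of being hard-coded.

```python
import sqlite3
from typing import List

# Same query as above, with the tag ID as a bound parameter.
QUERY = """
SELECT Hentai.id
FROM Hentai
JOIN (SELECT hentai_id FROM Hentai_Tag WHERE tag_id = ?) AS hentais_with_tag_desired
ON Hentai.id = hentais_with_tag_desired.hentai_id
ORDER BY Hentai.id;
"""


def ids_with_tag(connection: sqlite3.Connection, tag_id: int) -> List[int]:
    """Return all hentai IDs carrying the given tag, in ascending order."""
    rows = connection.execute(QUERY, (tag_id,)).fetchall()
    return [row[0] for row in rows]


def format_id_list(ids: List[int]) -> str:
    """Format IDs one per line (assumed list format, not confirmed here)."""
    return "\n".join(str(hentai_id) for hentai_id in ids)
```

The string returned by `format_id_list(ids_with_tag(connection, 12227))` could then be written to a text file.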

Wolf0006 commented 1 month ago

It still gives me that same error when I try anthology english, page 4, but I managed to get the works from page 4 by combining other tags. That was my point. You don't need to wait. Thank you for the downloadme.txt. I believe it should work with individual tags.

9-FS commented 1 month ago

I understood that and I'm happy it works for you, but I think it's too complicated to become the new recommended way. Downloads take a while anyways, so there is not really any time wasted by telling people to just do multiple rounds of searching and downloading.