ccloli / e-hentai-db

Just another E-Hentai metadata database
https://イー変態.ロリ.みんな
GNU General Public License v3.0

IP ban during sync #12

Open ReiFan49 opened 3 years ago

ReiFan49 commented 3 years ago

Hello, I triggered an IP ban error during gallery metadata fetching (I think I only got from the 2.0M IDs down to 1.9M at that point). Is there any solution aside from syncing the huge thing? I kinda wonder about maintaining this as well. (If we need to talk about this privately, let me know as well, thanks!)

ccloli commented 3 years ago

I don't think it's a good idea to sync data if your dataset is too old (like starting from that big gdata.json). It may take a very long time to get up to date: it took me more than 3 days (more than 72 hours) with hundreds of proxy IPs to update all galleries as of 2020-03-16, and since it's now about one and a half years later, it may be even worse with a single IP address. At that time I noticed that about 40 requests per minute won't get your IP banned, but your experience may vary.
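For reference, here's a minimal Node sketch (not the repo's sync script) of spacing out api.php calls to stay around that ~40 requests per minute. The endpoint and payload follow the public gdata method; the database write is left as a placeholder:

```js
// Minimal throttle sketch: keep gdata calls around 40 per minute (~1.5 s apart).
// Uses Node 18+ global fetch; adjust endpoint/payload to your own setup.
const API = 'https://api.e-hentai.org/api.php';
const DELAY_MS = 1500; // ~40 requests per minute

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchMetadata(gidlist) {
  // gidlist: array of [gid, token] pairs (the API accepts up to 25 per call)
  const res = await fetch(API, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ method: 'gdata', gidlist, namespace: 1 }),
  });
  return res.json();
}

async function throttledSync(batches) {
  for (const batch of batches) {
    const data = await fetchMetadata(batch);
    // TODO: write data.gmetadata to the database here, not only at the very end
    await sleep(DELAY_MS);
  }
}
```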

To be clear, this repo was originally meant to load the original gdata.json file (and mostly for fun, so the code is messy and buggy). But since E-Hentai came back at that time, I added the update script to sync data. However, the sync script is just garbage, as it won't write to the database until it has grabbed all the data it needs. When I wrote the sync script my dataset was almost up to date, so it wasn't an issue back then, and I set up crontab tasks to update the data every hour and every day, so I barely met your problem.

But "barely" means I have that issue, too. A few days ago I noticed my server was died continiously, found that my account was banned or something else that made my cookies invalid, so the script ran into an infinity loop and ate all the resources. After resolving that issue, my dataset is about 3 days behind, and it took about 1.5 hours to sync. So for you, it may take a long time (maybe about a month) to finish that (and your script should not be killed and your RAM should not be eaten up or you have to restart).

Someone asked for the latest database dump a year ago, so I dumped the diff part for them, but that file has expired, and unfortunately my HDD died a few months ago, so that dump data wasn't kept. If you do need a database dump, please let me know and I'll do that for you; I need to schedule it to make sure the other things on that server won't be significantly affected, and it may take some time since the example site is running on a server with poor performance.

ReiFan49 commented 3 years ago

Hmmm, so what I do in general (the changes are stored in a private repo) is:

For memory, I think the import should be split at certain milestones (let's say every 1000 API calls). Apparently, with a 1 + 1.0 sec delay, I got hit with the ban after the 100th call.


So basically, to cater to this issue, I need to slowly update data from page 10k, 9k, 8k, etc.? I added that offsetting to help me start from ID 1.6M/1.5M.
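A rough sketch of that milestone idea, assuming hypothetical fetchBatchForGid/saveRows helpers: flush the buffered metadata every N API calls and record the last gallery ID, so a ban or crash only loses the current chunk and a restart can resume from the saved offset:

```js
// Sketch of "flush every milestone": buffer rows, but write them out every
// FLUSH_EVERY API calls and checkpoint the last gallery id to a file.
const fs = require('fs');

const FLUSH_EVERY = 1000;               // flush after this many API calls
const CHECKPOINT_FILE = 'sync-checkpoint.json';

function loadCheckpoint() {
  try {
    return JSON.parse(fs.readFileSync(CHECKPOINT_FILE, 'utf8'));
  } catch {
    return { lastGid: 0 };              // fresh start (or set your 1.5M/1.6M offset here)
  }
}

async function runImport(fetchBatchForGid, saveRows) {
  let { lastGid } = loadCheckpoint();
  let buffer = [];
  let calls = 0;

  while (true) {
    // fetchBatchForGid is a placeholder for your own fetcher
    const { rows, nextGid } = await fetchBatchForGid(lastGid);
    if (!rows.length) break;
    buffer.push(...rows);
    lastGid = nextGid;
    calls += 1;

    if (calls % FLUSH_EVERY === 0) {
      await saveRows(buffer);           // e.g. INSERT the chunk into MySQL
      buffer = [];
      fs.writeFileSync(CHECKPOINT_FILE, JSON.stringify({ lastGid }));
    }
  }
  if (buffer.length) {
    await saveRows(buffer);
    fs.writeFileSync(CHECKPOINT_FILE, JSON.stringify({ lastGid }));
  }
}
```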


Also, I thought you did a "regular yearly dump" haha, I see from the other issue about the dump then.

ccloli commented 3 years ago

Yep, if you can grab all the needed galleries, it's better to do it in ascending order, so that you can easily recover your progress when anything goes wrong (and it's much better to write data directly into the database instead of waiting until it finishes and then writing to a file; I did it that way since I had already written an import script and didn't want to duplicate it, but that's a big issue when you're syncing a lot of galleries).

Or, another way: you can make a list of pending galleries; when a gallery is done, remove it from the list and write the list somewhere safe, and when an error occurs, you can use that list to continue your progress (and yes, there's a fetch script to download galleries from a specific file or from argv; I use it to update specific galleries or import private galleries I found on Google, and maybe you can modify it to save your progress to a file).
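A minimal sketch of that pending-list approach (fetchGallery and storeGallery are placeholders, not functions from this repo): keep the to-do list in a file, rewrite it after every successful gallery, and simply reload it on restart:

```js
// Pending-list sketch: the file is the source of truth for what's left to fetch.
const fs = require('fs');

const PENDING_FILE = 'pending-galleries.json';

async function drainPending(fetchGallery, storeGallery) {
  // File contains [[gid, token], ...] for every gallery still to fetch
  let pending = JSON.parse(fs.readFileSync(PENDING_FILE, 'utf8'));

  while (pending.length) {
    const [gid, token] = pending[0];
    const metadata = await fetchGallery(gid, token);
    await storeGallery(metadata);

    pending = pending.slice(1);
    // Persist the shrunken list so a crash or ban resumes from here
    fs.writeFileSync(PENDING_FILE, JSON.stringify(pending));
  }
}
```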

If you need to speed things up (or prevent being banned too fast), you can find some HTTP proxy servers to forward your requests; I think some of the scripts support that natively (by putting a .proxies file in the repo's root directory).
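A hedged sketch of rotating proxies in Node: this is not the repo's built-in .proxies handling (its exact file format isn't shown here); it just assumes one proxy URL per line and uses the undici package (npm install undici):

```js
// Round-robin proxy rotation sketch using undici's fetch + ProxyAgent.
const fs = require('fs');
const { fetch, ProxyAgent } = require('undici');

// Assumption: .proxies holds one proxy URL per line, e.g. http://user:pass@host:port
const proxies = fs.readFileSync('.proxies', 'utf8')
  .split('\n')
  .map((line) => line.trim())
  .filter(Boolean);

let cursor = 0;

function nextProxy() {
  const proxy = proxies[cursor % proxies.length];
  cursor += 1;
  return new ProxyAgent(proxy);
}

async function proxiedPost(url, body) {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
    dispatcher: nextProxy(),   // each call goes out through the next proxy in the list
  });
  return res.json();
}
```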


Well, I don't do any scheduled backups (and I didn't expect anyone would use this lol), so when that user asked for a SQL dump, I had to dump it manually (and removed unused fields on the server).

BTW, I checked the database just now: it's about 1.1 GB in total (but I added some indexes). If you're just using the data (the gallery search API is very slooooooooow), maybe it's better to optimize the structure or change the database engine (I use MyISAM, but I don't think it's a modern choice), or consider another engine like Oracle or ElasticSearch (with a bunch of servers as a cluster lol).


ReiFan49 commented 3 years ago

I'd say at most ~1200 pages a day (maybe lower) should do, and I'm planning to use the proxy after this one to reduce the "detection". I feel that api.php has a "stronger load weight", so it can't be used a lot in a short timespan.

(It's just sad that the source for initializing the DB is indeed that one and only mega link lol.)


Say, does changing the gallery category into an integer make it "better" in terms of storage? I was also thinking about removing torrentcount support, as I keep doing delete e.torrentcount;, and perhaps I'll write some generalization script for those /app/action/ files.

ccloli commented 3 years ago

does changing the gallery category into an integer make it "better" in terms of storage?

It might be. I think E-Hentai stores it as an integer; once you use the category filter on E-Hentai and check the query string, you'll see what I mean, and that's why the api supports querying with a combined bit (though the behaviour is the opposite of E-Hentai's: for doujinshi only, E-Hentai uses 1021, and the api uses 2 or -1021).

But I'm not using it, since I didn't know about it at the time, so I store the category as-is.
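For illustration, a small sketch of the category-as-bitmask idea; the exact bit values are my assumption of how E-Hentai's category filter maps names to bits (apart from Doujinshi = 2, which follows from 1023 - 1021 in the example above):

```js
// Category bitmask sketch: storing the bit (a small integer) instead of the
// category name string, and deriving the E-Hentai-style filter value from it.
const CATEGORY_BITS = {
  Misc: 1,
  Doujinshi: 2,
  Manga: 4,
  'Artist CG': 8,
  'Game CG': 16,
  'Image Set': 32,
  Cosplay: 64,
  'Asian Porn': 128,
  'Non-H': 256,
  Western: 512,
};

const ALL_CATEGORIES = 1023; // sum of every bit above

// Store a gallery's category as its bit instead of the name string
const stored = CATEGORY_BITS['Doujinshi'];                  // 2

// E-Hentai's filter excludes categories, so "Doujinshi only" becomes 1023 - 2
const fCats = ALL_CATEGORIES - CATEGORY_BITS['Doujinshi'];  // 1021

console.log(stored, fCats); // 2 1021
```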

I was also thinking about removing torrentcount support, as I keep doing delete e.torrentcount;

Yep, e.torrentcount is useless, and I remember E-Hentai's previous api.php didn't return the torrent list, so I added that feature and noted in the README that torrentcount is useless, as it's stored as-is, too.

I feel that api.php has a "stronger load weight", so it can't be used a lot in a short timespan.

According to the official wiki, 4-5 requests per ~5 seconds is fine.


By the way, I dumped my database just now; you can import it directly (or modify the structures before importing) so that you won't need to sync again (and stress E-Hentai's servers). I thought it would take a long time to export, but it wasn't that long.

https://github.com/ccloli/e-hentai-db/releases/tag/v0.3.0-29%2Bg1bac4cf

If you need to import the SQL, it's better to back up your database first (or import into a fresh database), since the dump uses plain INSERT, not INSERT IGNORE or ON DUPLICATE KEY UPDATE.

ReiFan49 commented 3 years ago

I remapped some of the numbers (I think private is below misc in mine), but don't worry about that lol, I can handle it on my own. Otherwise, thanks!


I can replace the INSERTs with REPLACE so it's easier to handle if needed. I'm sure the DB format is still based on this repo as a whole, right? I can tinker with some of the dump statements based on that :ok_hand:

ccloli commented 3 years ago

BTW, if you'd like to do a scheduled update, here is my crontab, and you can modify it for your case:

0 */1 * * * cd /var/www/e-hentai-db/ && npm run sync exhentai.org && npm run torrent-import exhentai.org && npm run torrent-sync exhentai.org 1
30 */6 * * * cd /var/www/e-hentai-db/ && npm run resync 48
15 */12 * * * cd /var/www/e-hentai-db/ && npm run sync exhentai.org 24

sync is for grabbing the gallery list, resync is for updating recent galleries' scores, tags, etc., torrent-import is for grabbing torrents for new galleries, and torrent-sync is for grabbing galleries from the torrent list (optional; it may be useless if you don't mind that the data you just synced isn't the latest). The torrent list contains newer galleries that haven't been listed on exhentai yet, but it's not a complete list and it'll break the sync script, so you need a sync for the last 24 hours.


One thing I forgot to mention is that, for now, I don't have a strong will to improve the current syncing script and SQL query statements; though they're a bit messy, they work for now. So, at least for now, the repo won't have any big updates, and you can do whatever you want without worrying about merging newer features from this repo, because there will be no newer features.

Though I have some ideas, I don't have the time and energy to do them right now. If you'd like to do more in your repo, here are some ideas (these are just ideas, not things you need to finish; I'm also noting them here because I may forget them in the future):

0xMana-git commented 5 months ago

idk if anyone is still working on this, but here is an idea to improve the syncing process: binary search the pages to find the first unsynced page, start syncing from that page incrementally, and write to the db after each request.

I'd implement it myself but I really don't wanna work with js 😭

0xMana-git commented 5 months ago

actually nvm, I was wrong about how pages work; it's probably way easier to just sync a limited number of ids at a time with the next thing

0xMana-git commented 5 months ago

I know it looks ugly as fuck, but I made a thingy here