Bionus / imgbrd-grabber

Very customizable imageboard/booru downloader with powerful filenaming features.
https://www.bionus.org/imgbrd-grabber/
Apache License 2.0

Unconscionable Grabber use problem solving discussion #1023

Open MasterPetrik opened 7 years ago

MasterPetrik commented 7 years ago

There are a couple of problems that happen because of the way Grabber currently works, namely the unlimited freedom of users to do anything with every source listed in Grabber. This IS a very big problem: because some users do not use Grabber fairly, there are many problems for other Grabber users, for site users, and even for the Grabber developer.

Here are some of these problems:

Without any limits, any Grabber user can easily mount a DoS attack on the sources, even without malicious intent, and several Grabber users together can effectively mount a DDoS attack. In the current version of Grabber there is nothing to stop them.

Almost 80% of all submitted Grabber crash reports happen when a user tries to download an extremely large number of images (all images for a very popular tag like "touhou", all images from one booru, or even all images from all boorus). That means fair-use users almost never hit these problems; unfair users cause the problems for themselves and also create work for the developer, who has to check every single crash.

Even one user who tries to download every image from a booru produces as much load as hundreds of fair-use Grabber users and thousands of normal site users. And judging by the crash reports, there is far more than one such user!

This parasitic server load disturbs source owners and motivates them to block downloads for all Grabber users, or even for everyone (as happened with Sankaku and iqdb.org). So because of a small number of unfair users, all the fair-use users suffer.

And that is not the full list of serious problems caused by unlimited use of Grabber. This is the classic game-theory problem of unconscionable use of a limited common resource: https://en.wikipedia.org/wiki/Tragedy_of_the_commons. So I want to discuss how to solve these problems before it's too late.

ravenlord1 commented 7 years ago

HTTrack (https://github.com/xroche/httrack) is similar in function. It has settings to adjust the download rate and the number of connections per second. Maybe add something similar, with upper caps on those limits.

MasterPetrik commented 7 years ago

Here are several ideas for how to solve this problem:

  1. Show a warning message for very big mass downloads, for example anything bigger than 5,000 or 10,000 images, telling the user that responsibility for any problems caused by using Grabber this way is disclaimed.

  2. Forcibly forbid small timing values between server requests for mass downloads.

  3. Forbid mass-downloading ALL images with an empty search query (all images from all sources) from more than one source at once. This should not affect empty-search, all-source mass downloads when the total image count is not too big (fewer than 10,000, for example). I think this would be a very good idea: it would limit bad users' parasitic traffic while still giving them the ability to download everything from everywhere... someday.

  4. Forcibly set different download settings with much bigger timings for problem sites like Sankaku, with no way to lower the values.

  5. Forbid running more than one extremely big mass download at a time, with no limit on the number of normal mass downloads (<10,000 pics).

  6. Use very big timings for extremely big batch downloads, I mean really big, and inform users about it in the same warning message. This is the core, obvious suggestion for preventing DoS/DDoS by new, inexperienced, or bad users, but it will require some other changes in Grabber's download process. For example, a variant of a gradient scale (see the sketch after the next paragraph): the first 1,000 images get a 1-second delay, the next 1,000 a 2-second delay, the next 1,000 a 3-second delay, ..., the next 1,000 a 9-second delay, and all downloads after that are limited to 1 image per 10 seconds.

That means 360 images per hour, 8,640 per day, 60,480 per week, 241,920 per month, and 3,000,000+ per year, which means a user following these rules could still download any booru except Sankaku (~6 million images) in less than a year.
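
A minimal sketch of that gradient scale, assuming the delay is applied per image inside the batch loop (the function and the loop are illustrative, not Grabber's actual code):

```cpp
#include <algorithm>
#include <chrono>
#include <thread>

// Gradient delay scale from the suggestion above: 1 s for the first 1,000
// images, 2 s for the next 1,000, and so on, capped at 10 s per image.
std::chrono::seconds delayForImage(int imagesAlreadyDownloaded)
{
    int step = imagesAlreadyDownloaded / 1000 + 1; // 0-999 -> 1 s, 1000-1999 -> 2 s, ...
    return std::chrono::seconds(std::min(step, 10));
}

// Hypothetical batch loop applying the delay between downloads.
void downloadBatch(int totalImages)
{
    for (int i = 0; i < totalImages; ++i) {
        // downloadImage(i); // placeholder for the actual per-image download
        std::this_thread::sleep_for(delayForImage(i));
    }
}
```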

More importantly, any increase in time limits makes a couple of existing Grabber problems much more pressing. Long-running mass downloads need:

  1. A crash-proof downloader for big image lists.
  2. A reliable ability to auto-resume the download process in all cases, automatically re-download all problem images, and automatically handle all download exceptions.
  3. A scalable download process, mostly independent of the image count.
  4. A downloads view built into Grabber's downloads tab (in the current downloads tab the bottom half is mostly unused, so all the download-progress info could be placed there).
  5. The ability to minimize the window to the tray and keep all downloads running in that state.

Bionus commented 7 years ago

I totally agree with the fact that this should be addressed. Even though in a perfect world, that would be done on the server-side of things using API keys and API limiting, we have to take responsibility here for creating a lot of traffic.

Most ideas here are good, but what would stop the user from clearing their settings (or wherever this "limit counter" is stored) and then starting to download again? Or, since the program is open source, removing the limits and compiling it themselves?

In the end, all those solutions will not apply to some part of the user population. That is however acceptable I believe, as it would already highly reduce the load on servers for 99% of users.


Here are several ideas for how to solve this problem:

  1. Sure, but most users will simply ignore it and wonder why their download is so slow.
  2. Isn't it the same as (6)?
  3. But if we already have (6), why not let them do it? Nothing would stop them from starting the other download right after the first anyway. In any case, it would be very slow.
  4. Certainly.
  5. Same remark as (3). Since it's slow anyway, might as well let them do it, with a warning from (1).
  6. Good idea. Another, equivalent solution is the one suggested by @BarryMode: pause every 1,000 images. Both have pros and cons.

Long-running mass downloads need:

I wish I could make all of those true. 😅 Most of these are long-standing issues which I've been working on for the past few months, and they're starting to see the light, however.


However, I believe those limits should depend not necessarily on the number of images (1,000 in both suggestions here) but on the size of the download.

Limiting to something like 1 GB/h, for example, seems to better match what the website owners actually want, which is to save bandwidth, since the CPU issues ("lots of requests") will be solved by the changes in #995.
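
A minimal sketch of such a size-based limit, as an hourly byte budget checked before each file (the class, names, and the 1 GB/h figure are assumptions for illustration, not Grabber's code):

```cpp
#include <chrono>
#include <thread>

// Minimal hourly byte budget: before each download, make sure the current
// one-hour window still has room for the file, otherwise wait for the next
// window to start.
class BandwidthBudget
{
public:
    explicit BandwidthBudget(long long bytesPerHour) : m_limit(bytesPerHour) {}

    void waitFor(long long fileSize)
    {
        auto now = std::chrono::steady_clock::now();
        if (now - m_windowStart >= std::chrono::hours(1)) {
            m_windowStart = now; // a new one-hour window begins
            m_used = 0;
        }
        if (m_used + fileSize > m_limit) {
            // Budget exhausted: sleep until the window rolls over, then reset.
            std::this_thread::sleep_until(m_windowStart + std::chrono::hours(1));
            m_windowStart = std::chrono::steady_clock::now();
            m_used = 0;
        }
        m_used += fileSize;
    }

private:
    long long m_limit;
    long long m_used = 0;
    std::chrono::steady_clock::time_point m_windowStart = std::chrono::steady_clock::now();
};

// Usage: BandwidthBudget budget(1LL * 1024 * 1024 * 1024); budget.waitFor(fileSize);
```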

Bionus commented 7 years ago

Also, another thing I was wondering was: wouldn't imposing limits on the users simply make them switch to another program without those, effectively making the changes useless for the website owners?

I know Hydrus was mentioned a lot of times in this issue board. How does it limit its downloads? (if it does)

MasterPetrik commented 7 years ago

since the CPU issues ("lots of requests") will be solved by the changes in #995.

I don't really understand what you did in #995 to solve the server-request problem (your explanation was too technical for me). I guess it's something connected with the new tag system. Can you please explain it to me in a few words?

  2. Forcibly forbid small timing values between server requests for mass downloads.

Isn't it the same as (6)?

No. With this point I mean that you can forbid users from lowering the default timings in the source settings, with no restriction on raising them. If, for example, you choose a 1-second delay between requests as the Grabber default, there should be no way for the user to lower it.
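
A trivial sketch of that floor, assuming the per-source delay setting is stored in milliseconds (the names and the default value are illustrative):

```cpp
#include <algorithm>

// Hypothetical hard floor on the per-source request delay (milliseconds):
// the value from the user's source settings is only ever clamped upward,
// never allowed below the shipped default.
constexpr int kMinRequestDelayMs = 1000; // assumed 1 s default floor

int effectiveRequestDelayMs(int userSettingMs)
{
    return std::max(userSettingMs, kMinRequestDelayMs);
}
```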

since the program is open source, removing the limits and compiling it themselves?

There aren't many users experienced enough to do that, and even those who do will be just a small percentage of all the users who WANT to do it but CAN'T. So I guess it would still affect most of the bad users.

what would stop the user from clearing their settings (or wherever this "limit counter" is stored) and then starting to download again?

Just store THESE settings not in a .txt or .ini file but in some encrypted file that can't easily be edited by users. Anyway, that part needs more thinking...

In the end, all those solutions will not apply to some part of the user population.

Let's think about it. Almost everyone who downloads all the images from a booru doesn't really know what they'll do with them. They do it because they CAN. If you can download 6 million images, why not give it a try, eh? That's how they think. Most people who want to download more than 10,000 images for one tag aren't really going to view them all; they look at some of them (100 to 5,000) and ignore the rest. These users don't really know what they want, and for them it makes little difference whether they download 10k or 100k images; they'll be fine either way. Even if you gave them the ability to download billions of images, it still wouldn't be enough. That's why these users' appetites should be limited.

Most users use Grabber to download a small amount of data, like all the images for one artist or one copyright. Any artist tag has fewer than 3,000 images, and while the biggest copyright tag (touhou) has around 600,000, the average copyright tag has no more than 20,000. These are Grabber's main fair-use users: they know what they want and they can limit themselves.

And a big share of Grabber users (me included) don't use batch downloads at all, or use them very rarely, because they treat Grabber as a multibooru viewer and save images manually from the grid. They don't want to download images they haven't viewed or store images they don't like. This group causes fewer problems than the others.

If you limit users in the first category, it won't make them uninstall Grabber on the spot; most of them just want to grab everything they can, so if you limit them by speed rather than by image count, most of them will accept it. Or the lower speed will force them to think about whether they really need to download so many images. The most determined image hoarders will find a way around the limits anyway, but it's better to limit 95% of the bad users than 0%, right?

I know Hydrus was mentioned a lot of times in this issue board. How does it limit its downloads? (if it does)

Well, I personally don't know, but all image operations in Hydrus are very slow, and everywhere you see warning messages that any change to the settings can cause a crash, so I don't think its downloader can handle such a big number of images at all.

Also, another thing I was wondering was: wouldn't imposing limits on the users simply make them switch to another program without those, effectively making the changes useless for the website owners?

Anyway, batch downloading is not the only main function of Grabber or Hydrus Network. Grabber is more a multibooru viewer without advertisements than just a downloader, and Hydrus is more an image database than a downloader. So even if the download function is limited, it won't be such a big problem for either of them.

Anyway, downloading more than 20,000 images in one batch is still unstable in all current downloaders, including Grabber. So if you add the limits first and then make it more stable, you will keep all your users, because the other downloaders have the same stability problems right now.

MasterPetrik commented 7 years ago

Another, equivalent solution is the one suggested by @BarryMode: pause every 1,000 images. Both have pros and cons.

Pausing every XXX images is a very easy solution to implement, but it has a bunch of downsides:

  1. It still allows a DoS if many users' download windows happen to line up.
  2. Uneven server usage: periods of high load alternate with periods of no load.
  3. No benefit for small-batch users.
  4. No penalty for extremely big batch downloaders.
  5. Bad edge cases, such as a user downloading 1,001 images and having to wait a long time for the last one.

No, what we need here is a system that increases the penalty as the image count grows, and the increase should be gradual so it doesn't annoy users. They need to understand that they can download something big, but they will pay for it with time. That way they will only start big downloads when they REALLY want them and are ready to accept the limits; otherwise the system will nudge them toward smaller downloads that finish faster. Systems like this are very good at teaching users fair-use thinking. Moreover, this approach spreads the server load over time and makes it predictable from the server's side.

Bionus commented 7 years ago

I don't really understand what you did in #995 to solve the server-request problem (your explanation was too technical for me). I guess it's something connected with the new tag system. Can you please explain it to me in a few words?

Currently, most sources (pretty much all except Danbooru and e621) do not provide the tag types in their API. Therefore, when batch downloading, if your filename contains something like "%copyright%", Grabber has no idea which tags are characters, copyrights, or whatever. So it does an additional query to the image details page ("/post/show/{id}" on Danbooru sources), where the tag information is present.

With the change (which will likely not yet be fully implemented in 5.5.0, so that I can guarantee a release this weekend), Grabber will keep a local database of all tags for each source. Therefore, even if the API does not give the tag type information, Grabber can check the types directly in the local tag database. The problem I still have to solve for this feature is tag changes, that is, when a tag's type is changed on the website after it's already in the local database.
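
A minimal sketch of that idea: a per-source map from tag name to tag type, consulted before falling back to the details page (the class and method names are assumptions, not the actual #995 code):

```cpp
#include <optional>
#include <string>
#include <unordered_map>

// Sketch of a per-source local tag database: tag name -> tag type
// ("artist", "character", "copyright", ...). When the listing API does not
// return tag types, the batch downloader consults this map instead of
// requesting /post/show/{id} for every single image.
class TagTypeCache
{
public:
    std::optional<std::string> typeOf(const std::string &tag) const
    {
        auto it = m_types.find(tag);
        if (it == m_types.end())
            return std::nullopt; // unknown tag: fall back to the details page
        return it->second;
    }

    void update(const std::string &tag, const std::string &type)
    {
        m_types[tag] = type; // also how a changed type would be refreshed
    }

private:
    std::unordered_map<std::string, std::string> m_types;
};
```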

Just store THESE settings not in a .txt or .ini file but in some encrypted file that can't easily be edited by users. Anyway, that part needs more thinking...

Yes, it will require some thinking. A file is not a very good solution, as simply removing it or replacing it with the default one (from the installation) would reset the limits. Writing to some obscure part of the Windows registry could work on Windows, but not on other platforms (less used, but they should still be taken into consideration).

There's also the option to simply not store anything, or to ignore the fact that the file could be removed: the user could restart/delete the file to reset the limit for batch downloads, but given how batch downloads are run in the first place (during the night or in the background), that would be enough of a hassle that it can probably be ignored for most users.

And I totally agree with the rest of your post. :+1:

Bionus commented 7 years ago

The problem is that Grabber can be blocked. It's not technically easy, but it's possible.

That's what we're seeing on Sankaku, which now no longer works with Grabber (and other downloaders). So the non-downloading users, or those who download very little, are now being "punished" even though they did little harm to the website.

While Sankaku is a special case (very bad API which caused Grabber to do much more requests than other sources to function properly), it's possible that some limits may need to be enforced. What's very important is that they should only apply to users that download a lot.

brazenvoid commented 7 years ago

I agree that downloading hundreds of thousands of images is harmful to the servers and in fact makes the downloader's life hell (whether they realize it or not). It takes me a whole lot of time to sift through them; I can only get through 3,000 or so images a day.

But that's all because there isn't a very good system in place for managing tags, or for downloading images against a large set of tags configured to block or allow them. That could be done in the app, much like how the blacklist is managed. I presume that because Bionus is currently focused on fixing bugs, he hasn't thought about such a system yet.

If such a system had been in the app already, I would never have downloaded the >100k images that I have, since only about 20-40 images per batch of ~1,500 pass my appraisal.

Also, due to the lack of any intelligence or history function, I have many times downloaded images I had already seen. Most of those cases were caused by crashes and similar roadblocks that forced me to change my search patterns and make them more general, but that has also effectively cut me off from the majority of the old material I wanted to download.

All of this has irked me a lot over time: keeping documents with tag combinations, count values, and so on. Having a list of more than a hundred tag combinations and managing it every week is very time-consuming.

Finally, I found that making my own app, with a substantially powerful rules-based tag filter system, was beneficial in the long run. It also has tag caching, much like what Bionus is currently testing.

Tags can also be configured to limit the search set, through minimum tag counts or by requiring certain tags depending on the tags already specified.

Bionus commented 7 years ago

But that's all because there isn't a very good system in place for managing tags, or for downloading images against a large set of tags configured to block or allow them. That could be done in the app, much like how the blacklist is managed. I presume that because Bionus is currently focused on fixing bugs, he hasn't thought about such a system yet.

If such a system had been in the app already, I would never have downloaded the >100k images that I have, since only about 20-40 images per batch of ~1,500 pass my appraisal.

I don't quite understand your suggestion @brazenvoid, maybe you can elaborate some more? Isn't there already a blacklist system in place?

Also, due to the lack of any intelligence or history function, I have many times downloaded images I had already seen. Most of those cases were caused by crashes and similar roadblocks that forced me to change my search patterns and make them more general, but that has also effectively cut me off from the majority of the old material I wanted to download.

How would you present the history in a manner that would prevent you from re-downloading or re-searching the same stuff? And isn't the MD5 list enough to prevent re-downloading the same stuff?

Finally, I found that making my own app, with a substantially powerful rules-based tag filter system, was beneficial in the long run. It also has tag caching, much like what Bionus is currently testing.

Can it be found somewhere? :smile:

Tags can also be configured to limit the search set, through minimum tag counts or by requiring certain tags depending on the tags already specified.

Same here, I'm not sure I totally understand this feature.

brazenvoid commented 7 years ago

I don't quite understand your suggestion @brazenvoid, maybe you can elaborate some more? Isn't there already a blacklist system in place?

Well, for example, a source only allows a certain number of tags in a single search. But you could support further tags by matching them client-side, just like the blacklist is used to block them.

That way only images matching those tags are kept, so the server is not burdened by the extra downloads and the user doesn't have to sift through the whole downloaded set.

This, though, will only benefit people who are willing to put in the extra effort to identify what they like through tag combinations.

How would you present the history in a manner that would prevent you from re-downloading or re-searching the same stuff? And isn't the MD5 list enough to prevent re-downloading the same stuff?

As for the MD5 list, if it is indeed there to prevent that, I haven't seen it work that way. If I do a search, download, and then remove images from the folder, they come back when I run the search and download again. FYI, I have enabled the option to keep deleted files in the MD5 list.

Can it be found somewhere? 😄

Not yet, it's a personal project after all, and seldom worked on due to my professional responsibilities. I will, however, present some solutions when I get it to a more stable state.

Same here, I'm not sure I totally understand this feature.

You can make the user add some size-limiting tags (height/width, which might be source-dependent), or just ask them to add more tags if the downloadable image count or size reaches a certain limit.

Bionus commented 7 years ago

Well, for example, a source only allows a certain number of tags in a single search. But you could support further tags by matching them client-side, just like the blacklist is used to block them.

That way only images matching those tags are kept, so the server is not burdened by the extra downloads and the user doesn't have to sift through the whole downloaded set.

This, though, will only benefit people who are willing to put in the extra effort to identify what they like through tag combinations.

So, adding the current "post-filtering" feature to batch downloads (it is currently only supported when browsing, via the "+" button). I'll look into adding that for batches. :+1:
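
For reference, a minimal sketch of what that client-side post-filtering could look like for a batch (the function and parameter names are illustrative, not Grabber's actual API):

```cpp
#include <set>
#include <string>
#include <vector>

// Client-side post-filter for a batch: once the server-side tag limit is
// reached, the remaining wanted/unwanted tags are checked locally against
// each result's tag list before the image is downloaded.
bool passesPostFilter(const std::set<std::string> &imageTags,
                      const std::vector<std::string> &mustHave,
                      const std::vector<std::string> &mustNotHave)
{
    for (const auto &tag : mustHave)
        if (imageTags.count(tag) == 0)
            return false;
    for (const auto &tag : mustNotHave)
        if (imageTags.count(tag) != 0)
            return false;
    return true;
}
```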

As for the MD5 list, if it is indeed there to prevent that, I haven't seen it work that way. If I do a search, download, and then remove images from the folder, they come back when I run the search and download again. FYI, I have enabled the option to keep deleted files in the MD5 list.

Sounds like a bug in the "keep deleted files in MD5 list" setting then. I'll take a look.
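
As a reference point, a minimal sketch of what the MD5 list is expected to do when "keep deleted files in the MD5 list" is enabled (the class and method names are assumptions, not Grabber's implementation):

```cpp
#include <string>
#include <unordered_set>

// Expected behaviour of the MD5 list: once a file's hash is recorded, the
// image is skipped on later searches, even if the file was deleted from
// disk, as long as "keep deleted files in the MD5 list" is enabled.
class Md5List
{
public:
    bool alreadySeen(const std::string &md5) const { return m_hashes.count(md5) != 0; }
    void remember(const std::string &md5) { m_hashes.insert(md5); }
    // With the "keep deleted files" option on, entries are simply never
    // removed when the corresponding file disappears from disk.

private:
    std::unordered_set<std::string> m_hashes;
};
```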

You can make the user add some size-limiting tags (height/width, which might be source-dependent), or just ask them to add more tags if the downloadable image count or size reaches a certain limit.

I see. So if the user starts a download of more than N images, suggest that they refine their search to produce fewer results.
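
A trivial sketch of that check, with assumed thresholds for the image count and the estimated total size (neither value comes from Grabber's settings):

```cpp
// Assumed thresholds; neither value comes from Grabber's actual settings.
constexpr int       kMaxImagesBeforePrompt = 10000;
constexpr long long kMaxBytesBeforePrompt  = 10LL * 1024 * 1024 * 1024; // ~10 GB

// True when the batch is big enough that the user should be asked to
// refine the search before it is queued.
bool shouldSuggestRefiningSearch(int estimatedImageCount, long long estimatedTotalBytes)
{
    return estimatedImageCount > kMaxImagesBeforePrompt
        || estimatedTotalBytes > kMaxBytesBeforePrompt;
}
```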

DJPlaya commented 7 years ago

I wonder who actually cares about someone downloading massive amounts of files. Of course it's kind of a DoS attack, but I see no reason to add a hard limit here (for example, UDP Unicorn is still available on SourceForge lel). I really dislike the latest update, and since this is open source I may be forced to remove the thread limit in a fork. Otherwise it will just take ages to download all this stuff, which I actually really need for my own site lel -Sad Life-

Bionus commented 7 years ago

I really dislike the latest update, and since this is open source I may be forced to remove the thread limit in a fork. Otherwise it will just take ages to download all this stuff, which I actually really need for my own site lel

There isn't even a limit in the program yet. 😕

DJPlaya commented 7 years ago

So I can just set the max threads in the config instead of the limited GUI?

brazenvoid commented 7 years ago

@DJPlaya For example, Gelbooru imposes a limit of 6 simultaneous file downloads, no matter how many connection threads your program opens. Even in an error state, when Grabber tries to download tens of files, it still can't start receiving more than 6 at a time...
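
A minimal sketch of mirroring that limit client-side with a counting semaphore (C++20; the function is illustrative, not Grabber's actual download code):

```cpp
#include <semaphore>

// Cap on simultaneous downloads, mirroring the 6-connection server-side
// limit described above. Each worker acquires a slot before starting a
// transfer and releases it when done.
std::counting_semaphore<6> g_downloadSlots(6);

void downloadWithCap()
{
    g_downloadSlots.acquire();   // blocks while 6 transfers are in flight
    // startTransfer(...);       // placeholder for the actual download
    g_downloadSlots.release();   // free the slot when the transfer ends
}
```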

DJPlaya commented 7 years ago

@brazenvoid Let's use Sandboxie with many proxies ;)