ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

Enhancement idea: Automatic concurrency / delay management #85

Closed ethus3h closed 8 years ago

ethus3h commented 8 years ago

Sorry to have so many ideas :P I'm thinking, it would be useful to have options like --concurrency=auto and --delay=auto, so that grab-site would start with 1 connection, and as long as it doesn't hit network errors, it would increase the concurrency and/or decrease the delay until it did, at which point it would back off again, etc.

ivan commented 8 years ago

This would be hard to specify precisely, let alone program and test. And I don't think I would ever have the motivation to do it in grab-site.

Of course, if someone is supremely motivated to implement and test this, don't let me stop you...

This would probably also require writing a test server that spits out different kinds of errors.

ethus3h commented 8 years ago

Hm. I might try my hand at it eventually; I'm kinda still tossing around the idea of a distributed-computing crawler so I might try this then.

ivan commented 8 years ago

Let me know if you get started on that and I'll go into your bug tracker and give you all of my ideas

ethus3h commented 8 years ago

I guess I'll close this issue now, since it seems unlikely to happen for grab-site at the moment.

(Basically I'd think something like this would work:

  1. get last 20 responses from wpull.log
  2. $errorRate gets percentage that had network errors
  3. ($errorRate > 10%) ? concurrency-- : { if(0.5 > rand(0,1)) concurrency++ }

)
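The three steps above can be sketched as a small Python function. Note this is only an illustration of the heuristic, not grab-site code: how the recent error outcomes would actually be extracted from wpull.log is left out, and the 20-response window, the 10% threshold, and the 50% probe probability are the values from the pseudocode, not tuned constants.

```python
import random

WINDOW = 20             # step 1: number of recent responses to consider
ERROR_THRESHOLD = 0.10  # step 3: back off if >10% of them were errors


def adjust_concurrency(concurrency, recent_errors):
    """Return the new concurrency level.

    recent_errors: list of booleans, newest last, where True means the
    corresponding response was a network error (as parsed, hypothetically,
    from wpull.log).
    """
    window = recent_errors[-WINDOW:]
    if not window:
        return concurrency  # no data yet, leave it alone

    error_rate = sum(window) / len(window)  # step 2
    if error_rate > ERROR_THRESHOLD:
        # Step 3, error branch: back off, but keep at least one connection.
        return max(1, concurrency - 1)
    # Step 3, healthy branch: probe upward only half the time, so the
    # controller ramps up gradually instead of oscillating.
    if random.random() < 0.5:
        return concurrency + 1
    return concurrency
```

This is essentially additive-increase/additive-decrease; a real implementation would probably want multiplicative decrease on errors and would also have to adjust `--delay` in tandem.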