After more tests, it seems that the main reason the crawls are slowing down is that the workbench is full. In the case described above, I'm not sure whether the problem is the threads getting stuck, or whether the relatively low number of fetching threads simply could not keep up with the growth rate of the todo list.
A useful piece of information would be a global stack trace taken at the moment when everything slows down. Then we can see what the threads are doing.
So I've tried that, but without success. For now, my best guess is that the slowdown actually occurs when the workbench size exceeds the workbench maxByteSize.
Well, if that happens it means that you need a larger workbench. Politeness and workbench size must somehow be balanced.
I think you know this already, but to get a stack trace you just need to send SIGQUIT to the process or press CTRL-\ in its terminal.
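(Side note: `jstack <pid>` produces the same dump. If it's more convenient to grab it from inside the agent's JVM, e.g. through a small diagnostic hook or a JMX call of your own, something along these lines works; this is just a generic sketch, not BUbiNG code:)

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Generic sketch: print the stack of every live thread in this JVM,
// roughly what SIGQUIT / jstack would show.
public final class FullThreadDump {
	public static void main(final String[] args) {
		final ThreadMXBean mx = ManagementFactory.getThreadMXBean();
		// dumpAllThreads(true, true) also reports held monitors and synchronizers.
		for (final ThreadInfo info : mx.dumpAllThreads(true, true)) {
			System.out.println("\"" + info.getThreadName() + "\" state=" + info.getThreadState());
			for (final StackTraceElement frame : info.getStackTrace())
				System.out.println("\tat " + frame);
			System.out.println();
		}
	}
}
```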
What are currently your politeness and workbench size?
I have tried several settings, with several machine types and numbers of machines (I use EC2 clusters).
What currently works for me: 16 machines with 16 cores and 122GB RAM / politeness 6 sec / workbench size: 90GB.
It also depends on the size of the seed: larger seeds tend to make the crawl slow down faster. My current seed is 50M URLs, and typical crawl speed is 500GB/h.
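To make the politeness/workbench balance concrete, here is a rough back-of-the-envelope sketch based on the figures above; the average page size is my own assumption, and none of this is BUbiNG code:

```java
// Back-of-the-envelope sketch (my numbers, not BUbiNG's): how many distinct
// hosts must be ready in the workbench at any instant to sustain a target
// crawl rate under a given politeness delay.
public final class WorkbenchBalance {
	public static void main(final String[] args) {
		final double crawlBytesPerHour = 500e9;   // 500 GB/h for the whole cluster
		final int machines = 16;
		final double politenessSeconds = 6;       // per-host delay
		final double avgPageBytes = 50e3;         // assumed average page size

		final double bytesPerSecondPerMachine = crawlBytesPerHour / machines / 3600;
		final double pagesPerSecondPerMachine = bytesPerSecondPerMachine / avgPageBytes;
		// Each host yields at most one fetch every politenessSeconds, so keeping
		// the fetching threads busy needs roughly rate * delay ready hosts.
		final double minReadyHosts = pagesPerSecondPerMachine * politenessSeconds;

		System.out.printf("~%.0f pages/s per machine -> at least ~%.0f ready hosts in the workbench%n",
				pagesPerSecondPerMachine, minReadyHosts);
	}
}
```

With these numbers that is roughly 170 pages/s per machine, i.e. on the order of a thousand distinct hosts that must be ready at any instant; if the workbench byte budget cannot hold that many per-host visit states, the fetching threads starve and the todo list keeps growing.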
I have observed several times that BUbiNG starts slowing down at some point. I am under the impression that the FetchingThreads start behaving oddly and no longer do their job. When I change the number of fetchingThreads manually, todoSize suddenly decreases and readyToParse increases.
So here are some graphs from when I played with "fetchingThreads". In the first half, before I change anything, todoSize keeps increasing. Then I increase the number of threads, decrease it, increase and decrease it again, and, victory, the readyToParse curve has gone up, first with a few spikes, then finally with "normal" behaviour.
So is it possible that fetchingThreads somehow get stuck and don't fetch anything anymore?
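In case it helps to answer that, here is one way I would check from inside the agent's JVM whether the fetching threads are stuck; it's a generic sketch, and the "FetchingThread" name filter is an assumption about how the threads are labelled:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

// Generic sketch: tally the state and topmost stack frame of every thread
// whose name contains "FetchingThread" (assumed naming), so it is easy to
// see whether they are all parked on the same lock or blocked in socket reads.
public final class FetchingThreadProbe {
	public static void main(final String[] args) {
		final ThreadMXBean mx = ManagementFactory.getThreadMXBean();
		final Map<String, Integer> histogram = new TreeMap<>();
		for (final ThreadInfo info : mx.dumpAllThreads(false, false)) {
			if (!info.getThreadName().contains("FetchingThread")) continue;
			final StackTraceElement[] stack = info.getStackTrace();
			final String key = info.getThreadState() + " @ " + (stack.length > 0 ? stack[0] : "<no frames>");
			histogram.merge(key, 1, Integer::sum);
		}
		histogram.forEach((where, count) -> System.out.println(count + "\tthreads\t" + where));
	}
}
```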