LAW-Unimi / BUbiNG

The LAW next generation crawler.
http://law.di.unimi.it/software.php#bubing
Apache License 2.0

FetchingThreads seem to hang and do nothing #3

Closed guillaumepitel closed 7 years ago

guillaumepitel commented 7 years ago

I have observed several times that BUbiNG starts slowing down at some point. My impression is that the FetchingThreads start behaving oddly and no longer do their job. When I change the number of fetchingThreads manually, todoSize suddenly decreases and readyToParse increases.

Here are some graphs from when I play with "fetchingThreads". In the first half, before I change anything, todoSize keeps increasing. Then I increase the number of threads, decrease it, increase and decrease it again and, victory, the readyToParse curve has increased, first with a few spikes, then finally settling into a "normal" behaviour.

[Graphs: active fetching threads, todoSize, readyToParse]

So is it possible that the fetchingThreads somehow get stuck and don't fetch anything anymore?

guillaumepitel commented 7 years ago

After more tests, it seems that the main reason the crawls are slowing down is that the workbench is full. In the case described above, I'm not sure whether the problem is the threads getting stuck or the relatively low number of fetching threads, which could not keep up with the growth rate of the todo list.

vigna commented 7 years ago

A useful piece of information would be a global stack trace taken in the moment in which everything slows down. Then we can see what the threads are doing.
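One way to summarise what the threads are doing from inside the JVM is to count them by state using the standard `java.lang.management` API. The following is only a minimal sketch: the class name `ThreadStateSummary` and the filter string `"FetchingThread"` are assumptions made for illustration (the actual thread names used by BUbiNG may differ), and the code would have to run inside the crawler's own JVM to see its threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Map;
import java.util.TreeMap;

public class ThreadStateSummary {

    /** Counts live threads by state, keeping only threads whose name contains nameFilter. */
    public static Map<Thread.State, Integer> countByState(String nameFilter) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new TreeMap<>();
        // false, false: do not collect lock information, which keeps the dump cheap.
        for (ThreadInfo info : bean.dumpAllThreads(false, false)) {
            if (info == null) continue; // the thread ended while the dump was being taken
            if (info.getThreadName().contains(nameFilter)) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // "FetchingThread" is a hypothetical name filter; adjust it to the real thread names.
        System.out.println("Fetching threads by state: " + countByState("FetchingThread"));
        // An empty filter matches every thread in this JVM.
        System.out.println("All threads by state: " + countByState(""));
    }
}
```

If most of the matching threads are in `WAITING` or `BLOCKED` while todoSize keeps growing, the full stack traces of those threads would be the next thing to look at.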

guillaumepitel commented 7 years ago

So I've tried this, but without success. For now, my best guess is that the slowdown actually occurs when the workbench size is greater than the workbench maxbytesize.

vigna commented 7 years ago

Well, if that happens it means that you need a larger workbench. Politeness and workbench size must be somehow balanced.

I think you know it, but to get a stack trace you just need to send SIGQUIT to the process or press CTRL-\.

What are your current politeness and workbench size?

guillaumepitel commented 7 years ago

I have tried several settings, with several machine types and cluster sizes (I use EC2 clusters).

What currently works for me: 16 machines with 16 cores, 122GB RAM / politeness: 6 sec / WB size: 90GB

It also depends on the size of the seed. Larger seeds tend to make the crawl slow down sooner. My current seed is 50M URLs. Typical crawl speed is 500GB/h.
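To put these figures in per-machine terms, here is a small arithmetic sketch. It assumes that the 500GB/h figure is the total for the whole 16-machine cluster and that GB and MB are decimal units; neither is stated explicitly above, and the class and method names are purely illustrative.

```java
import java.util.Locale;

public class CrawlThroughput {

    /** Splits a cluster-wide rate evenly across machines (assumes a balanced load). */
    public static double perMachineGBPerHour(double clusterGBPerHour, int machines) {
        return clusterGBPerHour / machines;
    }

    /** Converts GB/h to MB/s using decimal units (1 GB = 1000 MB). */
    public static double gbPerHourToMBPerSecond(double gbPerHour) {
        return gbPerHour * 1000.0 / 3600.0;
    }

    public static void main(String[] args) {
        // Assumption: 500 GB/h is the aggregate rate of all 16 machines.
        double perMachine = perMachineGBPerHour(500.0, 16);
        System.out.println(String.format(Locale.ROOT,
                "%.2f GB/h per machine, about %.2f MB/s",
                perMachine, gbPerHourToMBPerSecond(perMachine)));
    }
}
```

Under those assumptions each machine fetches 31.25 GB/h, or roughly 8.68 MB/s.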