LAW-Unimi / BUbiNG

The LAW next generation crawler.
http://law.di.unimi.it/software.php#bubing
Apache License 2.0

WorkbenchEntry-based scheduling #16

Closed guillaumepitel closed 6 years ago

guillaumepitel commented 6 years ago

What follows is somewhat the opposite of the other issue I raised about enforcing the per-IP delay with per-host partitioning... so I guess I changed my mind about the priorities :)

This is more an open discussion than an issue per se, but here is what I have observed: BUbiNG's architecture is built around WorkbenchEntry-based scheduling. A WorkbenchEntry contains the list of VisitStates related to a given IP address, so that IP-based politeness can be enforced.

A problem occurs because it turns out that quite a large number of websites resolve to the same IP address. For instance, all hosts in *.blogspot.com resolve to the same address. This becomes a problem because the Workbench gets filled by these whales, which take up all the available space (and schemeAuthority2VisitState fills up with them as well), so in the end only a few WorkbenchEntry instances remain and the crawl is limited by the per-IP politeness (maybe also by other limiting factors, even though I don't think the acquire/release mechanism is a problem here).

Basically, it boils down to this: as time passes, the crawl gets slower and slower because of the workbench size limit and those huge workbench entries.

I see three possible solutions:

I believe only the first option actually works. I've found conflicting opinions on the matter, but a strong per-IP politeness does not seem to be the most commonly accepted rule. So I guess we could replace the IP politeness with an IP-politeness power factor alpha: alpha=1 is equivalent to a zero IP delay, and alpha=0 is the opposite, with IP delay = host delay.
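
The exact semantics of the power factor aren't spelled out above; as a purely hypothetical sketch (class and parameter names invented, and a plain linear interpolation used in place of an actual power law), the two endpoints could be matched like this:

```java
/**
 * Hypothetical interpolation matching the two endpoints described above:
 * alpha = 1 gives a zero IP delay, alpha = 0 gives an IP delay equal to the
 * host delay. The actual "power factor" semantics would still have to be
 * chosen; this linear form is only a placeholder to make the idea concrete.
 */
public final class IpDelaySketch {
	public static long effectiveIpDelay(final long hostDelayMillis, final double alpha) {
		if (alpha < 0 || alpha > 1) throw new IllegalArgumentException("alpha must be in [0, 1]");
		return Math.round((1 - alpha) * hostDelayMillis);
	}
}
```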

guillaumepitel commented 6 years ago

Actually, lowering the ipDelay is probably not enough because of the acquire/release mechanism on WorkbenchEntries, which takes at least the time of fetching and parsing one page.

guillaumepitel commented 6 years ago

After a bit of work and testing this weekend, here is what I came up with to apparently solve the "slowing down" problem (by the way, I think the problems I had with JGroups were more a symptom than a cause).

So here is my solution:

  1. create a ByteArrayDiskQueue in the Frontier to serve as a "safety valve" (similar to the readyURLs queue)
  2. in the Distributor, when a new URL is processed and it would lead to a new VisitState, check whether the total number of VisitStates is below a threshold. If it is not, add the URL to the safety-valve queue and do nothing (a sketch of this check follows the list)
  3. in the DNSThread, before calling setWorkbenchEntry on the VisitState found after a successful DNS resolution, check whether the WorkbenchEntry already contains a large number of hosts. If it does, purge the VisitState outside the usual purging mechanism (we don't want to set schemeAuthority2Count) and re-enqueue the URLs of the VisitState into the safety-valve queue
  4. in the Distributor, in addition to processing the readyURLs queue, process the safety-valve queue so that it gets emptied
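
A minimal sketch of the step-2 check, with invented names (SafetyValveSketch, maybeDefer, maxVisitStates) and an in-memory queue standing in for the disk-backed ByteArrayDiskQueue; the real Distributor code would of course look different:

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of steps 1-2: defer URLs that would create a new
 *  VisitState when too many VisitStates already exist. Names are placeholders
 *  and an in-memory queue stands in for the disk-backed ByteArrayDiskQueue. */
public class SafetyValveSketch {
	private final int maxVisitStates;
	private final Queue<byte[]> safetyValveURLs = new ArrayDeque<>();

	public SafetyValveSketch(final int maxVisitStates) {
		this.maxVisitStates = maxVisitStates;
	}

	/** Returns true if the URL was deferred to the safety-valve queue instead
	 *  of being allowed to create a new VisitState. */
	public boolean maybeDefer(final byte[] url, final boolean wouldCreateNewVisitState, final int currentVisitStates) {
		if (wouldCreateNewVisitState && currentVisitStates >= maxVisitStates) {
			safetyValveURLs.add(url);
			return true;
		}
		return false;
	}
}
```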

For now my implementation of these anti-OOM / GC-explosion measures is not super clean. I would like your opinion on the matter, to see whether it's worth submitting a PR.

guillaumepitel commented 6 years ago

Ok, I found the last (I hope) memory "leak" problem, and it's not strictly related to BUbiNG: if you want long-running BUbiNG jobs, you should really set the property javax.net.ssl.sessionCacheSize to a reasonable value (I think a few thousand is OK). Otherwise, all your HTTPS queries will fill the Java SSL session cache, which is unbounded by default. And although it is supposed to be made of soft references, the full-GC wall is still quickly hit.

vigna commented 6 years ago

On the very last issue: thanks for finding this. Are you using something like

-Djavax.net.ssl.sessionCacheSize=8192

?

vigna commented 6 years ago

So, we thought a bit about this issue. The main consideration is that the treatment for "too many visit states" and "too many visit states for an IP" can be done very differently.

The "too many visit states for an IP", if the limit is large enough, can be solved simply by dropping additioanal visit states. If you have, say, 10000 hosts for an IP, even with 0 ipDelay the current architecture will make it simply impossible to visit all of them.

The "too many visit states" problem can be solved, as you were saying, very easily: when a URL is taken from the ready URL queue and it would generate a new visit state (l. 155 of Distributor.java) one checks how many visit states are around, and if they are more than a threshold the URL is put in a queue similar to readyURLs (like, newHostURLs). Now, whenever we would read from the readURL queue (l. 145) we check first whether there is a newHostURL, and use that.

How does that feel?

guillaumepitel commented 6 years ago

Yes, that's exactly what I did (for the javax.net.ssl... and for the newHostURLs)

Regarding the "too many hosts per IP" problem: dropping is certainly a good approximation. Some strange sites have a one-page-per-host structure, which may still make it worth keeping them, but it's certainly a hard problem.

vigna commented 6 years ago

Would you provide a pull request?

guillaumepitel commented 6 years ago

I'm working on it. The solution I've written so far (which is almost exactly what you proposed) brings a new problem: the ByteArrayDiskQueue is never completely emptied (unlike receivedURLs and readyURLs), and as a result the file it uses grows continuously, even though the size of the queue itself stays quite stable.

One possibility would be to drain the "safety valve" queue into the readyURLs queue when flushing the sieve (at that point, the readyURLs queue is empty). I'm pretty sure it's not optimal.

Another possibility would be to use a multi-file version of the ByteDiskQueue which would be able to remove old blocks of data.

Or maybe you already have a disk-backed queue structure that supports this use case? I'm going to test the "draining" method to see how it performs.
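
For what it's worth, a minimal sketch of the draining idea, with invented names and in-memory queues standing in for the disk-backed ones (the actual ByteArrayDiskQueue API is not assumed here):

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of the draining idea: when the sieve is flushed and
 *  readyURLs is empty, move every deferred URL from the safety-valve queue
 *  into readyURLs, leaving the safety-valve queue empty so its backing file
 *  can be reclaimed. In-memory queues stand in for the disk-backed ones. */
public class DrainSketch {
	public static void drain(final Queue<byte[]> safetyValveURLs, final Queue<byte[]> readyURLs) {
		byte[] url;
		while ((url = safetyValveURLs.poll()) != null) readyURLs.add(url);
	}

	// Example usage with stand-in queues:
	public static void main(final String[] args) {
		final Queue<byte[]> valve = new ArrayDeque<>();
		final Queue<byte[]> ready = new ArrayDeque<>();
		valve.add("http://example.com/".getBytes());
		drain(valve, ready);
		System.out.println("ready now holds " + ready.size() + " URL(s)");
	}
}
```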