Actually, lowering the ipDelay is probably not enough, because of the acquire/release mechanism on workbench entries, which takes at least a page fetch + parse time.
After a bit of work and testing this weekend, here are my solutions that appear to fix the "slowing down" (by the way, I think the problems I had with JGroups were more a symptom than a cause):
So here is my solution:
For now, my implementation of these anti-OOM / anti-GC-explosion measures is not super clean. I would like your opinion on the matter to see if it's worth submitting a PR.
OK, found the last (I hope) memory "leak" problem, and it's not strictly related to BUbiNG: if you want long-running BUbiNG jobs, you should really set the property javax.net.ssl.sessionCacheSize to a reasonable value (I think a few thousand is OK). Otherwise, all your HTTPS requests will fill the Java SSL session cache, which is unbounded by default. And although it is supposed to be made of "Soft" references, the Full GC wall is still quickly hit.
On the very last issue: thanks for finding this. Are you using something like
-Djavax.net.ssl.sessionCacheSize=8192
?
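For reference, the same bound can also be applied programmatically instead of via the flag. A minimal sketch, assuming the crawler's HTTP client uses the JVM's default SSLContext (if it builds its own context, the cache would have to be bounded on that context instead):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

public class BoundSslSessionCache {
    public static void main(String[] args) throws Exception {
        // Similar effect to -Djavax.net.ssl.sessionCacheSize=8192, applied at startup
        // to the default client-side session context.
        SSLSessionContext clientSessions = SSLContext.getDefault().getClientSessionContext();
        clientSessions.setSessionCacheSize(8192); // bound the cache (unbounded by default)
        clientSessions.setSessionTimeout(3600);   // optionally expire cached sessions after an hour
        System.out.println("SSL session cache size: " + clientSessions.getSessionCacheSize());
    }
}
```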
So, we thought a bit about this issue. The main consideration is that the "too many visit states" and "too many visit states for an IP" problems can be handled very differently.
The "too many visit states for an IP", if the limit is large enough, can be solved simply by dropping additioanal visit states. If you have, say, 10000 hosts for an IP, even with 0 ipDelay the current architecture will make it simply impossible to visit all of them.
The "too many visit states" problem can be solved, as you were saying, very easily: when a URL is taken from the ready URL queue and it would generate a new visit state (l. 155 of Distributor.java) one checks how many visit states are around, and if they are more than a threshold the URL is put in a queue similar to readyURLs (like, newHostURLs). Now, whenever we would read from the readURL queue (l. 145) we check first whether there is a newHostURL, and use that.
How does that feel?
Yes, that's exactly what I did (for the javax.net.ssl... and for the newHostURLs)
Concerning the "too many hosts per IP" problem: dropping is certainly a good approximation. Some strange sites have a one-page-per-host structure which may still make them worth keeping, but it's certainly a hard problem.
Would you provide a pull request?
I'm working on it. The solution I've written for now (which is almost exactly what you proposed) brings a new problem: the ByteArrayDiskQueue is never completely emptied (unlike receivedURLs and readyURLs), and as a result the file it uses grows continuously, even though the size of the queue itself is quite stable.
A possibility would be to drain the "safety valve" queue into the readyURLs queue when flushing the sieve (at that point, the readyURLs queue is empty). I'm pretty sure it's not optimal.
Another possibility would be to use a multi-file version of the ByteDiskQueue which would be able to remove old blocks of data.
Or maybe you already have a disk-backed queue structure that can support this use case? I'm going to test the "draining" method to see how performant it is.
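A minimal sketch of the draining idea, written against a hypothetical disk-queue interface rather than the real ByteArrayDiskQueue API:

```java
import java.io.IOException;

// Hypothetical minimal interface; the real ByteArrayDiskQueue has its own API.
interface UrlDiskQueue {
    void enqueue(byte[] url) throws IOException;
    byte[] dequeue() throws IOException; // null when the queue is empty
    void clear() throws IOException;     // lets the backing file be truncated/reused
}

final class SafetyValveDrainer {
    /** Called right after the sieve is flushed, when readyURLs is momentarily empty. */
    static void drain(final UrlDiskQueue newHostURLs, final UrlDiskQueue readyURLs) throws IOException {
        byte[] url;
        while ((url = newHostURLs.dequeue()) != null) {
            readyURLs.enqueue(url); // move parked URLs back into the normal flow
        }
        newHostURLs.clear(); // reclaim the disk space instead of letting the file grow forever
    }
}
```

The point of draining at sieve-flush time is simply that the safety-valve file then gets emptied periodically instead of growing without bound.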
What follows is a bit the opposite of the other issue I raised about enforcing the per-IP delay with per-host partitioning... so I guess I changed my mind about the priorities :)
This is more an open discussion than an issue per se, but here is what I have observed: BUbiNG's architecture is based on WorkbenchEntry-based scheduling; a WorkbenchEntry contains the list of VisitStates related to a given IP address, so that IP-based politeness can be enforced.
A problem occurs because quite a large number of websites resolve to the same address; for instance, all hosts in *.blogspot.com resolve to the same address. The Workbench ends up filled by these whales, which take up all the available space (and schemeAuthority2VisitState is consequently filled by them too), so in the end only a few WorkbenchEntries remain, and the crawl is limited by the per-IP politeness (maybe also by other limiting factors, even though I don't think the acquire/release mechanism is a problem here).
Basically, it boils down to this: as time passes, the crawl will get slower and slower because of the workbench size limit and those huge workbench entries.
I see three possible solutions:
I believe only the first option actually works. I've found conflicting opinions on the matter, but strong per-IP politeness does not seem to be the most commonly accepted rule. So I guess we could replace the IP politeness with an IP-politeness power factor alpha: alpha=1 is equivalent to a zero IP delay, and alpha=0 is the opposite, with IP delay = host delay.
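To make the endpoints concrete, here is one purely hypothetical reading of that power factor (the exact formula is not specified above; a linear (1 - alpha) * hostDelay interpolation would satisfy the same endpoints):

```java
// Hypothetical sketch only: it reproduces the two stated endpoints,
// alpha = 1 -> essentially no IP delay, alpha = 0 -> IP delay equal to the host delay.
public class IpPolitenessPowerFactor {
    static long effectiveIpDelayMs(final long hostDelayMs, final double alpha) {
        return Math.round(Math.pow(hostDelayMs, 1.0 - alpha));
    }

    public static void main(String[] args) {
        final long hostDelayMs = 4000; // assumed per-host delay, not a BUbiNG default
        for (double alpha : new double[] { 0.0, 0.5, 1.0 }) {
            System.out.printf("alpha = %.1f -> IP delay = %d ms%n",
                    alpha, effectiveIpDelayMs(hostDelayMs, alpha));
        }
    }
}
```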