LAW-Unimi / BUbiNG

The LAW next generation crawler.
http://law.di.unimi.it/software.php#bubing
Apache License 2.0

Distribution not working as expected #10

Closed: guillaumepitel closed this issue 7 years ago

guillaumepitel commented 7 years ago

I've been trying to use BUbiNG in a cluster (first on a local network, then on EC2). I'm using JGroups' S3_PING protocol for cluster connection, and the views from the JGroups messages (actually from the JGroupsJobManager) correctly show all cluster members. However, there is only one job manager. For a long time I thought this was normal and that everything was working correctly, but today I realized that the receivedURLs counter stays at 0 and that no jobs from other agents ever arrive on the nodes.

Here is a log sample from JGroups/JGroupsJobManager with a 2-node cluster (10.42.1.57 and 10.42.1.254):

Any help would be highly appreciated.

2017-10-05 21:02:26,084 7933 WARN [main] o.j.p.p.FLUSH - agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi: waiting for UNBLOCK timed out after 2000 ms
2017-10-05 21:02:26,084 7933 INFO [main] i.u.d.j.j.JGroupsJobManager - Currently knowing 1 job managers (1 alive)
2017-10-05 21:02:26,084 7933 DEBUG [main] i.u.d.j.j.JGroupsJobManager - Currently known remote job managers: {[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]}
2017-10-05 21:02:26,084 7933 DEBUG [main] i.u.d.j.j.JGroupsJobManager - Assignment strategy: [[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]]
2017-10-05 21:02:27,723 9572 INFO [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - New JGroups view [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi|1] (2) [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi]
2017-10-05 21:02:27,723 9572 INFO [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - New members: [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi]
2017-10-05 21:02:27,723 9572 INFO [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Currently knowing 1 job managers (1 alive)
2017-10-05 21:02:27,723 9572 DEBUG [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Currently known remote job managers: {[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]}
2017-10-05 21:02:27,725 9574 DEBUG [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Assignment strategy: [[agent (address=agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi, weight=1, suspected=false, disabled=false, pendingMessages=0)]]
2017-10-05 21:02:27,725 9574 DEBUG [ViewHandler,debug,agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi] i.u.d.j.j.JGroupsJobManager - Current JGroups view: [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi|1] (2) [agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.57:9999/jmxrmi, agent{1}@service:jmx:rmi:///jndi/rmi://10.42.1.254:9999/jmxrmi]
mapio commented 7 years ago

I assume you successfully ran the common JGroups troubleshooting procedure

https://docs.jboss.org/jbossas/docs/Clustering_Guide/4/html/ch07s07s11.html

to test the connectivity… I often find that the problem is related to the network configuration of the servers/routers/firewalls.
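For reference, that guide essentially boils down to running JGroups' bundled multicast test programs on two machines and checking that lines typed on the sender appear on the receiver. A sketch (class names as shipped with JGroups; the multicast address and port below are arbitrary examples):

# On machine A (receiver):
java -cp jgroups.jar org.jgroups.tests.McastReceiverTest -mcast_addr 230.1.2.3 -port 7500
# On machine B (sender); lines typed here should show up on machine A:
java -cp jgroups.jar org.jgroups.tests.McastSenderTest -mcast_addr 230.1.2.3 -port 7500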

vigna commented 7 years ago

In fact, I was rather surprised that multicast over UDP could work so smoothly in a cloud. I'd first try to make the JGroups troubleshooting procedure work. If there's no way, it is possible to choose another transport layer just by changing the JGroups XML configuration file.

guillaumepitel commented 7 years ago

I don't use the multicast transport: it won't work on Amazon EC2. I use something called S3_PING instead. But again, the ViewHandler correctly identifies the cluster members, so it seems to be working, except that there is no job manager other than the local one.

guillaumepitel commented 7 years ago

I would like to add some debug output to JAI4J, but I can't find the sources. Can you help?

mapio commented 7 years ago

Are you sure that S3_PING works for transport and not only for discovery? If I remember correctly, JGroups handles finding peers and communicating with them using different protocols. In our setting (which has always been a local cluster) we just used multicast, so I'm not able to help with other protocols.
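To make the layering concrete: in a JGroups stack the transport is the bottom protocol and discovery sits above it, so S3_PING only locates peers, while something like TCP must carry the actual traffic. A minimal sketch of an EC2-style stack follows (protocol and attribute names as in JGroups 3.x; all values are placeholders, not a tested BUbiNG configuration):

<config xmlns="urn:org:jgroups">
  <!-- Transport: carries all cluster traffic; no IP multicast needed. -->
  <TCP bind_port="7800"/>
  <!-- Discovery only: peers find each other through a shared S3 bucket. -->
  <S3_PING location="MY_BUCKET" access_key="MY_ACCESS_KEY" secret_access_key="MY_SECRET_KEY"/>
  <MERGE3/>
  <FD_SOCK/>
  <FD_ALL/>
  <VERIFY_SUSPECT/>
  <!-- Reliable delivery configured without multicast retransmission. -->
  <pbcast.NAKACK2 use_mcast_xmit="false"/>
  <UNICAST3/>
  <pbcast.STABLE/>
  <pbcast.GMS/>
  <MFC/>
  <FRAG2/>
</config>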

mapio commented 7 years ago

Currently JAI4J is not available via a publicly accessible repository. If you don't mind sending me an email at santini@di.unimi.it, I'll send you a tar of the sources.

guillaumepitel commented 7 years ago

OK, I've found the problem. It was clearly stated in comments in the source code, but it is not obvious from the configuration example in the overview.

Because I launch a variable number of machines, they all share the same properties file, in which I had never specified the property "name", which, as stated in the comment, must be unique in the cluster. Because of this, every time a machine connects it replaces the previously connected remote job manager instead of being added to the list.
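A hedged sketch of the fix: keep one shared properties file, but make the value of name differ per host (the property key "name" comes from this thread; the placeholder and the idea of substituting it from a launch script are just one possible approach):

# bubing.properties, shared by all machines in the crawl group.
# "name" must be UNIQUE per agent: two agents registering under the same
# name replace each other in the remote job-manager list instead of both
# being added, which is exactly the symptom described above.
# @HOSTNAME@ is a placeholder substituted by each machine's launch script.
name=agent-@HOSTNAME@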

So, on one hand, it's not really a bug; but on the other hand, I think the overview documentation should point out that, to use BUbiNG in a cluster setup, you have to give each machine a different name:

"In the standard BUbiNG setup, agents in the same crawl groups coordinates autonomously using consistent hashing, so if you want to perform a multi-agent crawl you must just be sure to have properly configured your hardware and JGroups so they work together, and give them a different name. A simple way to check that this is true is to start the crawl in pause mode, check from the logs that all agents are visible to each other, and then start the crawl."

It's obvious now, but finding the problem wasn't.
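To make the "coordinate autonomously using consistent hashing" part concrete, here is a minimal, self-contained sketch of the general technique (an illustration of consistent hashing as such, not BUbiNG's or JAI4J's actual code): each agent owns points on a hash ring, and a key such as a URL's host goes to the agent owning the first ring point at or after the key's hash. Two agents registered under the same name occupy the same ring points, so one silently replaces the other, which is the failure mode above.

import java.util.SortedMap;
import java.util.TreeMap;

/** Minimal consistent-hashing sketch: maps keys (e.g. URL hosts) to agent names. */
public class ConsistentHashSketch {
    private final SortedMap<Integer, String> ring = new TreeMap<>();
    private final int replicas; // virtual nodes per agent, for smoother balancing

    public ConsistentHashSketch(int replicas) { this.replicas = replicas; }

    /** Adding two agents with the SAME name creates identical ring points:
     *  the second silently replaces the first, exactly the symptom in this issue. */
    public void addAgent(String agentName) {
        for (int i = 0; i < replicas; i++)
            ring.put(hash(agentName + "#" + i), agentName);
    }

    /** Returns the agent responsible for the given key (e.g. a URL's host). */
    public String agentFor(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no agents");
        // First ring point at or after the key's hash, wrapping around if needed.
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        // Any reasonably uniform hash works for the sketch; production code
        // would use a stronger function (e.g. MurmurHash).
        int h = s.hashCode();
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        ConsistentHashSketch ch = new ConsistentHashSketch(100);
        ch.addAgent("agent-10-42-1-57");
        ch.addAgent("agent-10-42-1-254");
        System.out.println(ch.agentFor("example.com")); // prints one of the two agents
    }
}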

vigna commented 6 years ago

OK, we updated the documentation. It was sort of obvious to us, because we come from years of crawling with this kind of design; the input of people who are new to our crawlers is invaluable for finding such omissions in the documentation, so thanks!