guillaumepitel closed this issue 7 years ago
I assume you successfully ran the standard JGroups troubleshooting procedure
https://docs.jboss.org/jbossas/docs/Clustering_Guide/4/html/ch07s07s11.html
to test the connectivity. I often find that the problem is related to the network configuration of the servers/routers/firewalls.
In fact, I would be rather surprised if UDP multicast worked smoothly in a cloud environment. I'd first try to get the JGroups troubleshooting test to work. If there's no way, it is possible (just by changing the JGroups XML configuration file) to choose another transport layer.
I don't use the multicast transport; it won't work on Amazon EC2. I use something called S3_PING. But again, the ViewHandler correctly identifies the cluster members. So it seems to be working, except that there is no JobManager other than the one on the local host.
I would like to add some debugging output in JAI4J, but I can't find the sources. Can you help?
Are you sure that S3_PING works for transport and not only for discovery? If I remember correctly, JGroups handles finding peers and communicating with them using different protocols. In our setting (which has always been a local cluster), we just used multicast, so I'm not able to help with other protocols.
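For reference, in a JGroups protocol stack the transport (e.g. UDP or TCP) and the discovery protocol (e.g. PING, S3_PING) are separate layers. A minimal sketch of a stack that uses TCP as the transport and S3_PING only for discovery might look like the following; the bucket name and credentials are placeholders, and the exact set of protocols and attributes depends on the JGroups version:

```xml
<config xmlns="urn:org:jgroups">
  <!-- TCP carries the actual traffic between cluster members -->
  <TCP bind_port="7800"/>
  <!-- S3_PING handles discovery only: members advertise their address in an S3 bucket -->
  <S3_PING location="my-bucket"
           access_key="MY_ACCESS_KEY"
           secret_access_key="MY_SECRET_KEY"/>
  <MERGE3/>
  <FD_SOCK/>
  <VERIFY_SUSPECT/>
  <pbcast.NAKACK2 use_mcast_xmit="false"/>
  <UNICAST3/>
  <pbcast.STABLE/>
  <pbcast.GMS/>
  <MFC/>
  <FRAG2/>
</config>
```

So S3_PING by itself does not replace multicast as a transport; it only replaces multicast-based peer discovery.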
Currently JAI4J is not available via a publicly accessible repository. If you don't mind sending me an email at santini@di.unimi.it, I'll send you a tarball of the sources.
OK, I've found the problem. It was clearly stated in the comments in the source code, but not obvious from the configuration example in the overview.
Because I launch a variable number of machines, they all share the same properties file, in which I never specified the property "name", which, as stated in the comment, must be unique within the cluster. Because of this, every time a machine connects, it replaces the previously connected remote job manager instead of being added to the list.
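The replacement behavior can be illustrated with a plain HashMap keyed by the agent's "name" property. This is only a sketch of the failure mode, not the actual JAI4J internals; the class and key names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

public class NameCollisionDemo {
    public static void main(String[] args) {
        // Hypothetical registry of remote job managers, keyed by agent name.
        Map<String, String> remoteJobManagers = new HashMap<>();

        // Two machines sharing one properties file register under the same name:
        remoteJobManagers.put("agent", "10.42.1.57");
        remoteJobManagers.put("agent", "10.42.1.254"); // replaces the first entry

        System.out.println(remoteJobManagers.size()); // prints 1

        // With unique names, both remote job managers are retained:
        remoteJobManagers.clear();
        remoteJobManagers.put("agent-1", "10.42.1.57");
        remoteJobManagers.put("agent-2", "10.42.1.254");
        System.out.println(remoteJobManagers.size()); // prints 2
    }
}
```

This is why the views showed all members while only one remote job manager was ever visible: each new connection overwrote the previous one under the shared name.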
So, on the one hand it's not a bug, but on the other hand, I think the overview documentation should point out that, to use BUbiNG in a cluster setup, you have to specify a different name for each machine:
"In the standard BUbiNG setup, agents in the same crawl group coordinate autonomously using consistent hashing, so if you want to perform a multi-agent crawl you just have to make sure that your hardware and JGroups are properly configured so they work together, and that each agent is given a different name. A simple way to check that this is the case is to start the crawl in pause mode, check from the logs that all agents are visible to each other, and then start the crawl."
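Concretely, each machine needs its own value for the "name" property in its properties file. The sketch below is illustrative; only the "name" key is taken from the discussion above, and the values are placeholders:

```properties
# Properties file on machine 1 (must be unique in the cluster):
name=agent-1

# Properties file on machine 2:
name=agent-2
```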
It's obvious now, but finding the problem wasn't easy.
OK, we updated the documentation. It was sort of obvious to us because we come from years of crawling with this kind of design; the input of people who are new to our crawlers is invaluable for finding such omissions in the documentation, so thanks!
I've been trying to use BUbiNG in a cluster (first on a local network, then on EC2). I'm using JGroups' S3_PING protocol for cluster connection, and the views from the JGroups messages (actually from the JGroupsJobManager) correctly show all cluster members. However, there is only one JobManager. For a long time I thought this was normal and everything was working correctly, but today I realized that the receivedURLs counter stays at 0 and that no Jobs from other agents ever arrive on the nodes.
Here is a log sample from JGroups/JGroupsJobManager with a 2-node cluster (10.42.1.57 and 10.42.1.254):
Any help would be highly appreciated.