Netflix / edda

AWS API Read Cache
Apache License 2.0

Issue with leader election with two nodes #41

Closed: rayrod2030 closed this issue 8 years ago

rayrod2030 commented 11 years ago

I have been trying to run a two-node Edda cluster, with one node serving as the elected leader and primary AWS crawler and the other acting as a powerless citizen that merely serves API requests. This setup worked well for about 35 minutes as I monitored CPU usage and the leader election results in the logs on both nodes. However, after about 35 to 40 minutes both nodes go into a state where each thinks it is the leader (a sort of split brain): both nodes start crawling AWS, and the Java process on both nodes (m1.mediums) pins the CPU at 100%.

I'm just wondering if anyone else has run into a similar issue running an Edda cluster. Would a three-node cluster be a better fit for the leader election algorithm used? As an alternative, is there any way to force a node to operate only as an API node? Would you say running leader election more often is better?

coryb commented 11 years ago

Hi Ray,

That is unusual; I have not seen a case where there are multiple leaders. My guess is that you are running out of memory and the GC is pinning the CPU. At that point I can sort of see multiple leaders, since neither would be effectively claiming leadership, so both would assume the other is dead and try to take over. Adding more instances should not affect the election process. All instances race to claim leadership via MongoDB atomic writes, and only one wins. The leader typically updates a timestamp every 10s; after 30s with no update, the other instances race again to take leadership.
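
For anyone following along, the mechanism Cory describes reduces to a single conditional write against a well-known document, so at most one instance can win each round. Below is a minimal sketch of that lease pattern using the MongoDB Java driver; the collection layout, field names, and timings are assumptions for illustration only, not Edda's actual (Scala) implementation.

```java
// Sketch of the "race via atomic write" leader lease described above.
// Document layout and names are hypothetical; Edda's real schema differs.
import com.mongodb.MongoWriteException;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;
import org.bson.conversions.Bson;

public class LeaderLease {
    private static final String LEADER_DOC_ID = "leader"; // single well-known document
    private static final long LEASE_MS = 30_000;          // stale after 30s without a heartbeat

    private final MongoCollection<Document> coll;
    private final String instanceId;

    public LeaderLease(MongoCollection<Document> coll, String instanceId) {
        this.coll = coll;
        this.instanceId = instanceId;
    }

    /** Atomically claim or renew the lease; returns true if this instance is now leader. */
    public boolean tryClaim() {
        long now = System.currentTimeMillis();
        // Claimable if we already own the lease, or the current holder's heartbeat is stale.
        Bson claimable = Filters.and(
                Filters.eq("_id", LEADER_DOC_ID),
                Filters.or(
                        Filters.eq("instance", instanceId),
                        Filters.lt("heartbeat", now - LEASE_MS)));
        Bson claim = Updates.combine(
                Updates.set("instance", instanceId),
                Updates.set("heartbeat", now));
        try {
            UpdateResult res = coll.updateOne(claimable, claim, new UpdateOptions().upsert(true));
            return res.getMatchedCount() > 0 || res.getUpsertedId() != null;
        } catch (MongoWriteException e) {
            // Lost a duplicate-key race on the upsert: another instance claimed leadership first.
            return false;
        }
    }
}
```

Each instance would call tryClaim() on a timer (roughly every 10 seconds, per the description above); losing simply means another instance currently holds the lease and its heartbeat is still fresh.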

There are currently no options to force leader or to force non-leader.

I would assume that m1.mediums would be OK for most people at 3.5G of RAM. Are you allocating most of the RAM to tomcat/jetty via the -Xmx java option?

I also recently found a problem in Edda that can cause a memory leak, so this might be what you are running into. I will try to get the fix pushed up to GitHub soon; I have been tinkering with it for about a week at Netflix and it seems better here.

-Cory

rayrod2030 commented 11 years ago

I'm currently setting -Xmx and -Xms at 2048m for an m1.medium. I haven't done JVM tuning in a while, so I'm not sure the -Xmx == -Xms pattern is still the right way to go. I'll try bumping both nodes up to 3.5G for just -Xmx and see what happens. I've also set the leader election timings down to 10 seconds; it might be a bit chatty, but we'll see how it goes. I think you are right, though, that the leader eventually gets so bogged down by memory or CPU constraints that it fails to update or query Mongo for leader election, hence this sort of split-brain behavior.
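
Not Edda-specific, but since -Xmx/-Xms values are easy to lose in Tomcat/Jetty wrapper scripts, a quick sanity check is to print the heap the JVM actually ended up with; this is just a generic snippet, not part of Edda.

```java
// Prints the heap the running JVM actually received, to confirm -Xmx/-Xms took effect.
public class HeapCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("max heap:   %d MB%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("total heap: %d MB%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("free heap:  %d MB%n", rt.freeMemory() / (1024 * 1024));
    }
}
```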

We are processing about 7 accounts' worth of resources, but only two of those accounts have many resources to crawl, so I highly doubt our resource load comes anywhere close to the number of nodes Netflix crawls. Will report back with my findings.

rayrod2030 commented 11 years ago

OK, so it looks like upping the memory on both instances has resulted in much more reliable leader election and failover. The one behavior I do notice from running a cluster of two m1.mediums is that when a new leader is elected, the change in workload from being merely an API node to being the primary AWS crawling node seems to keep the leader election thread from kicking in on the new leader for 60 to 80 seconds. I imagine this temporary delay in cementing leadership could cause some flapping back and forth between nodes if there are other nodes available to grab the leader role. I wonder if there is any way to prioritize the leader election thread, to prevent the delay before a new leader node pings Mongo as leader. Either way, I know leader election is a PIA, and this is a good implementation as long as the JVM is tuned properly. Thanks for your help with this!
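
On the starvation point: in generic JVM terms, one way to keep the heartbeat from competing with crawl work is to run it on its own single-thread scheduler, optionally at elevated thread priority. The sketch below builds on the hypothetical LeaderLease class from earlier in the thread and is not a description of how Edda actually schedules its elector.

```java
// Run the leader heartbeat on a dedicated, high-priority daemon thread so that
// heavy crawler activity cannot delay lease renewal. LeaderLease is the
// hypothetical class sketched earlier, not Edda's API.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatRunner {
    public static ScheduledExecutorService start(LeaderLease lease) {
        ScheduledExecutorService heartbeat = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "leader-heartbeat");
            t.setDaemon(true);
            t.setPriority(Thread.MAX_PRIORITY); // a hint only; the OS scheduler may ignore it
            return t;
        });
        // Renew the lease every 10 seconds regardless of what the crawler threads are doing.
        heartbeat.scheduleAtFixedRate(() -> {
            try {
                boolean leader = lease.tryClaim(); // the crawler would consult this flag
            } catch (RuntimeException e) {
                // Swallow so a single failed heartbeat does not cancel the schedule.
            }
        }, 0, 10, TimeUnit.SECONDS);
        return heartbeat;
    }
}
```

Note that thread priority alone will not help if the whole JVM is stalled in GC, which matches the earlier diagnosis that memory pressure was the real root cause here.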