Just suggesting something here: instead of fixed IPs, the nodes could use dynamic DNS and register their DNS names with a dynamic DNS server, so that even if their IPs change they could still be contacted by DNS name. Would that help?
Thanks for the input! Our original issue with that is that geth can't handle enode addresses involving DNS names. However, we talked about it today and @john-osullivan thinks it will be reasonable to extend geth to handle DNS bootnodes.
A dynamic DNS server might be something to think about; it sounds like it could be cheaper. John and I were just thinking of throwing an ELB in front of each bootnode. I'm generally in favor of pushing as much load onto AWS as possible, even if it means throwing money at the problem. With that in mind, if dynamic DNS necessarily means maintaining our own DNS server, I might prefer just going with the ELB solution.
For the sake of documentation, here's a thorough write-up of what I've learned about the problem and how to tackle it, based on my conversations with Louis & Juan yesterday. I'm still cleaning up the plaintext passwords thing, but I think I've got a handle on this.
There seems to be one key unresolved question here: do running nodes need an updated list of bootnodes? Juan thought they could discover all new peers from each other, so the list only needs to be valid at start time, but Louis thought they always need a list of active bootnodes. The latter requirement makes the problem a good bit harder, as we need to signal active nodes that they need to update the bootnodes in their supervisor config (see bottom). Assuming the list only needs to be accurate at the start, here are a couple of approaches for ensuring that the list is updated each time a new bootnode is added:
- Extending the recovery case of the init-quorum script to use the new IP, then saving it to Vault/GitHub for other nodes to read later on
- Storing DNS names instead of IPs and having geth resolve those addresses.

The former approach would mostly involve modifying that recovery case of the init-quorum script, although we would need the script to get write auth for the right Vault/GitHub endpoints. The latter approach means the addresses stored in Vault/GitHub should never have to change -- this option seems more promising.
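For comparison, here's a rough sketch of what the write-back step of the former approach could look like. The Vault path, field name, and region/index variables are all hypothetical, and it assumes the instance's Vault policy allows writes there:

```bash
#!/bin/bash
# Hypothetical recovery-case snippet: publish this bootnode's new IP so that
# nodes booting later pick it up. Path and field names are illustrative only.
set -euo pipefail

REGION="us-east-1"        # example region
BOOTNODE_INDEX=0          # example index within the region
NEW_IP=$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)

# Requires write access to this (made-up) Vault path.
vault write "quorum/bootnodes/${REGION}/${BOOTNODE_INDEX}" ip="${NEW_IP}"
```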
The required geth modifications look pretty straightforward. Those enode URLs (enode://[hexUsername]@[IP address]:[TCP port]?discport=[UDP port]) strictly use IP addresses, but like Louis said, it should just be a couple lines in the CLI parser to resolve DNS addresses down to IPs.
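To illustrate the resolution step (whether it ends up inside geth's CLI parser or in one of our wrapper scripts), here's a minimal shell sketch that turns a DNS-based bootnode entry into the IP-based enode URL geth accepts today. The hostname, pubkey, and port are placeholders:

```bash
#!/bin/bash
# Sketch: resolve a DNS-based bootnode entry down to the IP-based enode URL
# that geth currently requires. Hostname/pubkey/port below are placeholders.
set -euo pipefail

BOOTNODE_DNS="bootnode-0.example.internal"
BOOTNODE_PUBKEY="abcd1234"      # placeholder for the real 128-hex-char node key
BOOTNODE_PORT=30301

BOOTNODE_IP=$(getent hosts "$BOOTNODE_DNS" | awk '{print $1; exit}')
ENODE="enode://${BOOTNODE_PUBKEY}@${BOOTNODE_IP}:${BOOTNODE_PORT}"
echo "$ENODE"
```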
That reduces the problem to making sure that we recover the node to the same DNS address each time. There are a few ways to do that; it sounds like an ELB would be one easy way. I still haven't done much research on that side of it, so I'm not sure what the proper solution is. I heard something about Auto-Scaling Groups; that sounds like fun.
Appendix re: Live Updates
Louis and I explored how we could tell live nodes that they need to update -- one interesting solution is creating a FRESH_BOOTNODES ethereum event, then having these nodes run a Python script which listens for it (like we have for block metrics). Receiving this event would trigger a failover process that fetches new bootnode addresses and pauses geth (like here) while we rewrite the supervisor config. If we did want to do that, we'd want to consider building that event straight into the governance -- we don't want random people saying there are new bootnodes, so we could ensure that only one of our validator nodes is allowed to emit the event.
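If we went that route, the failover handler triggered by the event might look roughly like this. The supervisor program name, Vault path, and config location are all hypothetical, not what the cluster actually uses:

```bash
#!/bin/bash
# Hypothetical handler for a FRESH_BOOTNODES event: pause geth, rewrite the
# bootnodes flag in the supervisor config, then resume. Names are illustrative.
set -euo pipefail

NEW_BOOTNODES=$(vault read -field=enodes quorum/bootnodes/current)   # example path
CONF=/etc/supervisor/conf.d/geth.conf                                # example location

supervisorctl stop quorum                                            # pause geth
sed -i "s|--bootnodes [^ ]*|--bootnodes ${NEW_BOOTNODES}|" "$CONF"   # swap the flag
supervisorctl update                                                 # reload config
supervisorctl start quorum                                           # resume geth
```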
How many bootnodes are there?
Asking this because my software upgrade mechanism plans to bring down nodes in order to upgrade their software, and if a particular node is down, bootnodes shouldn't tell other nodes the downed node is available.
I believe there are 3 bootnodes for each of the 14 supported regions, so 42.
As far as I know, we don't have an easy way to detect when a node is going down. The failure event might be sudden, so we might not be able to call some graceful exit procedure.
Louis' advice to me was to focus on when nodes are being turned on, and then detecting whether they're new nodes or replacement ones. We've already got a running process to hook into on bootup, so it saves us the headache of determining whether a network participant is really down or not.
EDIT: @Lsquared13 & @eximchain137 (Juan?) can comment more on this, but I believe bootnodes just advertise the list of peers which they're currently connected to. If you kill a node and it stops being connected anywhere, that might automatically solve the advertising problem.
@EximChua - You don't need to worry about bootnodes pointing to dead instances. Nodes may attempt a best-effort shutdown, but in general they can die without telling the bootnode. The underlying technology was built for a public network, so it has to handle such things gracefully.
@Lsquared13 Thanks for the clarification, Louis!
The solution is starting to get clearer here. We need to replace each of our 42 bootnode instances with a Load Balancer + AutoScaling Group. The ASG will let us say, "Make sure there's always an instance here", without having to worry about actually replacing it ourselves. The LB will give us a static IP address which is attached to the ASG, so all of the failover work happens automagically. One happy consequence of this strategy is that we don't need to break the enode protocol, as we'll have a static IP dedicated to the bootnode (or really to the current instance acting as a bootnode).
There are 3 types of load balancer (application, network, & classic); we want the faster, lower-level network balancer, which gives us the static IP. Terraform's documentation for the network load balancer and the autoscaling group is pretty good; I'm still getting acquainted with how we specify everything.
One wrinkle is that the IP is now known by the LB, rather than by the instance. The LB needs to somehow tell the instance what address it's sitting behind. One good thing is that the bootnode doesn't need to know that IP value until it wants to start advertising its enode address, so we might be able to turn on the bootnode and have it fetch that value before actually initializing geth.
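As a sketch of that sequencing, the init script could block until the address file written by user-data exists, then start geth with the LB address as its externally advertised IP via geth's --nat extip option. The file path and datadir here are placeholders, and most geth flags are omitted:

```bash
#!/bin/bash
# Sketch: wait for the load balancer address that user-data drops into a data
# file, then start geth advertising that address. Paths are placeholders.
set -euo pipefail

ADDR_FILE=/opt/quorum/info/lb-address.txt   # hypothetical data file from user-data

# Wait until Terraform/user-data has written the LB address.
while [ ! -s "$ADDR_FILE" ]; do sleep 5; done

LB_IP=$(cat "$ADDR_FILE")
# Other geth flags omitted; --nat extip makes geth advertise the given IP.
exec geth --nat "extip:${LB_IP}" --datadir /opt/quorum/data
```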
One way to potentially make this design a little cheaper is to use one LB per region that has at least three availability zones, then tie each AZ to a bootnode. Technically, the network LB gives you one static IP per AZ. If we wanted to save on LBs, we could do some fancy work that only initializes a second one if there aren't three AZs in the given region. That said, it does introduce some complication, having to track which ASG is tied to which AZ and preserving all of those hookups within one LB -- we should quantify the cost of the LB+ASG-per-bootnode strategy and see how much we'd really save by reducing our LB count.
All that aside, I'm still cleaning up my big update PR -- just wanted to document our conversation from yesterday someplace better than a Sublime note.
That PR got merged in on Tuesday, so resolving this issue is now my top priority. Working on it in the lb-asg-bootnodes branch of my fork.
John, if the IP is that of the LB and not the instance, my upgrader would have trouble upgrading the software. Would you be able to provide the IP or DNS name of the instance?
Glad you pointed the issue out! The question is when & how you want to retrieve the instance's IP. The instance will eventually be replaced over time by the autoscaling group, so we need the LB to give us a static IP.
I don't know exactly how, but we might get the IP of the currently available instance by calling the AWS API from your updater at runtime.
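Something like this might work (untested, and the ASG name is a placeholder); ASGs automatically tag their instances with aws:autoscaling:groupName, so the updater could filter on that:

```bash
#!/bin/bash
# Sketch: look up the public IP of the single running instance in a bootnode's
# auto scaling group. The ASG name is a placeholder.
set -euo pipefail

ASG_NAME="bootnode-us-east-1-0"   # hypothetical ASG name

aws ec2 describe-instances \
  --filters "Name=tag:aws:autoscaling:groupName,Values=${ASG_NAME}" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].PublicIpAddress' \
  --output text
```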
How does this load balancer break the updater? Louis might also have more insight than me into how the LBs, ASGs, and EC2 instances are all interacting here.
The upgrader uses either IP address or DNS to locate & update the software on the machines.
It uses SSH to connect to the target machines, and then secure copy (scp) to transfer files.
If there's an LB in front of 2 or more nodes, then the upgrader can only update the node the LB is currently routing to. The other nodes wouldn't be updated until the LB routes to them.
I'm not actually convinced there's a problem here. Is there a reason we can't use the LB DNS/IP within the network for making connections and the direct DNS privately for doing updates?
Also @john-osullivan can we just have terraform + user-data fill the Load Balancer IP or DNS into a data file? I don't think that would force a circular dependency...
@Lsquared13 Yup, I'm getting the load balancer's DNS by writing it to a data file. The DNS is available as an attribute of aws_lb, but the IP isn't -- that's why I'm writing an additional script to resolve the DNS to an IP in init-bootnode.sh.
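For reference, the resolution step can be as small as something like this; the data file paths are placeholders for wherever user-data drops the DNS name:

```bash
#!/bin/bash
# Sketch: resolve the load balancer's DNS name (written to a data file by
# user-data) into an IP during init-bootnode.sh. Paths are placeholders.
set -euo pipefail

LB_DNS=$(cat /opt/quorum/info/lb-dns.txt)    # hypothetical data file
LB_IP=$(dig +short "$LB_DNS" | head -n 1)    # take the first A record
echo "$LB_IP" > /opt/quorum/info/lb-ip.txt
```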
Also, @EximChua , we might be alright here because we're designing the system such that each load balancer gets its own node. Every bootnode will get one load balancer which points at one autoscaling group, and the autoscaling group has size one. We aren't actually trying to balance load across many machines, just want to ensure that we have a static IP which will always be pointing to some machine.
If you get the guarantee that each LB DNS only points to one machine, does that solve your problem? Note that the specific machine might change over time as dead instances are replaced, so we definitely need that verification code which checks whether a machine has gotten an upgrade.
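For that verification piece, a rough sketch of what the check could look like over SSH -- the host list, user, and version command are assumptions about the upgrader, not how it actually works:

```bash
#!/bin/bash
# Sketch: after a deploy, check over SSH whether each bootnode host is running
# the expected version; flag any host that still needs the upgrade.
# Hostnames, user, and version command are illustrative assumptions.
set -euo pipefail

EXPECTED="1.7.2"   # example target version
HOSTS="bootnode-0.example.com bootnode-1.example.com"

for host in $HOSTS; do
  actual=$(ssh "ubuntu@${host}" "geth version 2>/dev/null | awk '/^Version:/ {print \$2}'")
  if [ "$actual" != "$EXPECTED" ]; then
    echo "$host is on ${actual:-unknown}; needs upgrade"
    # e.g. scp the new binary over and restart the service here
  fi
done
```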
I'm totally okay with guaranteeing that we have at most one machine per load balancer.
@john-osullivan I'm wondering if maybe there's an AWS CLI call you can make that will get the IPs for a load balancer. We can grant permission to call it to the IAM role the instances use.
We can also try a workaround like this issue suggests.
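I can't confirm it's the same workaround as the linked issue, but one approach people use for network load balancers is to look up the LB's network interfaces, which do expose the addresses. The LB name below is a placeholder:

```bash
#!/bin/bash
# Sketch: find the IPs behind a network load balancer by filtering EC2
# network interfaces on the description AWS assigns them. Name is a placeholder.
set -euo pipefail

LB_NAME="bootnode-lb-0"   # hypothetical NLB name

aws ec2 describe-network-interfaces \
  --filters "Name=description,Values=ELB net/${LB_NAME}/*" \
  --query 'NetworkInterfaces[].Association.PublicIp' \
  --output text
```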
And regarding the update mechanism, I expect this to reduce to the general problem of making sure replaced instances run the right versions. We definitely do need that, but I also think it will cover us.
@john-osullivan If there is a guarantee that each LB DNS points to one machine (and there are no other machines that need to be updated), then there's no problem.
Based on some further research that happened yesterday, I'm now swapping out the LB in this solution for an elastic IP address.
It turns out that none of the load balancer options support UDP, which is required for communication between nodes. That's a hard blocker, and this six-year-old issue has a direct response from an Amazon rep saying that ELB does not support UDP. It seems like there's probably a technical reason under the hood, rather than just time constraints, as people have left comments requesting it as recently as October 2017 to no avail.
Did some research, and the happy outcome is that using elastic IPs ends up being a cleaner solution. We don't have to spin up as many resources, and the security group rules don't need to be duplicated as LB listeners. Here's the rundown:
- Each bootnode instance is launched with a user_data script which includes the public IP and its allocation ID.
- In init-bootnode.sh, the following line (taken from this StackOverflow question) will connect the new node to its EIP. The --allow-reassociation option ensures that when a new node gets spun up later and runs the same command, it is allowed to claim the EIP.

aws ec2 associate-address --instance-id $INSTANCE_ID --allocation-id $EIP_ID --allow-reassociation
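Put together with the instance metadata service, the association step in init-bootnode.sh might look roughly like this; the file holding the allocation ID is a placeholder for however user_data actually passes it in:

```bash
#!/bin/bash
# Sketch: associate this instance with its reserved elastic IP at boot.
# The allocation-ID file path is a placeholder for the user_data handoff.
set -euo pipefail

INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
EIP_ID=$(cat /opt/quorum/info/eip-allocation-id.txt)   # hypothetical data file

aws ec2 associate-address \
  --instance-id "$INSTANCE_ID" \
  --allocation-id "$EIP_ID" \
  --allow-reassociation
```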
One constraint on this solution is that by default, AWS only lets you reserve 5 elastic IPs per region. This doesn't get in our way, as we only want 3 bootnodes in each region, but if somebody configures a network with >5 bootnodes in a region, they'll run into issues and have to directly request more from Amazon.
To remedy this issue, I'm making the elastic IP functionality toggled by a boolean variable which defaults to false. If end users don't use EIPs, they'll need to figure out their own strategies for updating bootnode addresses, but that's acceptable. I'll make sure to describe the behavior in the documentation.
This was covered on the morning call, just want to document the strategy for future reference.
This issue is being wrapped up now over in #29
Merged in #29
Problem
When specifying bootnodes, quorum nodes use an enode address which includes the IP address of the bootnode. If a bootnode crashes and is replaced, its IP address will change, and already-active nodes will have no way of updating their bootnodes.
Potential Solutions