coreos / docs

Documentation for CoreOS projects
http://coreos.com/docs

No documentation of EC2 AMI network configuration #472

Closed hapnermw closed 9 years ago

hapnermw commented 9 years ago

CoreOS EC2 AMIs contain specialized network configuration to 'link' to EC2 eth0. There is no documentation describing this and no documentation for how to configure CoreOS networking to support the wide range of EC2 network interface options. I ran into this when attempting to configure CoreOS to use EC2 secondary IPs added to eth0.

This is just one more instance of the fact that CoreOS's documentation of its integration with AWS is not adequate for production use.

The following are notes on what was required to configure an EC2 CoreOS instance on a VPC public subnet to support the two EC2 secondary IPs added to eth0:

The primary IP must be specified as the last Address entry so that CoreOS uses it as its identity when joining the machine to its cluster; otherwise, after the first boot, the identity changes from the EC2 primary IP to the last Address and confuses CoreOS cluster join logic

EC2 VPC DNS defaults to 10.0.0.2 (EC2 does not document this); changing eth0 to add secondary IPs clears the CoreOS EC2 default /etc/resolv.conf configuration so both DNS and Domains must be specified to restore it

EC2 VPC 10.0.6.0/24 subnet default Gateway is 10.0.6.1 and 10.0.6.0 is the Gateway for the subnet (EC2 does not document this); if these are not set, the machine will work correctly on initial creation; however, after a reboot the CoreOS EC2 default route table will not contain these required gateways
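
For concreteness, the following is a minimal sketch of the kind of static networkd unit these notes imply. The file name and every address are hypothetical placeholders (a 10.0.6.0/24 subnet with a primary IP of 10.0.6.10 and secondaries 10.0.6.11 and 10.0.6.12), not values taken from a real instance:

# /etc/systemd/network/10-eth0-static.network (hypothetical path and addresses)
[Match]
Name=eth0

[Network]
# Restore the resolver settings that DHCP would otherwise supply
DNS=10.0.0.2
Domains=us-west-1.compute.internal
# Secondary IPs first, primary IP last so it is used as the machine's identity
Address=10.0.6.11/24
Address=10.0.6.12/24
Address=10.0.6.10/24
# Default gateway for the subnet
Gateway=10.0.6.1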

jedsmith commented 9 years ago

There is no CoreOS or Amazon problem here, I'm afraid. Let me see if I can help.

CoreOS EC2 AMIs contain specialized network configuration to 'link' to EC2 eth0. There is no documentation describing this

The CoreOS AMI defaults to DHCP over IPv4, as does just about every Linux distribution available on Amazon. There is no specialized network configuration; CoreOS merely configures systemd-networkd to fall back to DHCP on otherwise unconfigured interfaces. You can see this in /usr/lib/systemd/network/zz-default.network on a CoreOS system:

https://github.com/coreos/init/blob/master/systemd/network/zz-default.network
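
That unit is essentially a catch-all DHCP fallback. Its rough shape is the sketch below; consult the linked file for the exact contents CoreOS ships:

# Approximate shape of a catch-all DHCP fallback .network unit (not the literal file)
[Match]
Name=*

[Network]
DHCP=yes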

There is no documentation describing this and no documentation for how to configure CoreOS networking to support the wide range of EC2 network interface options.

You use systemd-networkd for this. The only role CoreOS has here is providing you a deployment of systemd. For example, I run several CoreOS machines with both eth0 and eth1 in Amazon, and I wrote systemd units to configure both of those interfaces with static addresses and routes.
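
As an illustration (with made-up addresses, not my actual units), a second interface with a static address and an additional route can be configured with something like:

# Hypothetical /etc/systemd/network/20-eth1.network
[Match]
Name=eth1

[Network]
Address=10.0.7.10/24

[Route]
# Reach a neighboring subnet through eth1 without touching the default route
Destination=10.0.8.0/24
Gateway=10.0.7.1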

Are you saying you have more than one RFC 1918 address on eth0? Why, if so? That's an odd configuration and I've never heard of a need for it before. If you mean a public IP or Elastic IP, you do not bring those up on the interface.

EC2 VPC DNS defaults to 10.0.0.2 (EC2 does not document this);

Because by default, your VPC's DHCP options are configured to serve AmazonProvidedDNS and DHCP configures your resolvers appropriately. It is also not fixed to 10.0.0.2; you picked the default numbering of 10.0.0.0/16 and the resolver is "plus two" of the network. A 10.10.0.0/16 VPC would have DNS at 10.10.0.2.

This is documented by Amazon here, actually: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/VPC_DHCP_Options.html#AmazonDNS

You can see your VPC's DHCP options here: https://camo.githubusercontent.com/8035f5767195e1f5977bdca116659a1cdc7db82b/687474703a2f2f692e696d6775722e636f6d2f5334506d3850762e706e67

changing eth0 to add secondary IPs clears the CoreOS EC2 default /etc/resolv.conf configuration so both DNS and Domains must be specified to restore it

There is no CoreOS EC2 default. Those values are set from DHCP, as shown above. If you switch to a static configuration to add multiple addresses to eth0, you are no longer setting your resolvers from DHCP and need to provide them via other means, which you figured out, as well as your gateway...

EC2 VPC 10.0.6.0/24 subnet default Gateway is 10.0.6.1 and 10.0.6.0 is the Gateway for the subnet (EC2 does not document this); if these are not set, the machine will work correctly on initial creation; however, after a reboot the CoreOS EC2 default route table will not contain these required gateways

...and it stops working because you stopped using DHCP to obtain that information. To be clear, CoreOS defaults to DHCP, as do all AMIs on Amazon (not just ours). When you configure alternative network configurations, you must provide this information, as it is no longer discovered via DHCP. This isn't specific to CoreOS.

The primary IP must be specified as the last Address entry so that CoreOS uses it as its identity when joining the machine to its cluster;

This is happening because you don't specifically bind etcd to an address. When you have multiple addresses on an interface, behavior starts to get strange and you must be very cautious about which address you bind a program to. Depending on what version of etcd you are running -- I'm betting you run the built-in one -- you can (and really should) specify an exact address to bind and advertise with flags. Otherwise, the outgoing address selected when etcd comes up is left to Linux's source-address selection, and it isn't picking the one you want.

Check the etcd documentation for how to select an address.
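
As a sketch (assuming the built-in etcd2 service and its documented environment variables; 10.0.6.10 is a placeholder for your primary private IP, and older etcd 0.4 uses different names), a systemd drop-in can pin both the bind and advertise addresses:

# Hypothetical drop-in: /etc/systemd/system/etcd2.service.d/30-bind.conf
[Service]
Environment=ETCD_LISTEN_CLIENT_URLS=http://10.0.6.10:2379,http://127.0.0.1:2379
Environment=ETCD_ADVERTISE_CLIENT_URLS=http://10.0.6.10:2379
Environment=ETCD_LISTEN_PEER_URLS=http://10.0.6.10:2380
Environment=ETCD_INITIAL_ADVERTISE_PEER_URLS=http://10.0.6.10:2380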

My advice here is to reevaluate why you need multiple addresses on eth0, as it is the source of almost all of your problems. Especially if they are in different subnets, you are going to have a bad time. This is all Linux administration, too, and doesn't speak to CoreOS or Amazon or suitability for a given use.

Hope that helps!

hapnermw commented 9 years ago

If this were 'normal' there would be a networkd config that resulted from AWS/DHCP network setup - there was none. Since I haven't used CoreOS in other environments I don't know how DHCP provided config is reflected to networkd. Possibly the lack of this reflection is a general issue.

If you want your users to be successful in their use of CoreOS on AWS, it would be a good idea to give them a bit more leg-up on bridging the AWS to CoreOS networking gap. I made the changes to AWS network config at the time the instance was created. It is not at all obvious that this configuration will work on the first boot and fail on all subsequent boots. This is not 'normal' Linux network config.

This little niggling detail is just one example of CoreOS's casual approach to support of its users on AWS. A bigger hole is the complete lack of VPC related info. There seems to be the assumption that from a CoreOS perspective all AWS instances are the same - they are not. CoreOS may wish to leave this as an exercise for their AWS users; but, as AWS ramps up ECS, CoreOS may find it has fewer AWS users than it had anticipated.

jedsmith commented 9 years ago

If this were 'normal' there would be a networkd config that resulted from AWS/DHCP network setup - there was none. Since I haven't used CoreOS in other environments I don't know how DHCP provided config is reflected to networkd. Possibly the lack of this reflection is a general issue.

If I'm understanding your issue correctly (and please, correct me if I'm not), it is that you did not find it easy to interrogate what systemd-networkd had obtained via DHCP. Here's how:

core@example ~ $ networkctl status eth0
● 2: eth0
   Link File: /usr/lib64/systemd/network/99-default.link
Network File: /usr/lib64/systemd/network/zz-default.network
        Type: ether
       State: routable (configured)
        Path: xen-vif-0
      Driver: vif
  HW Address: 06:40:50:dd:c1:ad
         MTU: 9001
     Address: 10.0.0.12
              fe80::440:50ff:fedd:c1ad
     Gateway: 10.0.0.1
         DNS: 10.0.0.2
      Domain: us-west-1.compute.internal

networkctl is documented upstream (http://www.freedesktop.org/software/systemd/man/networkctl.html). Again, this is not CoreOS-specific; this is plain systemd-networkd. Arch Linux uses it as well, and has documented it for their users (https://wiki.archlinux.org/index.php/Systemd-networkd). We have some documentation on using systemd-networkd (https://coreos.com/docs/cluster-management/setup/network-config-with-networkd/), with a link through to the full upstream documentation.

A bigger hole is the complete lack of VPC related info.

I sympathize with this to an extent, but bear in mind that a VPC is a network. You're asking an operating system to document network architecture and design. I agree that perhaps a getting started quickly with a VPC guide might be useful, but I'm not sure how much value there would be beyond Amazon's own (extensive) VPC documentation as well as the wizards they offer. To put it another way, CoreOS documenting VPC is roughly analogous to Ubuntu documenting how to architect a datacenter network.

There's just so little of CoreOS that even intersects with VPC that I'm not sure what we could document that would be very useful in a general sense. I spent a few minutes thinking through such a guide before answering you, and there were so many "if you've chosen this, do this" decision points that generalizing such information is nearly impossible.

These are choices you must make as a network administrator, and CoreOS runs within that network. An area where I feel CoreOS can do very well documenting VPC is how to architect a Tectonic deployment, because then getting the network right is extremely important, and Tectonic is more of a "unit" in a networking sense. The only surface that CoreOS has with a VPC is that it expects to be able to speak IPv4 to other nodes in a CoreOS cluster, and that is general network design.

If you want your users to be successful in their use of CoreOS on AWS, it would be a good idea to give them a bit more leg-up on bridging the AWS to CoreOS networking gap. [...] This little niggling detail is just one example of CoreOS's casual approach to support of its users on AWS.

My role at CoreOS is unique in that I influence the product but my primary task is executing on our infrastructure. I have been described as a user first. In this role, basically my entire job is running CoreOS on Amazon in several different ways. That's why I was asked to respond to you, because this is my primary role and my area of expertise.

I have to say, my experience with CoreOS has been the polar opposite of yours, even with the empowerment I have been given to disagree with product direction. To be clear, there are a few sharp edges that bug me in deployments of CoreOS, but at no time have I ever considered CoreOS as having a "casual approach" to AWS support; Amazon is a first-class platform for CoreOS, with AMIs built for every release and extensive documentation specific to Amazon (https://coreos.com/docs/running-coreos/cloud-providers/ec2/) and even ECS (https://coreos.com/docs/running-coreos/cloud-providers/ecs/). I'm not sure where you're getting this opinion of CoreOS, but it is the complete opposite of my personal experience. I wish I knew how to bridge that gap, and I assure you, yours is a unique opinion in my experience.

I think, in the end, your primary complaints are with systemd and Amazon and they happen to be negatively impacting your CoreOS experience. That is a bummer, and I wish I could help.

hapnermw commented 9 years ago

One source of confusion is that systemd documents its configuration directories as

/etc/systemd/system/*

/run/systemd/system/*

/usr/lib/systemd/system/*

It doesn't say anything about

/usr/lib64/systemd/*

If I had known it existed, its contents might have answered my questions.

I did not use the networkctl status command because I didn't know it existed. The systemd.network docs don't mention networkctl, nor does systemd-networkd.service, nor does the CoreOS intro to networkd.

While I don't expect CoreOS to document Linux internals, giving your users a basic assist with viewing their network config isn't that much to ask.

The problem with the 'throw up your hands' approach to VPC is that anyone using CoreOS in production is going to be using a VPC. The least CoreOS should do is provide an example that illustrates how to set up a multi-availability-zone cluster in a VPC with public and private subnets. Anything less doesn't make the grade for production. I think you would find much to document if you walked through this. The fact that there is no such example says loud and clear that CoreOS is not serious about production use on AWS.

I think the CoreOS team has done and is doing a great job with CoreOS - on the other hand, if your users can't deploy and manage a production cluster in a multi-zone VPC, you have failed.

willscripted commented 8 years ago

Thanks Hapner for asking the question and @jedsmith for the really thorough overview of private network configuration in Linux. You've connected a lot of previously foreign concepts for me :+1: :)