Floating IP handling for OpenStack may be incorrect

nuwang commented 6 years ago

It looks like our assumption of taking the first available external network and connecting routers to that network does not work in NeCTAR. This is because, when floating ips are created, they are associated with a specific external network, as shown here:

+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2017-07-27T11:44:07Z                 |
| description         |                                      |
| fixed_ip_address    | None                                 |
| floating_ip_address | 203.100.30.26                        |
| floating_network_id | 058b38de-830a-46ab-9d95-7a614cb06f1b |
| id                  | 00fb871e-3dfc-404a-ae54-124ee70ccd24 |
| name                | 203.100.30.26                        |
| port_id             | None                                 |
| project_id          | 5a0ec9b9aa86427081b6643bf57ad926     |
| revision_number     | 1                                    |
| router_id           | None                                 |
| status              | ACTIVE                               |
| updated_at          | 2017-07-27T11:44:07Z                 |
+---------------------+--------------------------------------+
(openstack-tools)nuwans-mbp-2:openstack-tools Nuwan$ openstack floating ip show 3fd5b8f9-86bf-4e72-b06f-5c96b826b3c1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2017-07-27T11:45:39Z                 |
| description         |                                      |
| fixed_ip_address    | 10.0.0.12                            |
| floating_ip_address | 115.146.80.149                       |
| floating_network_id | e48bdd06-cc3e-46e1-b7ea-64af43c74ef8 |
| id                  | 3fd5b8f9-86bf-4e72-b06f-5c96b826b3c1 |
| name                | 115.146.80.149                       |
| port_id             | 479825c0-66b8-4fa5-99f5-9cb5b540ab20 |
| project_id          | 5a0ec9b9aa86427081b6643bf57ad926     |
| revision_number     | 2                                    |
| router_id           | 9cb527b3-f340-4715-921b-f5fadd163fd2 |
| status              | ACTIVE                               |
| updated_at          | 2017-11-09T08:55:44Z                 |
+---------------------+--------------------------------------+

The floating_network_id appears to be the id of the external network. Therefore, if the external network of the router, and the external network of the floating ip do not match, the launch fails with the following error:

Task failed: Unable to associate floating IP 203.100.30.26 to fixed IP 10.0.0.12 for instance f6adfada-247a-401e-a2b2-75a2b51b4861. Error: External network 058b38de-830a-46ab-9d95-7a614cb06f1b is not reachable from subnet 214fc6b9-788d-4f3b-9ba0-1d002890e7d2.  Therefore, cannot associate Port 479825c0-66b8-4fa5-99f5-9cb5b540ab20 with a Floating IP.
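
The mismatch can be caught up front by comparing the two network ids before attempting the association. Below is a minimal sketch using plain dicts shaped like the Neutron records above. `can_associate` is a hypothetical helper, not part of CloudBridge, and the router record is illustrative: its gateway network is inferred from the second floating IP, which is already associated through that router.

```python
def can_associate(floating_ip, router):
    """Return True if the floating IP lives on the router's external network."""
    # A router without a gateway has no external network to match against.
    gateway_info = router.get("external_gateway_info") or {}
    return floating_ip["floating_network_id"] == gateway_info.get("network_id")


# Illustrative router record; the network id is inferred, not copied from real output.
router = {
    "id": "9cb527b3-f340-4715-921b-f5fadd163fd2",
    "external_gateway_info": {
        "network_id": "e48bdd06-cc3e-46e1-b7ea-64af43c74ef8",
    },
}

# The two floating IPs shown above, reduced to the relevant fields.
fip_unreachable = {
    "floating_ip_address": "203.100.30.26",
    "floating_network_id": "058b38de-830a-46ab-9d95-7a614cb06f1b",
}
fip_ok = {
    "floating_ip_address": "115.146.80.149",
    "floating_network_id": "e48bdd06-cc3e-46e1-b7ea-64af43c74ef8",
}
```

With these records, `can_associate(fip_ok, router)` holds while `can_associate(fip_unreachable, router)` does not, which corresponds to the failing launch above.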

It looks like maybe the solution is to move floating ips under the internet gateway?

nuwang commented 6 years ago

Another solution may be to connect all external networks to each subnet. This means that a router has to be created for each external network and attached to the subnet.

afgane commented 6 years ago

To be clear, by "move under the internet gateway", you mean provider.networking.gateways.<gateway>.floating_ips.*? Could we instead require a network when creating a floating IP and ignore it with the other providers (no other provider requires the network)?

nuwang commented 6 years ago

I'm not sure we can support a network parameter consistently across providers. That's because the only valid networks that openstack will accept are external networks.

afgane commented 6 years ago

Sure, but we have the .external property on the networks, and it would need to be clear in the docs that the supplied param needs to be an external network. We can do a check and raise an exception if that's not the case.
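
A rough sketch of what such a check could look like. Everything here is hypothetical: `Network`, `InvalidNetworkError` and `create_floating_ip` are stand-ins for illustration and are not the CloudBridge API; only the `external` property is taken from the discussion.

```python
from dataclasses import dataclass


@dataclass
class Network:
    """Hypothetical stand-in for a provider network object."""
    id: str
    external: bool


class InvalidNetworkError(ValueError):
    """Hypothetical exception raised for a non-external network."""


def create_floating_ip(network):
    # On OpenStack only external networks can back a floating IP.
    if not network.external:
        raise InvalidNetworkError(
            f"Network {network.id} is not external; "
            "cannot allocate a floating IP on it."
        )
    # A real implementation would call the provider here;
    # this sketch only records which network the IP belongs to.
    return {"floating_network_id": network.id}
```

Passing a private network such as `Network(id="netX", external=False)` raises `InvalidNetworkError`, while an external one returns the pairing.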

nuwang commented 6 years ago

Right, we could probably do that. However, I'm not sure that's desirable - the main reason being - that's another concept to learn, filter by etc whereas we've already put in some effort to hide the "external" property because it's an openstack only concept. I think it would be great if we could structurally enable the right behaviour, so that when you do create a floating ip, you'd be naturally inclined to attach it via the correct network. Since an openstack gateway == external network, that almost seems like the right place. I guess we should have a more in-depth design discussion to weigh the pros and cons.

afgane commented 6 years ago

Spent a good chunk of time on this, and the best way for me to reason about it was to run through a typical usage scenario (basically, the one documented under step 2 here: http://cloudbridge.cloudve.org/en/latest/topics/networking.html):

1. Create network netX
2. Create subnet snX
3. Create an instance within snX (netX)
4. Create a router for netX
   a. AWS: route table associated with netX
   b. OS: router where the net is ignored
5. Attach snX to the router
   a. AWS: create subnet snX association with the route table, which implicitly removes the subnet from the Main route table for netX
   b. OS: associate snX w/ the router
6. Get an inet gtw for netX
   a. AWS: create internet gateway, attach it to netX
   b. OS: netX param is ignored; return the first external network, netY (this cannot be netX because netX is private while the inet gtw must be attached to an external network). See comment below in Step 8b → just choosing the first external network will not work here because the value here is just returned but not actually used until step 7b and at that point we may get Bad router request: Incompatible network. exception if that first network is not suitable. The network object does not contain any (obvious) way to discern which network is allowed to be used here but we should somehow be able to figure out which network is the default.
7. Attach router to gtw
   a. AWS: already attached to netX, adds route to inet gtw
   b. OS: attach to gtw, netY
8. Create a floating IP
   a. AWS: no association with a network
   b. OS: must be associated with a network and the first external network is assumed, which may be netZ; however, for the routing to work, it must be on the same network as the router, netY. Possibly an additional problem arises here with NeCTAR where the external property is set on most of the networks when queried via the API, yet when using the dashboard, only 3 networks are presented as viable options.

To have this work consistently on OS, we’d need to parameterize step 6 (inet gtw) with an external network and use that same network in step 8 (fip). However, this external network is different from the private netX initially created and used on AWS across the board so it would require the user to be aware of this difference and take the additional step, which introduces cloud-specific code.

afgane commented 6 years ago

After much debating, the conclusion is to stick with the initial suggestion and move floating IPs under the internet gateway. It conceptually does make sense but requires the user to become aware of an additional concept, which is a drawback.

afgane commented 6 years ago

Given we're now parameterizing gateways with a network, would it make sense to move gateways (and hence FIPs) under network? This came up as I tried to implement this new structure in the CloudLaunch API as getting a gateway (or a FIP) now needs a network. WIP branch available here https://github.com/CloudVE/djcloudbridge/tree/networking

nuwang commented 6 years ago

I can't quite remember why we need to parameterize the internet gateway with the network again. Is it to help with attaching to the router? However, if we do need to do this, I see what you mean in terms of it being necessary to implement the cloudlaunch endpoint. In which case, we probably should?

afgane commented 6 years ago

I believe the reason we decided to parameterize the gateway with a network was because of OpenStack so that when a floating IP is being created, it is created on the same network as the one the gateway is connected to, which is required with OpenStack. Without parameterizing it, the user would need to make sure the two networks are the same. The initial implementation assumed there is only one external network and used it but we discovered that doesn't work on NeCTAR in particular. The following is the usage pattern:

net = networking.networks.create(...)
subnet = net.create(...)
vm = instances.create(subnet, ...)
router = networking.routers.create(net, ...)
router.attach_subnet(subnet)
gtw = networking.gateways.get_or_create(net, ...)
fip = gtw.create()
vm.add_floating_ip(fip)

This now ensures the gtw is attached to the same network as the one the fip is being created under. If we don't supply a network to the gateway, the logic needs to infer automatically which network to use. Having gone through this now, it feels like we can omit the network as long as the FIP is nested under a gateway so it can match it. This will require an explicit attaching of a router to a network, which we initially used, so the logic would become:

...
gtw = networking.gateways.get_or_create()
router.attach_gateway(gtw)  # This step is new from above as it was automatic there
fip = gtw.create()
vm.add_floating_ip(fip)
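
To illustrate why nesting helps, here is a minimal in-memory sketch in which a floating IP can only be created through a gateway and therefore always inherits that gateway's external network. The classes are hypothetical stand-ins, not the CloudBridge implementation, and the `floating_ips.create()` / `floating_ips.list()` names are assumptions modelled on the usage patterns in this thread.

```python
import itertools
from dataclasses import dataclass


@dataclass
class FloatingIP:
    """Hypothetical floating IP record tied to one external network."""
    id: str
    network_id: str


class GatewayFloatingIPs:
    """Hypothetical sub-service: floating IPs scoped to a single gateway."""

    def __init__(self, gateway):
        self._gateway = gateway
        self._ips = []
        self._ids = itertools.count(1)

    def create(self):
        # The network is taken from the gateway, so the caller cannot
        # accidentally pick a different external network.
        fip = FloatingIP(
            id=f"fip-{next(self._ids)}",
            network_id=self._gateway.network_id,
        )
        self._ips.append(fip)
        return fip

    def list(self):
        # Return a copy so callers cannot mutate the internal state.
        return self._ips[:]


class Gateway:
    """Hypothetical gateway; on OpenStack it maps to an external network."""

    def __init__(self, network_id):
        self.network_id = network_id
        self.floating_ips = GatewayFloatingIPs(self)
```

Because there is no way to create a `FloatingIP` without a `Gateway` in this sketch, the "same network" requirement is enforced by the structure instead of by documentation.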

Would this create issues down the line though, for scenarios where a FIP/gateway wants to be reused (vs. being created and used right away)? Something like this:

subnet = get_subnet...
vm = create(subnet, ...)
gtw = networking.gateways.get_or_create()
fip = gtw.floating_ips.list()[0]
vm.add_floating_ip(fip)

Wouldn't this cause an issue if the returned gateway is not associated with the same network as the subnet used to launch the VM?

One problem that arises if we nest gateways under the network is how to retrieve gateways that are not attached to a network. Subnets, as an example of a resource that exists under the network, must (at the provider level) be parameterized/created under a network. Gateways need not be, so we could have orphan gateways that we cannot access.

afgane commented 6 years ago

After a bit more discussion, we realized why the network parameter is necessary for the gateway and it's because of AWS. Without it, the notion of get_or_create_inet_gateway had no way to 'discover' an existing internet gateway that's properly connected. So short of picking a random one, we need the network to be able to filter the internet gateways for it vs. always creating a new one (as we did initially). Creating a new one each time was not sustainable given that a long-running scenario (e.g., launch an instance via CloudLaunch and keep it alive for days) would not have a way to get back the launch context and clean up (e.g., the instance is manually deleted vs. via CloudLaunch).

The conclusion is to nest the gateways under network. This will have the side effect of not showing gateways that are not connected to a network (for AWS) but at least CloudBridge's implementation will attach the gateway to a network as soon as a gateway is created so only externally-created gateways that are not attached will be omitted. In the future, if deemed desirable, we can also add a gateways property to networking that would list all gateways irrespective of a network.
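
The behaviour described above can be sketched with a small in-memory model: gateways are looked up by network, an existing one is reused, and gateways not attached to any network stay invisible from this view. All names are hypothetical and chosen for illustration; this is not the CloudBridge or AWS API.

```python
class Gateway:
    """Hypothetical internet gateway, optionally attached to a network."""

    def __init__(self, gateway_id, network_id=None):
        self.id = gateway_id
        self.network_id = network_id


class NetworkGateways:
    """Hypothetical view of the gateways that belong to one network."""

    def __init__(self, network_id, all_gateways):
        self._network_id = network_id
        self._all = all_gateways  # shared list standing in for the provider

    def list(self):
        # Only gateways attached to this network are visible here;
        # unattached (orphan) gateways are filtered out.
        return [g for g in self._all if g.network_id == self._network_id]

    def get_or_create(self):
        existing = self.list()
        if existing:
            # Reuse the attached gateway instead of creating a new one.
            return existing[0]
        # Create and attach in one step so the new gateway is never orphaned.
        gtw = Gateway(f"igw-{len(self._all) + 1}", self._network_id)
        self._all.append(gtw)
        return gtw
```

Calling `get_or_create()` twice on the same network returns the same gateway, which is what lets a long-running launch find its gateway again later.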

nuwang commented 6 years ago

Related issue reported on launchpad in: https://bugs.launchpad.net/neutron/+bug/1743480

nuwang commented 6 years ago

A workaround for the OpenStack bug above has been made in: https://github.com/gvlproject/cloudbridge/commit/7688d283fd401857fb7449c7dadc118a19d915aa and https://github.com/gvlproject/cloudbridge/commit/879117a2a123e79623e26e7da2833c806e7381fb

I think this issue can be closed now.

CloudVE / cloudbridge

Floating IP handling for OpenStack may be incorrect #104