confidential-containers / cloud-api-adaptor

Ability to create Kata pods using cloud provider APIs aka the peer-pods approach
Apache License 2.0

Add support to use public IP for the pod VM in Azure #2035

Closed bpradipt closed 2 months ago

bpradipt commented 2 months ago

This is useful for environments where the K8s cluster runs on a developer workstation and the peer pod is created in Azure, for example when testing or when working on AI models that require large VMs. Similar functionality already exists for the AWS provider, so this PR also brings functional parity.

bpradipt commented 2 months ago

> I was wondering whether we could have those resources being created (and deleted) implicitly in a createVM call instead of explicitly creating and deleting them (using a VirtualMachinePublicIPAddressConfiguration).
>
> See this code sample: https://github.com/mkulke/mkosi-playground/blob/e7bdeed71f8a3820fa265bf1ca74c7fec2e0e6cb/launch-vm/main.go#L158-L176
>
> I'm not sure why we weren't doing that for the NIC in the first place, so I need to test that. If it works like this we wouldn't have to manage the lifecycle of public ips and nics manually, which can be prone to race conditions.

Makes sense. On the public IP front I tried to do something similar but it didn't work for me.
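To make the implicit approach discussed above concrete, here is a minimal sketch of the network profile that a create-VM request could carry so that the NIC (and optionally a public IP) is created and deleted together with the VM. It is not the code from this PR or from the linked sample. The struct types are local stand-ins that mirror the JSON shape of the Azure VM request body as I understand it, and `buildNetworkProfile`, its `usePublicIP` parameter, the resource names and the `2020-11-01` network API version are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins (assumptions) for the network part of a VM create request.
type subResource struct {
	ID string `json:"id"`
}

type publicIPConfigProps struct {
	DeleteOption string `json:"deleteOption"`
}

type publicIPConfig struct {
	Name       string              `json:"name"`
	Properties publicIPConfigProps `json:"properties"`
}

type ipConfigProps struct {
	Subnet                       subResource     `json:"subnet"`
	PublicIPAddressConfiguration *publicIPConfig `json:"publicIPAddressConfiguration,omitempty"`
}

type ipConfig struct {
	Name       string        `json:"name"`
	Properties ipConfigProps `json:"properties"`
}

type nicConfigProps struct {
	Primary          bool       `json:"primary"`
	DeleteOption     string     `json:"deleteOption"`
	IPConfigurations []ipConfig `json:"ipConfigurations"`
}

type nicConfig struct {
	Name       string         `json:"name"`
	Properties nicConfigProps `json:"properties"`
}

type networkProfile struct {
	NetworkAPIVersion              string      `json:"networkApiVersion"`
	NetworkInterfaceConfigurations []nicConfig `json:"networkInterfaceConfigurations"`
}

// buildNetworkProfile (hypothetical helper) describes a NIC inline so that it
// is created with the VM and removed again when the VM is deleted.
func buildNetworkProfile(vmName, subnetID string, usePublicIP bool) networkProfile {
	ip := ipConfig{
		Name:       vmName + "-ipconfig",
		Properties: ipConfigProps{Subnet: subResource{ID: subnetID}},
	}
	if usePublicIP {
		// The public IP is also declared inline, with the same delete option.
		ip.Properties.PublicIPAddressConfiguration = &publicIPConfig{
			Name:       vmName + "-pip",
			Properties: publicIPConfigProps{DeleteOption: "Delete"},
		}
	}
	return networkProfile{
		NetworkAPIVersion: "2020-11-01", // assumed API version value
		NetworkInterfaceConfigurations: []nicConfig{{
			Name: vmName + "-nic",
			Properties: nicConfigProps{
				Primary:          true,
				DeleteOption:     "Delete",
				IPConfigurations: []ipConfig{ip},
			},
		}},
	}
}

func main() {
	// Placeholder subnet ID, not a real resource.
	profile := buildNetworkProfile("podvm-example", "/subscriptions/.../subnets/example", true)
	out, err := json.MarshalIndent(profile, "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

With the delete option set on both the NIC and the public IP, there is no separate create or delete call for either resource, which is the lifecycle simplification the quoted comment is after.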

mkulke commented 2 months ago

> > I was wondering whether we could have those resources being created (and deleted) implicitly in a createVM call instead of explicitly creating and deleting them (using a VirtualMachinePublicIPAddressConfiguration). See this code sample: https://github.com/mkulke/mkosi-playground/blob/e7bdeed71f8a3820fa265bf1ca74c7fec2e0e6cb/launch-vm/main.go#L158-L176 I'm not sure why we weren't doing that for the NIC in the first place, so I need to test that. If it works like this we wouldn't have to manage the lifecycle of public ips and nics manually, which can be prone to race conditions.
>
> Makes sense. On the public IP front I tried to do something similar but it didn't work for me.

i see, let me try that. it shouldn't be too hard to convert the code to implicit nic creation.

mkulke commented 2 months ago

I have played around with implicit creation of nics, it seems to work for me, at least I didn't encounter problems after some casual testing: https://github.com/confidential-containers/cloud-api-adaptor/compare/main...mkulke:cloud-api-adaptor:mkulke/az-use-implicit-nic-creation?expand=1

bpradipt commented 2 months ago

@mkulke thanks, let me try with your changes.

mkulke commented 2 months ago

> @mkulke thanks, let me try with your changes.

yeah, that would be interesting. I'm currently observing network problems after more thorough testing. I can't really explain that yet, since the infra looks similar when created implicitly. that's pretty curious, and it would be good to get to the bottom of this problem. but that might not be trivial, and if it's urgent we could consider merging the explicit management of IP addresses in this PR.

there is a resource leak risk similar to the one that currently exists for NICs, but since this setting is off by default and should not be turned on casually, it might be tolerable.

bpradipt commented 2 months ago

@mkulke my initial test using your code resulted in the following error. I cherry-picked the changes on top of 0.9.0 to work on my current setup.

RESPONSE 400: 400 Bad Request
ERROR CODE: OutboundConnectivityNotEnabledOnVM
--------------------------------------------------------------------------------
{
  "error": {
    "details": [],
    "code": "OutboundConnectivityNotEnabledOnVM",
    "message": "No outbound connectivity configured for virtual machine /subscriptions/***/resourceGroups/aro-mz4v0ygu/providers/Microsoft.Compute/virtualMachines/podvm-app-47cb4c3b. Please attach standard load balancer or public IP address to VM, create NAT gateway or configure user-defined routes (UDR) in the subnet. Learn more at aka.ms/defaultoutboundaccess."
  }
}

Is this what you are seeing? I'm yet to debug it though

mkulke commented 2 months ago

> @mkulke my initial test using your code resulted in the following error. I cherry-picked the changes on top of 0.9.0 to work on my current setup.
>
> RESPONSE 400: 400 Bad Request
> ERROR CODE: OutboundConnectivityNotEnabledOnVM
> --------------------------------------------------------------------------------
> {
>   "error": {
>     "details": [],
>     "code": "OutboundConnectivityNotEnabledOnVM",
>     "message": "No outbound connectivity configured for virtual machine /subscriptions/***/resourceGroups/aro-mz4v0ygu/providers/Microsoft.Compute/virtualMachines/podvm-app-47cb4c3b. Please attach standard load balancer or public IP address to VM, create NAT gateway or configure user-defined routes (UDR) in the subnet. Learn more at aka.ms/defaultoutboundaccess."
>   }
> }
>
> Is this what you are seeing? I'm yet to debug it though

hmm no, that's not what I am seeing. my vm is created successfully, but there are network connectivity issues once the vm is created (it works initially, but fails during image-pull). the error is interesting, though.

mkulke commented 2 months ago

> "No outbound connectivity configured for virtual machine /subscriptions/***/resourceGroups/aro-mz4v0ygu/providers/Microsoft.Compute/virtualMachines/podvm-app-47cb4c3b. Please attach standard load balancer or public IP address to VM, create NAT gateway or configure user-defined routes (UDR) in the subnet. Learn more at aka.ms/defaultoutboundaccess."

I tried to reproduce that error. if the vnet we are attaching the vm to does not have a NAT gw, Azure will refuse to start the vm. this is expected behaviour. see this link
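As an illustration of how a caller might tell this particular failure apart from other 400 responses, here is a small sketch that decodes an error body of the shape shown in this thread and checks its code. The helper name `isOutboundConnectivityError` and the abbreviated sample message are assumptions for illustration. This is not how cloud-api-adaptor handles the error.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// azureErrorBody models only the fields of the error payload quoted above.
type azureErrorBody struct {
	Error struct {
		Code    string `json:"code"`
		Message string `json:"message"`
	} `json:"error"`
}

// Error code taken from the response quoted in this thread.
const outboundNotEnabled = "OutboundConnectivityNotEnabledOnVM"

// isOutboundConnectivityError (hypothetical helper) reports whether the body
// carries the "no outbound connectivity" error code.
func isOutboundConnectivityError(body []byte) (bool, error) {
	var parsed azureErrorBody
	if err := json.Unmarshal(body, &parsed); err != nil {
		return false, fmt.Errorf("decoding error body: %w", err)
	}
	return parsed.Error.Code == outboundNotEnabled, nil
}

func main() {
	// Sample body with the message shortened for readability.
	body := []byte(`{"error":{"details":[],"code":"OutboundConnectivityNotEnabledOnVM","message":"No outbound connectivity configured for virtual machine"}}`)

	match, err := isOutboundConnectivityError(body)
	if err != nil {
		panic(err)
	}
	if match {
		fmt.Println("subnet has no outbound path: attach a public IP, NAT gateway, load balancer or UDR")
	}
}
```

Matching on the error code rather than on the message text keeps the check stable if the wording of the message changes.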

mkulke commented 2 months ago

> I'm currently observing network problems after more thorough testing. I can't really explain that yet, since the infra looks similar when created implicitly. that's pretty curious, and it would be good to get to the bottom of this problem.

I found the origin of my issues. it turned out to be specific to the network I was testing in. the implicitly created NICs were subject to outbound traffic restrictions, while the explicitly created NICs were not.

mkulke commented 2 months ago

@bpradipt I pushed some changes to that branch. There's a commit that (always) adds a public IP, so the above error should be gone even if you have no NAT gw on your subnet. Please test if you have time. If that works for you, I'd open a separate PR with the implicit NIC creation, and we can base a -use-public-ip toggle on that branch. That would get rid of a lot of brittle cleanup/management logic.

mkulke commented 2 months ago

> ERROR CODE: OutboundConnectivityNotEnabledOnVM

I dove a bit more into this. It turns out this is actually expected, albeit a bit surprising. The reason we currently have outbound connectivity is that when we create a NIC and a VM separately, the NIC gets "default outbound access" via a transparent public IP. This is not the case if we create the NIC as part of a VM.

Implicitly assigning a public IP is not great security-wise, and hence this behaviour is being retired. So either way, we have to make sure, by using explicit network configuration, that podvms will be able to pull images from the internet (or not, depending on whether a user wants that in their deployment).

bpradipt commented 2 months ago

Superseded by https://github.com/confidential-containers/cloud-api-adaptor/pull/2056