check if allocated dedicated host is available (with timeout) - fixes #216

hidalgopl commented 3 years ago

There are two issues that this should fix:

If no dedicated host was available, kip tried to allocate a new one and run an instance on it without checking whether the new host was available.
If a host was available but fully occupied, kip still tried to run an instance on it, because it only checked the host's state. Both should be fixed now.
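
A minimal sketch of the "check with timeout" idea, assuming a polling loop bounded by a context deadline. The names waitForHost, hostChecker and simulate are hypothetical and are not taken from the kip code base; in kip the checker would presumably wrap an EC2 DescribeHosts call.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// hostChecker is a hypothetical callback that reports whether the
// dedicated host can accept an instance right now.
type hostChecker func(ctx context.Context) (bool, error)

// waitForHost polls check every interval until it reports true,
// returns an error, or the timeout elapses.
func waitForHost(ctx context.Context, check hostChecker, interval, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		ok, err := check(ctx)
		if err != nil {
			return fmt.Errorf("checking host: %w", err)
		}
		if ok {
			return nil
		}
		select {
		case <-ctx.Done():
			return errors.New("timed out waiting for dedicated host to become available")
		case <-ticker.C:
			// Next poll.
		}
	}
}

// simulate runs waitForHost against a fake host that becomes available
// on poll number availableOn, and returns how many polls were made.
func simulate(availableOn int, timeoutMs int) (int, error) {
	calls := 0
	check := func(ctx context.Context) (bool, error) {
		calls++
		return calls >= availableOn, nil
	}
	err := waitForHost(context.Background(), check,
		5*time.Millisecond, time.Duration(timeoutMs)*time.Millisecond)
	return calls, err
}

func main() {
	calls, err := simulate(3, 1000)
	fmt.Println("polls:", calls, "err:", err)
}
```

The same loop covers both cases described above: a freshly allocated host that is not ready yet, and an existing host that is temporarily full.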

I'm still getting

kip-provider-0 kip I0126 14:51:01.640099       1 node_controller.go:327] Unhealthy wait for running: waiting for instance to start: ResourceNotReady: failed waiting for successful resource state, terminating node: 31ecc3c4-b4b8-47b2-9fd0-52d94514ebd8

from e.client.WaitUntilInstanceRunning(dii). I tried increasing the retry timeout, but it didn't help.

hidalgopl commented 3 years ago

FYI, I think the issue we're hitting is related to the volume size of the AMI. We started seeing it after we resized the attached volume to 500 GB. With my changes from this PR, kip gets to the point where it sends the RunInstances request, but instance creation fails and the dedicated host almost immediately changes its state to Pending. (Screenshot from 2021-01-27 19-13-32) As the screenshot shows, the error is Client.InvalidParameterCombination: Could not create volume with size 500GiB from snapshot 'snap-064acde5cd376cb6b'. I've added logging of the block devices' volume sizes to check whether the correct numbers are passed to the RunInstances call.

kip-provider-0 kip I0127 17:45:33.909398       1 instances.go:493] Starting instance for node: &{{Node v1} {26b5749b-a378-4a77-9ece-05314c889606 map[] 2021-01-27 17:45:33.90770785 +0000 UTC <nil> map[] 08358c1e-5551-4f58-be7a-c4119665a587 default} {mac1.metal ami-0929815870cdeaa46 false false true {   20G false <nil> false <nil>}} {Creating  [] default_buildkite-agent-mac1-metal}}
kip-provider-0 kip I0127 17:45:34.209321       1 instances.go:504] calculated volume size for node: 500
kip-provider-0 kip I0127 17:45:35.209502       1 instances.go:480] checking host h-020fba8d6384e6cc9 availability...
kip-provider-0 kip I0127 17:45:35.281221       1 instances.go:514] Starting node with security groups: [sg-0ee086454488a7451] subnet: 'subnet-2769f140'
kip-provider-0 kip I0127 17:45:35.281250       1 instances.go:516] Block devices for a node
kip-provider-0 kip I0127 17:45:35.281257       1 instances.go:518] Device: /dev/sda1 volume size: 824658735208
kip-provider-0 kip I0127 17:45:36.707965       1 instances.go:552] Started instance: i-01b2507ae043e4275
kip-provider-0 kip I0127 17:45:51.843890       1 instances.go:642] retrying err: ResourceNotReady: failed waiting for successful resource state

There's probably an issue with the volume size, as you can see from the logs: the calculated size is 500, but the size logged for /dev/sda1 is 824658735208.
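
One possible reading of the 824658735208 value, offered as an assumption rather than something confirmed in this thread: the AWS SDK for Go stores EBS volume sizes as *int64, and formatting a pointer with %d prints its address instead of the value it points to. The sketch below uses only the standard library; describeSize is a hypothetical helper, not kip code.

```go
package main

import "fmt"

// describeSize formats a volume size two ways. size must be non-nil.
// It is a pointer here because SDK request structs commonly use
// *int64 fields (an assumption about how kip builds the request).
func describeSize(size *int64) (fromPointer, fromValue string) {
	// Passing the pointer itself: %d prints the memory address.
	fromPointer = fmt.Sprintf("volume size: %d", size)
	// Dereferencing first: %d prints the stored number.
	fromValue = fmt.Sprintf("volume size: %d", *size)
	return fromPointer, fromValue
}

func main() {
	size := int64(500)
	fromPointer, fromValue := describeSize(&size)
	fmt.Println(fromPointer) // a large, run-dependent number
	fmt.Println(fromValue)
}
```

If that is what happened here, the debug log line would be misleading while the size actually sent to RunInstances could still be 500.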

hidalgopl commented 3 years ago

OK, so this PR fixes the following issues:

The retry timeout for checking whether the instance is available was too short, so I increased it.
The volume type was hardcoded to gp2 for all EBS volumes. It is now read from the AMI's volumes.
We were checking only the host state (whether it's available) without checking whether it has free compute, which resulted in attempts to run an instance on an occupied host. I've added a check that the host has available vCPUs.
We weren't checking the state of a newly allocated host. I've added a check for that.
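
The host and volume-type checks listed above could look roughly like this. It is a sketch under stated assumptions: the host struct, usable, pickHost and volumeType are hypothetical names with simplified fields, and the real kip code works with the EC2 SDK types instead. Falling back to gp2 when the AMI records no volume type is also an assumption.

```go
package main

import "fmt"

// host is a hypothetical, trimmed-down view of an EC2 dedicated host.
type host struct {
	ID             string
	State          string // e.g. "available" or "pending"
	AvailableVCPUs int64
}

// usable reports whether an instance can be placed on h: the host must
// be in the available state and still have free vCPUs.
func usable(h host) bool {
	return h.State == "available" && h.AvailableVCPUs > 0
}

// pickHost returns the ID of the first usable host, or "" if none is.
func pickHost(hosts []host) string {
	for _, h := range hosts {
		if usable(h) {
			return h.ID
		}
	}
	return ""
}

// volumeType returns the type recorded on the AMI's volume, falling
// back to gp2 when the AMI does not specify one.
func volumeType(amiVolumeType string) string {
	if amiVolumeType == "" {
		return "gp2"
	}
	return amiVolumeType
}

func main() {
	hosts := []host{
		{ID: "h-full", State: "available", AvailableVCPUs: 0},
		{ID: "h-pending", State: "pending", AvailableVCPUs: 12},
		{ID: "h-free", State: "available", AvailableVCPUs: 12},
	}
	fmt.Println(pickHost(hosts))
	fmt.Println(volumeType(""), volumeType("gp3"))
}
```

Checking state and capacity together is what separates a fully occupied host from one that can really take the instance.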

elotl / kip

check if allocated dedicated host is available (with timeout) - fixes #216 #217