kartikjoshi21 opened 1 month ago
export AZURE_RESOURCE_GROUP="TestPR-1846"
export AZURE_REGION="eastus"
az group create --name "${AZURE_RESOURCE_GROUP}" \
--location "${AZURE_REGION}"
export AZURE_SUBSCRIPTION_ID=$(az account show --query id --output tsv)
export USER_ASSIGNED_IDENTITY_NAME="caa-${AZURE_RESOURCE_GROUP}"
az identity create \
  --name "${USER_ASSIGNED_IDENTITY_NAME}" \
  --resource-group "${AZURE_RESOURCE_GROUP}" \
  --location "${AZURE_REGION}" \
  --subscription "${AZURE_SUBSCRIPTION_ID}"
export PRINCIPAL_ID="$(az identity show \
  --name "${USER_ASSIGNED_IDENTITY_NAME}" \
  --resource-group "${AZURE_RESOURCE_GROUP}" \
  --subscription "${AZURE_SUBSCRIPTION_ID}" \
  --query principalId -otsv)"
sleep 30
az role assignment create \
  --role Contributor \
  --assignee-object-id "${PRINCIPAL_ID}" \
  --scope "/subscriptions/${AZURE_SUBSCRIPTION_ID}"
export AZURE_CLIENT_ID="$(az identity show \
  --resource-group "${AZURE_RESOURCE_GROUP}" \
  --name "${USER_ASSIGNED_IDENTITY_NAME}" \
  --query 'clientId' -otsv)"
export CLUSTER_NAME="e2e"
export AZURE_IMAGE_ID="/CommunityGalleries/cocopodvm-d0e4f35f-5530-4b9c-8596-112487cdea85/Images/podvm_image0/Versions/WrongImageVersion"
cat <<EOF > /tmp/provision_azure.properties
AZURE_CLI_AUTH="true"
MANAGED_IDENTITY_NAME="${USER_ASSIGNED_IDENTITY_NAME}"
KBS_IMAGE="ghcr.io/confidential-containers/staged-images/kbs"
KBS_IMAGE_TAG="dc01f454264fb4350e5f69eba05683a9a1882c41"
CAA_IMAGE="Use image with this change" # quay.io/karikjoshi21/cloud-api-adaptor:latest (can use this prebuilt image)
EOF
pushd src/cloud-api-adaptor/
ssh-keygen -t rsa -b 4096 -f install/overlays/azure/id_rsa -N "" -C dev@coco.io
pushd test
git clone git@github.com:confidential-containers/trustee.git
pushd trustee
git checkout dc01f454264fb4350e5f69eba05683a9a1882c41
popd
popd
export TEST_PROVISION_FILE=/tmp/provision_azure.properties
export CLOUD_PROVIDER=azure
export BUILTIN_CLOUD_PROVIDERS=azure
export DEPLOY_KBS=true
export TEST_PROVISION=true
pushd test/tools
make caa-provisioner-cli
./caa-provisioner-cli -action=provision
popd
2. Check the CAA pod logs
: not found
3. There shouldn't be any network interfaces created in the Azure portal.
Hmm, like we discussed offline, I am not convinced this is a good approach. I think this very check is already performed by the `createVM` API call, and I'm not sure what additional value there is in doing it manually. If we start checking preconditions before VM start, why stop at the image? We could check for subnet ID, IAM rights, instance size availability in a region, etc.
So I checked: without this pre-image-check condition, multiple network interfaces are created even when the image is not available, until the range of available addresses is exhausted. So it looks like the `CreateVM` API call doesn't actually prevent that, since the NIC was created as part of `createNetworkInterface` in the `CreateInstance` API call.
Also, we started with checking for image existence because that was our major concern: the image might be deleted or not present in that particular region. We can later extend this to add more pre-checks.
This will plug a particular hole (i.e. the image doesn't exist in this region), but there are multiple other ways VM creation can fail even if the image is present, and I don't think we can reasonably cover all those error cases. For example, if we're not allowed to spawn a VM in the resource group because of a typo, we will run into the same problems.
Two questions:
result, err := p.create(ctx, vmParameters)
if err != nil {
	if err := p.deleteDisk(context.Background(), diskName); err != nil {
		logger.Printf("deleting disk (%s): %s", diskName, err)
	}
	if err := p.deleteNetworkInterfaceAsync(context.Background(), nicName); err != nil {
		logger.Printf("deleting nic async (%s): %s", nicName, err)
	}
	return nil, fmt.Errorf("Creating instance (%v): %s", result, err)
}
After a failed VM creation we're supposed to delete the network interface, so why do the NICs still leak when VM creation fails? Maybe there is a bug in this logic.
The other question: do we have to create the network interface in a separate API call? Looking at the ARM template documentation, it should be possible to specify a subnet ID in the VM creation call directly:
"properties": {
  "ipConfigurations": [
    {
      "name": "ipconfig1",
      "properties": {
        "privateIPAllocationMethod": "Dynamic",
        "publicIPAddress": {
          "id": "[resourceId('Microsoft.Network/publicIPAddresses', parameters('publicIpName'))]"
        },
        "subnet": {
          "id": "[resourceId('Microsoft.Network/virtualNetworks/subnets', variables('virtualNetworkName'), variables('subnetName'))]"
        }
      }
    }
  ]
},
@kartikjoshi21 I think the problem with the NIC cleanup not working after a failed VM creation is that the deletion operation is async. So it will attempt to:
- create a nic
- create a vm
- fail to create vm
- trigger async delete of nic
- repeat
This way you'll end up creating a lot of NICs in a busy loop. I think we should delete the NIC synchronously; that should address the problem, and then we don't have to probe for image existence.
Yes, I checked that; it looks like this is the problem. I will modify this change. Thanks @mkulke
Add check to verify if podvm image is present
Fixes: #1842