Closed ndegory closed 2 years ago
@ndegory Thanks for opening the detailed issue and a follow-up PR ⭐ We will review the issue and PR soon. 👍🏻
Hi @ndegory.
You are correct. eksctl
is not declarative. It's imperative. Meaning, you have to run a delete than create if something went wrong. It won't detect existing things.
@Skarlso, fair enough, but right now, under certain conditions, the iamserviceaccount command lets you think the cluster state matches the specifications from the config server YAML file (because no actions, no errors), although some resources are not in the expected state. This action is not atomic, which is problematic for an imperative command. The PR I submitted for this ticket may go too far (I understand there may be concerns about deleting a previously failed stack), if that's the case please tell me, and I'll reduce the scope by only adding more controls for a better output in that particular use case.
@ndegory We decided to pull this into planning and will think of a nice solution that will still leave the command consistent with other commands. :) There was something similar previously that deals with the nature of iamserviceaccount commands here: https://github.com/weaveworks/eksctl/issues/4941. It's not similar in the problem but similar in the nature that existing or non-existing resources throws off the create command and leaves things in an inconsistent state.
Maybe we can still do something here that will not result in a problematic environment or is more user friendly. We'll discuss this with the team.
We will reproduce this on our side and help with the PR for adding the validations
@ndegory Ok, so, the decision is as follows... create will still not be much aware about the circumstances and the infrastructure you run it on. If there were things that you created outside of eksctl that won't really matter for eksctl. Again, it's not a declarative tool.
That said! We can certainly improve upon this part:
Creation of a cluster, with a config file including an IAM service account referring to an IAM policy not yet created (the missing pre-requisite):
Mainly this: (the missing pre-requisite).
Are you willing to adjust your PR to do a check for this resource to exist and only proceed if yes? :)
@Skarlso , yes, I can give it a try. Checking that the IAM policy exists, when the specs specify an IAM policy attachment.
I would also like to add one more thing, which was kind of covered by the current PR, which is to react differently when there's an existing Cloudformation stack for that IAM service account, and exit in error when that existing stack is in ROLLBACK_COMPLETE status, instead of the current behavior which ignores it and considers all is well and there's nothing to do. Would that be ok for you?
@ndegory Sadly, that would be going too far. I mean, that could result in detecting a stack which isn't the stack you want. Or happens to be there because of the same name. If there would have been a stack that had been created during the create and would have failed the create would have failed, right?
Or are you saying that the create just happily jugged on even if the stack failed to CREATE_COMPLETE? If there was a stack created separately from iamserviceaccount create
that would be outside the scope of the create command.
@Skarlso , correct, a stack created by a previous call to iamserviceaccount create
. This is the Cloud, so we know issues can arise (networking, API errors, etc.) resulting in a failure that could be resolved if retried later.
The current problem with this command is that once it fails, retrying it without first deleting the rolled back stack doesn't complain about the state of the stack, it considers no action has to be performed. My suggestion is that the iamserviceaccount create
command should detect that something's wrong and alert the user that something has to be done (such as a manual deletion of the rolled back stack).
My suggestion is that the iamserviceaccount create command should detect that something's wrong and alert the user that something has to be done (such as a manual deletion of the rolled back stack).
Yeah, okay, that is a fair point. I agree to that. Thanks for the explanation!
What I was trying to convey is that it should warn the user and not attempt to remedy the situation. Is that okay?
we're aligned!
Excellent! :)
the PR in question is in the draft, moving this ticket to the Blocked column until the PR is ready to review
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
still topical
@ndegory are you still working on this? you had a PR open but seems like it's closed?
This issue was closed because it has been stalled for 5 days with no activity.
What were you trying to accomplish?
Infrastructure provisioning workflow with 2 steps, first Terraform for IaaS resources, and second eksctl for EKS related resources. The Terraform job includes creation of custom IAM policies that are used by the service accounts defined in the EKS cluster config. When the configuration is not consistent between these two steps, the EKS related job may fail. Fixing it should only require to fix the configuration and run the pipeline again.
What happened?
Creation of IAM resources and Kubernetes service account with the
eksctl create cluster
oreksctl create iamserviceaccount
command fails when pre-requisites are not there (for instance an IAM policy). Fixing the Terraform configuration is enough to let the Terraform job fix the pre-requisites, but when the EKS job runs, it fails to recover.This is caused by:
delete iamserviceaccount
command for the service account impacted by the issue, and then run thecreate iamserviceaccount
command again with the cluster config file, but this is not compatible with a declarative approach.How to reproduce it?
Creation of a cluster, with a config file including an IAM service account referring to an IAM policy not yet created (the missing pre-requisite):
Notice the last service account (another-app), the policy has deliberately not been created.
Logs
Versions
I'll proceed with a PR that implements a more reliable workflow.