Closed cPu1 closed 2 years ago
https://github.com/weaveworks/eksctl/issues/2931 Has lot of details around the problems with clusterProvider
The StackCollection
struct is accessed through the StackManager
interface:
type StackManager interface {
AppendNewClusterStackResource(plan, supportsManagedNodes bool) (bool, error)
CreateStack(name string, stack builder.ResourceSet, tags, parameters map[string]string, errs chan error) error
DeleteStackBySpec(s *Stack) (*Stack, error)
DeleteStackBySpecSync(s *Stack, errs chan error) error
DeleteStackSync(s *Stack) error
DeleteTasksForDeprecatedStacks() (*tasks.TaskTree, error)
DescribeClusterStack() (*Stack, error)
DescribeIAMServiceAccountStacks() ([]*Stack, error)
DescribeNodeGroupStack(nodeGroupName string) (*Stack, error)
DescribeNodeGroupStacks() ([]*Stack, error)
DescribeNodeGroupStacksAndResources() (map[string]StackInfo, error)
DescribeStack(i *Stack) (*Stack, error)
DescribeStackChangeSet(i *Stack, changeSetName string) (*ChangeSet, error)
DescribeStackEvents(i *Stack) ([]*cloudformation.StackEvent, error)
DescribeStacks() ([]*Stack, error)
DoCreateStackRequest(i *Stack, templateData TemplateData, tags, parameters map[string]string, withIAM bool, withNamedIAM bool) error
DoWaitUntilStackIsCreated(i *Stack) error
EnsureMapPublicIPOnLaunchEnabled() error
FixClusterCompatibility() error
GetAutoScalingGroupName(s *Stack) (string, error)
GetClusterStackIfExists() (*Stack, error)
GetFargateStack() (*Stack, error)
GetIAMAddonName(s *Stack) string
GetIAMAddonsStacks() ([]*Stack, error)
GetIAMServiceAccounts() ([]*v1alpha5.ClusterIAMServiceAccount, error)
GetKarpenterStack() (*Stack, error)
GetManagedNodeGroupTemplate(options GetNodegroupOption) (string, error)
GetNodeGroupName(s *Stack) string
GetNodeGroupStackType(options GetNodegroupOption) (v1alpha5.NodeGroupType, error)
GetStackTemplate(stackName string) (string, error)
GetUnmanagedNodeGroupSummaries(name string) ([]*NodeGroupSummary, error)
HasClusterStackUsingCachedList(clusterStackNames []string, clusterName string) (bool, error)
ListClusterStackNames() ([]string, error)
ListIAMServiceAccountStacks() ([]string, error)
ListNodeGroupStacks() ([]NodeGroupStack, error)
ListStacks(statusFilters ...string) ([]*Stack, error)
ListStacksMatching(nameRegex string, statusFilters ...string) ([]*Stack, error)
LookupCloudTrailEvents(i *Stack) ([]*cloudtrail.Event, error)
MakeChangeSetName(action string) string
MakeClusterStackName() string
NewClusterCompatTask() tasks.Task
NewManagedNodeGroupTask(nodeGroups []*v1alpha5.ManagedNodeGroup, forceAddCNIPolicy bool, importer vpc.Importer) *tasks.TaskTree
NewTaskToDeleteAddonIAM(wait bool) (*tasks.TaskTree, error)
NewTaskToDeleteUnownedNodeGroup(clusterName, nodegroup string, eksAPI eksiface.EKSAPI, waitCondition *DeleteWaitCondition) tasks.Task
NewTasksToCreateClusterWithNodeGroups(nodeGroups []*v1alpha5.NodeGroup, managedNodeGroups []*v1alpha5.ManagedNodeGroup, supportsManagedNodes bool, postClusterCreationTasks ...tasks.Task) *tasks.TaskTree
NewTasksToCreateIAMServiceAccounts(serviceAccounts []*v1alpha5.ClusterIAMServiceAccount, oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter) *tasks.TaskTree
NewTasksToDeleteClusterWithNodeGroups(stack *Stack, stacks []NodeGroupStack, deleteOIDCProvider bool, oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter, wait bool, cleanup func(chan error, string) error) (*tasks.TaskTree, error)
NewTasksToDeleteIAMServiceAccounts(serviceAccounts []string, clientSetGetter kubernetes.ClientSetGetter, wait bool) (*tasks.TaskTree, error)
NewTasksToDeleteNodeGroups(stacks []NodeGroupStack, shouldDelete func(_ string) bool, wait bool, cleanup func(chan error, string) error) (*tasks.TaskTree, error)
NewTasksToDeleteOIDCProviderWithIAMServiceAccounts(oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter) (*tasks.TaskTree, error)
NewUnmanagedNodeGroupTask(nodeGroups []*v1alpha5.NodeGroup, forceAddCNIPolicy bool, importer vpc.Importer) *tasks.TaskTree
RefreshFargatePodExecutionRoleARN() error
StackStatusIsNotReady(s *Stack) bool
StackStatusIsNotTransitional(s *Stack) bool
UpdateNodeGroupStack(nodeGroupName, template string, wait bool) error
UpdateStack(options UpdateStackOptions) error
}
This interface works quite well, and is easily mocked out throughout the codebase. However the interface itself is quite large and can/should be split out. Grouping some of the current methods into categories:
DescribeNodeGroupStacks() ([]*Stack, error)
, ListIAMServiceAccountStacks() ([]string, error)
, DescribeStacks() ([]*Stack, error)
*tasks.TaskTree
) e.g.NewTasksToCreateIAMServiceAccounts
, NewTasksToDeleteIAMServiceAccounts
, DeleteTasksForDeprecatedStacks
GetAutoScalingGroupName(s *Stack) (string, error)
, GetIAMAddonName(s *Stack) string
, GetNodeGroupName(s *Stack) string
DeleteStackSync(s *Stack) error
, DeleteStackBySpec(s *Stack) (*Stack, error)
FixClusterCompatibility() error
, AppendNewClusterStackResource(plan, supportsManagedNodes bool) (bool, error)
, EnsureMapPublicIPOnLaunchEnabled() error
RefreshFargatePodExecutionRoleARN() error
, LookupCloudTrailEvents(i *Stack) ([]*cloudtrail.Event, error)
, MakeClusterStackName() string
I propose we move all of the following:
Task
related methods out of here and into their respective pkg/actions/<resource-name>
packages. Similar to how irsa
/addon
/karpenter
/fargate
all currently workGetters
, we currently have a mix of Describe
/Get
/List
Delete
methods (@skarlso might of tried this recently?) into 1 func that has a DeleteOpts
that allow you to customise how the deletion works (via spec/name, sync/async)CreateStack
methods (CreateStack
, DoWaitUntilStackIsCreated
and DoCreateStackRequest
) into 1 funcUpdateStack
methods (UpdateNodeGroupStack
, and UpdateStack
) into 1 funcStackStatusIsNotReady
)a rough outline of what the new interface might look like (just guess work):
type StackManager interface {
CreateStack(name string, stack builder.ResourceSet, tags, parameters map[string]string, errs chan error) error
DeleteStack(d DeleteOpts) (*Stack, error)
GetClusterStack() (*Stack, error)
GetIAMServiceAccountStacks() ([]*Stack, error)
GetNodeGroupStack(nodeGroupName string) (*Stack, error)
GetNodeGroupStacks() ([]*Stack, error)
GetNodeGroupStacksAndResources() (map[string]StackInfo, error)
GetStack(i *Stack) (*Stack, error)
GetStackChangeSet(i *Stack, changeSetName string) (*ChangeSet, error)
GetStackEvents(i *Stack) ([]*cloudformation.StackEvent, error)
GetStacks() ([]*Stack, error)
GetAutoScalingGroupName(s *Stack) (string, error)
GetClusterStackIfExists() (*Stack, error)
GetFargateStack() (*Stack, error)
GetIAMAddonName(s *Stack) string
GetIAMAddonsStacks() ([]*Stack, error)
GetIAMServiceAccounts() ([]*v1alpha5.ClusterIAMServiceAccount, error)
GetKarpenterStack() (*Stack, error)
GetManagedNodeGroupTemplate(options GetNodegroupOption) (string, error)
GetNodeGroupName(s *Stack) string
GetNodeGroupStackType(options GetNodegroupOption) (v1alpha5.NodeGroupType, error)
GetStackTemplate(stackName string) (string, error)
GetUnmanagedNodeGroupSummaries(name string) ([]*NodeGroupSummary, error)
GetClusterStackNames() ([]string, error)
GetIAMServiceAccountStacks() ([]string, error)
GetNodeGroupStacks() ([]NodeGroupStack, error)
GetStacks(statusFilters ...string) ([]*Stack, error)
GetStacksMatching(nameRegex string, statusFilters ...string) ([]*Stack, error)
HasClusterStackUsingCachedList(clusterStackNames []string, clusterName string) (bool, error)
MakeChangeSetName(action string) string
MakeClusterStackName() string
NewClusterCompatTask() tasks.Task
NewManagedNodeGroupTask(nodeGroups []*v1alpha5.ManagedNodeGroup, forceAddCNIPolicy bool, importer vpc.Importer) *tasks.TaskTree
StackStatusIsNotReady(s *Stack) bool
UpdateStack(options UpdateStackOptions) error
}
Reduce the complexity/split out ClusterProvider
, this effort goes hand-in-hand with breaking up the pkg/eks
into more manageable and cohesive pieces
The ClusterProvider
struct, often referred to as ctl
once instantiated, is referenced all over the codebase and contains a lot of logic. Mainly for interacting with the AWS APIs (all of the APIs, even though the package name eks
would imply its just for talking to the EKS API.....)
Part of ClusterProvider
is fronted behind the api.ClusterProvider
interface:
pkg/
// ClusterProvider is the interface to AWS APIs
type ClusterProvider interface {
CloudFormation() cloudformationiface.CloudFormationAPI
CloudFormationRoleARN() string
CloudFormationDisableRollback() bool
ASG() autoscalingiface.AutoScalingAPI
EKS() eksiface.EKSAPI
EC2() ec2iface.EC2API
ELB() elbiface.ELBAPI
ELBV2() elbv2iface.ELBV2API
STS() stsiface.STSAPI
SSM() ssmiface.SSMAPI
IAM() iamiface.IAMAPI
CloudTrail() cloudtrailiface.CloudTrailAPI
CloudWatchLogs() cloudwatchlogsiface.CloudWatchLogsAPI
Region() string
Profile() string
WaitTimeout() time.Duration
ConfigProvider() client.ConfigProvider
Session() *session.Session
}
Mocking out this interface is often how we achieve a lot of mocking of API calls across the codebase. The ClusterProvider
also contains a metric fuck ton of funcs/methods for interacting with the cluster that aren't mocked.
Before deciding what should/shouldn't live with this package, we need to establish what we want this package to be for. Given the package is called eks
, it makes sense it acts as a layer between the code and calling the EKS API. An example of some functions that exist in this package that seem reasonable:
func (c *ClusterProvider) DescribeControlPlane(meta *api.ClusterMeta) (*awseks.Cluster, error)
func (c *ClusterProvider) SupportsManagedNodes(clusterConfig *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) ControlPlaneVersion() string
func ValidateBottlerocketSupport(controlPlaneVersion string, kubeNodeGroups []KubeNodeGroup) error
func (c *ClusterProvider) UpdatePublicAccessCIDRs(clusterConfig *api.ClusterConfig) error
All of these functions provide an example of where the eks
package providers EKS-specific value/knowledge. SupportsManagedNodes
inspects the cluster version and reports back wether the EKS cluster version supports managed nodegroups. UpdatePublicAccessCIDRs
will update the cluster public access cidr settings. These type of func/methods that understand/can process EKS specific information belong her.
pkg/eks
contains:
$ ll
api.go
client.go
compatibility.go
eks.go
fargate.go
logging_retryer.go
nodegroup.go
nodegroup_service.go
tasks.go
update.go
Lets take a look at each file and see whats in it
Contains:
func (p ProviderServices) EKS() eksiface.EKSAPI { return p.eks }
On the face of it a lot of this isn't really EKS specific but more general AWS related. We should move this out of this package (:heavy_check_mark: added to proposal at the end of the comment)
The rest of file is made up of:
func (c *ClusterProvider) NewStackManager(spec *api.ClusterConfig) manager.StackManager
func (c *ClusterProvider) SetAvailabilityZones(spec *api.ClusterConfig, given []string) error
func ParseConfig(data []byte) (*api.ClusterConfig, error)
func LoadConfigFromFile(configFile string) (*api.ClusterConfig, error)
func ResolveAMI(provider api.ClusterProvider, version string, np api.NodePool) error
None of these functions seem related to EKS specifically, so lets move them.
Contains code for setting up the aws-sdk-go sessions. Should be moved along side the AWS APIs code.
Contains methods for validating a clusters configuration. I'm unsure after a first pass through if this really belows in eks
or not. Worth further investigating
// ValidateClusterForCompatibility looks at the cluster stack and check if it's
// compatible with current nodegroup configuration, if it find issues it returns an error
func (c *ClusterProvider) ValidateClusterForCompatibility(cfg *api.ClusterConfig, stackManager manager.StackManager) error {
Contains methods for interacting with clusters. Seems good for this package :heavy_check_mark:
func (c *ClusterProvider) DescribeControlPlane(meta *api.ClusterMeta) (*awseks.Cluster, error)
func (c *ClusterProvider) RefreshClusterStatus(spec *api.ClusterConfig) error
func (c *ClusterProvider) SupportsManagedNodes(clusterConfig *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) SupportsFargate(clusterConfig *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) RefreshClusterStatusIfStale(spec *api.ClusterConfig) error
func (c *ClusterProvider) CanOperate(spec *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) CanOperateWithRefresh(spec *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) CanUpdate(spec *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) ControlPlaneVersion() string
func (c *ClusterProvider) ControlPlaneVPCInfo() awseks.VpcConfigResponse
func (c *ClusterProvider) GetCluster(clusterName string) (*awseks.Cluster, error)
func (c *ClusterProvider) WaitForControlPlane(meta *api.ClusterMeta, clientSet *kubernetes.Clientset) error
func ClusterSupportsManagedNodes(cluster *awseks.Cluster) (bool, error)
func ClusterSupportsFargate(cluster *awseks.Cluster) (bool, error)
func PlatformVersion(platformVersion string) (int, error)
Also contains some stuff not-so related to EKS that should be moved :negative_squared_cross_mark:
func (c *ClusterProvider) NewOpenIDConnectManager(spec *api.ClusterConfig) (*iamoidc.OpenIDConnectManager, error)
func (c *ClusterProvider) LoadClusterIntoSpecFromStack(spec *api.ClusterConfig, stackManager manager.StackManager) error
func (c *ClusterProvider) LoadClusterVPC(spec *api.ClusterConfig, stackManager manager.StackManager) error
Contains funcs for managing fargate. Definitely doesn't belong here :negative_squared_cross_mark:
func (fpt *fargateProfilesTask) Describe() string
func (fpt *fargateProfilesTask) Do(errCh chan error) error
func DoCreateFargateProfiles(config *api.ClusterConfig, fargateClient FargateClient) error
func ScheduleCoreDNSOnFargateIfRelevant(config *api.ClusterConfig, ctl *ClusterProvider, clientSet kubernetes.Interface) error
Code for implementing our logging
for the aws-sdk-go
session
. This should be moved along with the AWS API code changes.
Methods for validating that managed nodegroup configurations are supported by the EKS versions. Seems reasonable & EKS related :heavy_check_mark:
func ValidateFeatureCompatibility(clusterConfig *api.ClusterConfig, kubeNodeGroups []KubeNodeGroup) error
func ValidateBottlerocketSupport(controlPlaneVersion string, kubeNodeGroups []KubeNodeGroup) error
func ValidateManagedNodesSupport(clusterConfig *api.ClusterConfig) error
func VersionSupportsManagedNodes(controlPlaneVersion string) (bool, error)
func ValidateWindowsCompatibility(kubeNodeGroups []KubeNodeGroup, controlPlaneVersion string) error
func ValidateKMSSupport(clusterConfig *api.ClusterConfig, eksVersion string) error
func SupportsWindowsWorkloads(nodeGroups []KubeNodeGroup) bool
func LogWindowsCompatibility(nodeGroups []KubeNodeGroup, clusterMeta *api.ClusterMeta)
Also contains a random stack-related func that should not be here :negative_squared_cross_mark:
func (c *ClusterProvider) GetNodeGroupIAM(stackManager manager.StackManager, ng *api.NodeGroup) error
Methods for translating managed nodegroup configuration into real information, e.g. taking instaceSelector
fields and expanding them, resolving AMIs for when AMIFamily
is specified etc. All of this code seems to apply to managed and unmanaged nodegroup, which means its not necessarily EKS specific knowledge. Consider moving :negative_squared_cross_mark: , but it might also be okay to leave here :heavy_check_mark: :shrug:
func NewNodeGroupService(provider api.ClusterProvider, instanceSelector InstanceSelector) *NodeGroupService
func (m *NodeGroupService) NewAWSSelectorSession(provider api.ClusterProvider)
func (m *NodeGroupService) Normalize(nodePools []api.NodePool, clusterMeta *api.ClusterMeta) error
func (m *NodeGroupService) ExpandInstanceSelectorOptions(nodePools []api.NodePool, clusterAZs []string) error
func (m *NodeGroupService) ValidateLegacySubnetsForNodeGroups(spec *api.ClusterConfig, provider api.ClusterProvider) error
Also contains some stray validation logic that should/code be in nodegroup.go
func (m *NodeGroupService) ValidateExistingNodeGroupsForCompatibility(cfg *api.ClusterConfig, stackManager manager.StackManager) error
And a task related method :shakes-fist: :negative_squared_cross_mark: lets move this into pkg/action/nodegroup
func (m *NodeGroupService) DoAllNodegroupStackTasks(taskTree *tasks.TaskTree, region, name string) error
:panic: tasks shouldn't be here. Move to own packages/inside pkg/action/<resource>
where appropriate
Methods for updating cluster settings and fetching information from the cluster (Getters
in an update.go
are a bit confusing, but seem vaguely related). Seems related to EKS so lets keep it here :heavy_check_mark:
func (c *ClusterProvider) GetCurrentClusterConfigForLogging(spec *api.ClusterConfig) (sets.String, sets.String, error)
func (c *ClusterProvider) UpdateClusterConfigForLogging(cfg *api.ClusterConfig) error
func (c *ClusterProvider) GetCurrentClusterVPCConfig(spec *api.ClusterConfig) (*ClusterVPCConfig, error)
func (c *ClusterProvider) UpdateClusterConfigForEndpoints(cfg *api.ClusterConfig) error
func (c *ClusterProvider) UpdatePublicAccessCIDRs(clusterConfig *api.ClusterConfig) error
func (c *ClusterProvider) EnableKMSEncryption(ctx context.Context, clusterConfig *api.ClusterConfig) error
func (c *ClusterProvider) UpdateClusterVersion(cfg *api.ClusterConfig) (*eks.Update, error)
func (c *ClusterProvider) UpdateClusterVersionBlocking(cfg *api.ClusterConfig) error
As you can see from above a lot happens in here and a lot of it is :magic:. Tackling this will take a lot of worth. A first pass:
Once that is done we can split up the package as whole. The package is called eks
and contains a lot of EKS related logic for querying nodegroups/clusters, but also contains and exposes all of the APIs for AWS in general. This means the package does too much.
pkg/aws/
. The interface for ClusterProvider
will still roughly line up with this, but can be renamed to something more meaningful, e.g. AWSAPIs
or something similar.ClusterProvider
in the eks
package to something more meaningful. Hard to think of a good name but perhaps ClusterManager
(you know I love the word manager). Better names are welcome :D :D
eks
package that handles/exposes the funcs for querying the nodegroups/clusters. This makes mocking these out easier in other parts of the code.This is a good survey, and really any step towards refactoring this monster is a good start.
You set out some good initial steps that can be taken. One suggestion that I have on a first pass is to add those to the list and write specific action items (e.g. move x to y
) so that we can act on them e.g. create new tickets.
This is a good survey, and really any step towards refactoring this monster is a good start.
thanks!
You set out some good initial steps that can be taken. One suggestion that I have on a first pass is to adding those to the list and writing specific action items (e.g.
move x to y
) so that we can act on them e.g. create new tickets.
I've added some emojis to the different sections to try to make it clear what should be moved. I think rather than re-listing all the funcs to be moved, perhaps sub-tasks for tackling each file might be a easier way to split the work up?
What I would do and I think this is something that you are trying to do as well, is the following:
eks
provider, ec2
provider, ssm
provider, etcDoing this, should make mocking easy and specific. You don't have to deal with each and every provider on every turn, you could mock what you need specifically.
As you said, this is a lot of work. Figuring out which belongs where. I think you made a good effort of starting this work. :)
Closing this ticket now. The next steps are to turn the proposal into actionable tickets
Types
StackCollection
andClusterProvider
have a lot of methods, many of which do not necessarily belong in them, thereby reducing cohesion and adding too many responsibilities to a single type. Both types should be split into smaller and more focused types.Related: #4213
As part of this spike ticket, investigate a few possible options for implementation and then we'll discuss as a team what is the best path to take.
Timebox: 2 days