eksctl-io / eksctl

The official CLI for Amazon EKS
https://eksctl.io
Other
4.84k stars 1.4k forks source link

[SPIKE] Split types StackCollection, ClusterProvider etc. and increase cohesion #4655

Closed cPu1 closed 2 years ago

cPu1 commented 2 years ago

Types StackCollection and ClusterProvider have a lot of methods, many of which do not necessarily belong in them, thereby reducing cohesion and adding too many responsibilities to a single type. Both types should be split into smaller and more focused types.

Related: #4213

As part of this spike ticket, investigate a few possible options for implementation and then we'll discuss as a team what is the best path to take.

Timebox: 2 days

Himangini commented 2 years ago

https://github.com/weaveworks/eksctl/issues/2931 Has lot of details around the problems with clusterProvider

aclevername commented 2 years ago

StackCollection

The StackCollection struct is accessed through the StackManager interface:

type StackManager interface {
    AppendNewClusterStackResource(plan, supportsManagedNodes bool) (bool, error)
    CreateStack(name string, stack builder.ResourceSet, tags, parameters map[string]string, errs chan error) error
    DeleteStackBySpec(s *Stack) (*Stack, error)
    DeleteStackBySpecSync(s *Stack, errs chan error) error
    DeleteStackSync(s *Stack) error
    DeleteTasksForDeprecatedStacks() (*tasks.TaskTree, error)
    DescribeClusterStack() (*Stack, error)
    DescribeIAMServiceAccountStacks() ([]*Stack, error)
    DescribeNodeGroupStack(nodeGroupName string) (*Stack, error)
    DescribeNodeGroupStacks() ([]*Stack, error)
    DescribeNodeGroupStacksAndResources() (map[string]StackInfo, error)
    DescribeStack(i *Stack) (*Stack, error)
    DescribeStackChangeSet(i *Stack, changeSetName string) (*ChangeSet, error)
    DescribeStackEvents(i *Stack) ([]*cloudformation.StackEvent, error)
    DescribeStacks() ([]*Stack, error)
    DoCreateStackRequest(i *Stack, templateData TemplateData, tags, parameters map[string]string, withIAM bool, withNamedIAM bool) error
    DoWaitUntilStackIsCreated(i *Stack) error
    EnsureMapPublicIPOnLaunchEnabled() error
    FixClusterCompatibility() error
    GetAutoScalingGroupName(s *Stack) (string, error)
    GetClusterStackIfExists() (*Stack, error)
    GetFargateStack() (*Stack, error)
    GetIAMAddonName(s *Stack) string
    GetIAMAddonsStacks() ([]*Stack, error)
    GetIAMServiceAccounts() ([]*v1alpha5.ClusterIAMServiceAccount, error)
    GetKarpenterStack() (*Stack, error)
    GetManagedNodeGroupTemplate(options GetNodegroupOption) (string, error)
    GetNodeGroupName(s *Stack) string
    GetNodeGroupStackType(options GetNodegroupOption) (v1alpha5.NodeGroupType, error)
    GetStackTemplate(stackName string) (string, error)
    GetUnmanagedNodeGroupSummaries(name string) ([]*NodeGroupSummary, error)
    HasClusterStackUsingCachedList(clusterStackNames []string, clusterName string) (bool, error)
    ListClusterStackNames() ([]string, error)
    ListIAMServiceAccountStacks() ([]string, error)
    ListNodeGroupStacks() ([]NodeGroupStack, error)
    ListStacks(statusFilters ...string) ([]*Stack, error)
    ListStacksMatching(nameRegex string, statusFilters ...string) ([]*Stack, error)
    LookupCloudTrailEvents(i *Stack) ([]*cloudtrail.Event, error)
    MakeChangeSetName(action string) string
    MakeClusterStackName() string
    NewClusterCompatTask() tasks.Task
    NewManagedNodeGroupTask(nodeGroups []*v1alpha5.ManagedNodeGroup, forceAddCNIPolicy bool, importer vpc.Importer) *tasks.TaskTree
    NewTaskToDeleteAddonIAM(wait bool) (*tasks.TaskTree, error)
    NewTaskToDeleteUnownedNodeGroup(clusterName, nodegroup string, eksAPI eksiface.EKSAPI, waitCondition *DeleteWaitCondition) tasks.Task
    NewTasksToCreateClusterWithNodeGroups(nodeGroups []*v1alpha5.NodeGroup, managedNodeGroups []*v1alpha5.ManagedNodeGroup, supportsManagedNodes bool, postClusterCreationTasks ...tasks.Task) *tasks.TaskTree
    NewTasksToCreateIAMServiceAccounts(serviceAccounts []*v1alpha5.ClusterIAMServiceAccount, oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter) *tasks.TaskTree
    NewTasksToDeleteClusterWithNodeGroups(stack *Stack, stacks []NodeGroupStack, deleteOIDCProvider bool, oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter, wait bool, cleanup func(chan error, string) error) (*tasks.TaskTree, error)
    NewTasksToDeleteIAMServiceAccounts(serviceAccounts []string, clientSetGetter kubernetes.ClientSetGetter, wait bool) (*tasks.TaskTree, error)
    NewTasksToDeleteNodeGroups(stacks []NodeGroupStack, shouldDelete func(_ string) bool, wait bool, cleanup func(chan error, string) error) (*tasks.TaskTree, error)
    NewTasksToDeleteOIDCProviderWithIAMServiceAccounts(oidc *iamoidc.OpenIDConnectManager, clientSetGetter kubernetes.ClientSetGetter) (*tasks.TaskTree, error)
    NewUnmanagedNodeGroupTask(nodeGroups []*v1alpha5.NodeGroup, forceAddCNIPolicy bool, importer vpc.Importer) *tasks.TaskTree
    RefreshFargatePodExecutionRoleARN() error
    StackStatusIsNotReady(s *Stack) bool
    StackStatusIsNotTransitional(s *Stack) bool
    UpdateNodeGroupStack(nodeGroupName, template string, wait bool) error
    UpdateStack(options UpdateStackOptions) error
}

This interface works quite well, and is easily mocked out throughout the codebase. However the interface itself is quite large and can/should be split out. Grouping some of the current methods into categories:

Proposal

I propose we move all of the following:

a rough outline of what the new interface might look like (just guess work):

type StackManager interface {
    CreateStack(name string, stack builder.ResourceSet, tags, parameters map[string]string, errs chan error) error
    DeleteStack(d DeleteOpts) (*Stack, error)
    GetClusterStack() (*Stack, error)
    GetIAMServiceAccountStacks() ([]*Stack, error)
    GetNodeGroupStack(nodeGroupName string) (*Stack, error)
    GetNodeGroupStacks() ([]*Stack, error)
    GetNodeGroupStacksAndResources() (map[string]StackInfo, error)
    GetStack(i *Stack) (*Stack, error)
    GetStackChangeSet(i *Stack, changeSetName string) (*ChangeSet, error)
    GetStackEvents(i *Stack) ([]*cloudformation.StackEvent, error)
    GetStacks() ([]*Stack, error)
    GetAutoScalingGroupName(s *Stack) (string, error)
    GetClusterStackIfExists() (*Stack, error)
    GetFargateStack() (*Stack, error)
    GetIAMAddonName(s *Stack) string
    GetIAMAddonsStacks() ([]*Stack, error)
    GetIAMServiceAccounts() ([]*v1alpha5.ClusterIAMServiceAccount, error)
    GetKarpenterStack() (*Stack, error)
    GetManagedNodeGroupTemplate(options GetNodegroupOption) (string, error)
    GetNodeGroupName(s *Stack) string
    GetNodeGroupStackType(options GetNodegroupOption) (v1alpha5.NodeGroupType, error)
    GetStackTemplate(stackName string) (string, error)
    GetUnmanagedNodeGroupSummaries(name string) ([]*NodeGroupSummary, error)
    GetClusterStackNames() ([]string, error)
    GetIAMServiceAccountStacks() ([]string, error)
    GetNodeGroupStacks() ([]NodeGroupStack, error)
    GetStacks(statusFilters ...string) ([]*Stack, error)
    GetStacksMatching(nameRegex string, statusFilters ...string) ([]*Stack, error)
    HasClusterStackUsingCachedList(clusterStackNames []string, clusterName string) (bool, error)
    MakeChangeSetName(action string) string
    MakeClusterStackName() string
    NewClusterCompatTask() tasks.Task
    NewManagedNodeGroupTask(nodeGroups []*v1alpha5.ManagedNodeGroup, forceAddCNIPolicy bool, importer vpc.Importer) *tasks.TaskTree
    StackStatusIsNotReady(s *Stack) bool
    UpdateStack(options UpdateStackOptions) error
}
aclevername commented 2 years ago

Goals

Reduce the complexity/split out ClusterProvider, this effort goes hand-in-hand with breaking up the pkg/eks into more manageable and cohesive pieces

The problem

ClusterProvider

The ClusterProvider struct, often referred to as ctl once instantiated, is referenced all over the codebase and contains a lot of logic. Mainly for interacting with the AWS APIs (all of the APIs, even though the package name eks would imply its just for talking to the EKS API.....)

Part of ClusterProvider is fronted behind the api.ClusterProvider interface:

pkg/
// ClusterProvider is the interface to AWS APIs
type ClusterProvider interface {
    CloudFormation() cloudformationiface.CloudFormationAPI
    CloudFormationRoleARN() string
    CloudFormationDisableRollback() bool
    ASG() autoscalingiface.AutoScalingAPI
    EKS() eksiface.EKSAPI
    EC2() ec2iface.EC2API
    ELB() elbiface.ELBAPI
    ELBV2() elbv2iface.ELBV2API
    STS() stsiface.STSAPI
    SSM() ssmiface.SSMAPI
    IAM() iamiface.IAMAPI
    CloudTrail() cloudtrailiface.CloudTrailAPI
    CloudWatchLogs() cloudwatchlogsiface.CloudWatchLogsAPI
    Region() string
    Profile() string
    WaitTimeout() time.Duration
    ConfigProvider() client.ConfigProvider
    Session() *session.Session
}

Mocking out this interface is often how we achieve a lot of mocking of API calls across the codebase. The ClusterProvider also contains a metric fuck ton of funcs/methods for interacting with the cluster that aren't mocked.

Before deciding what should/shouldn't live with this package, we need to establish what we want this package to be for. Given the package is called eks, it makes sense it acts as a layer between the code and calling the EKS API. An example of some functions that exist in this package that seem reasonable:

func (c *ClusterProvider) DescribeControlPlane(meta *api.ClusterMeta) (*awseks.Cluster, error)
func (c *ClusterProvider) SupportsManagedNodes(clusterConfig *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) ControlPlaneVersion() string
func ValidateBottlerocketSupport(controlPlaneVersion string, kubeNodeGroups []KubeNodeGroup) error
func (c *ClusterProvider) UpdatePublicAccessCIDRs(clusterConfig *api.ClusterConfig) error

All of these functions provide an example of where the eks package providers EKS-specific value/knowledge. SupportsManagedNodes inspects the cluster version and reports back wether the EKS cluster version supports managed nodegroups. UpdatePublicAccessCIDRs will update the cluster public access cidr settings. These type of func/methods that understand/can process EKS specific information belong her.

What can stay and what should be moved

pkg/eks contains:

$ ll
api.go
client.go
compatibility.go
eks.go
fargate.go
logging_retryer.go
nodegroup.go
nodegroup_service.go
tasks.go
update.go

Lets take a look at each file and see whats in it

api.go

Contains:

On the face of it a lot of this isn't really EKS specific but more general AWS related. We should move this out of this package (:heavy_check_mark: added to proposal at the end of the comment)

The rest of file is made up of:

func (c *ClusterProvider) NewStackManager(spec *api.ClusterConfig) manager.StackManager
func (c *ClusterProvider) SetAvailabilityZones(spec *api.ClusterConfig, given []string) error
func ParseConfig(data []byte) (*api.ClusterConfig, error)
func LoadConfigFromFile(configFile string) (*api.ClusterConfig, error)
func ResolveAMI(provider api.ClusterProvider, version string, np api.NodePool) error

None of these functions seem related to EKS specifically, so lets move them.

client.go

Contains code for setting up the aws-sdk-go sessions. Should be moved along side the AWS APIs code.

compatibility.go

Contains methods for validating a clusters configuration. I'm unsure after a first pass through if this really belows in eks or not. Worth further investigating

// ValidateClusterForCompatibility looks at the cluster stack and check if it's
// compatible with current nodegroup configuration, if it find issues it returns an error
func (c *ClusterProvider) ValidateClusterForCompatibility(cfg *api.ClusterConfig, stackManager manager.StackManager) error {

eks.go

Contains methods for interacting with clusters. Seems good for this package :heavy_check_mark:

func (c *ClusterProvider) DescribeControlPlane(meta *api.ClusterMeta) (*awseks.Cluster, error)
func (c *ClusterProvider) RefreshClusterStatus(spec *api.ClusterConfig) error
func (c *ClusterProvider) SupportsManagedNodes(clusterConfig *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) SupportsFargate(clusterConfig *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) RefreshClusterStatusIfStale(spec *api.ClusterConfig) error
func (c *ClusterProvider) CanOperate(spec *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) CanOperateWithRefresh(spec *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) CanUpdate(spec *api.ClusterConfig) (bool, error)
func (c *ClusterProvider) ControlPlaneVersion() string
func (c *ClusterProvider) ControlPlaneVPCInfo() awseks.VpcConfigResponse
func (c *ClusterProvider) GetCluster(clusterName string) (*awseks.Cluster, error)
func (c *ClusterProvider) WaitForControlPlane(meta *api.ClusterMeta, clientSet *kubernetes.Clientset) error
func ClusterSupportsManagedNodes(cluster *awseks.Cluster) (bool, error)
func ClusterSupportsFargate(cluster *awseks.Cluster) (bool, error)
func PlatformVersion(platformVersion string) (int, error)

Also contains some stuff not-so related to EKS that should be moved :negative_squared_cross_mark:

func (c *ClusterProvider) NewOpenIDConnectManager(spec *api.ClusterConfig) (*iamoidc.OpenIDConnectManager, error)
func (c *ClusterProvider) LoadClusterIntoSpecFromStack(spec *api.ClusterConfig, stackManager manager.StackManager) error
func (c *ClusterProvider) LoadClusterVPC(spec *api.ClusterConfig, stackManager manager.StackManager) error

fargate.go

Contains funcs for managing fargate. Definitely doesn't belong here :negative_squared_cross_mark:

func (fpt *fargateProfilesTask) Describe() string
func (fpt *fargateProfilesTask) Do(errCh chan error) error
func DoCreateFargateProfiles(config *api.ClusterConfig, fargateClient FargateClient) error
func ScheduleCoreDNSOnFargateIfRelevant(config *api.ClusterConfig, ctl *ClusterProvider, clientSet kubernetes.Interface) error

logging_retryer.go

Code for implementing our logging for the aws-sdk-go session. This should be moved along with the AWS API code changes.

nodegroup.go

Methods for validating that managed nodegroup configurations are supported by the EKS versions. Seems reasonable & EKS related :heavy_check_mark:

func ValidateFeatureCompatibility(clusterConfig *api.ClusterConfig, kubeNodeGroups []KubeNodeGroup) error
func ValidateBottlerocketSupport(controlPlaneVersion string, kubeNodeGroups []KubeNodeGroup) error
func ValidateManagedNodesSupport(clusterConfig *api.ClusterConfig) error
func VersionSupportsManagedNodes(controlPlaneVersion string) (bool, error)
func ValidateWindowsCompatibility(kubeNodeGroups []KubeNodeGroup, controlPlaneVersion string) error
func ValidateKMSSupport(clusterConfig *api.ClusterConfig, eksVersion string) error
func SupportsWindowsWorkloads(nodeGroups []KubeNodeGroup) bool
func LogWindowsCompatibility(nodeGroups []KubeNodeGroup, clusterMeta *api.ClusterMeta)

Also contains a random stack-related func that should not be here :negative_squared_cross_mark:

func (c *ClusterProvider) GetNodeGroupIAM(stackManager manager.StackManager, ng *api.NodeGroup) error

nodegroup_service.go

Methods for translating managed nodegroup configuration into real information, e.g. taking instaceSelector fields and expanding them, resolving AMIs for when AMIFamily is specified etc. All of this code seems to apply to managed and unmanaged nodegroup, which means its not necessarily EKS specific knowledge. Consider moving :negative_squared_cross_mark: , but it might also be okay to leave here :heavy_check_mark: :shrug:

func NewNodeGroupService(provider api.ClusterProvider, instanceSelector InstanceSelector) *NodeGroupService
func (m *NodeGroupService) NewAWSSelectorSession(provider api.ClusterProvider)
func (m *NodeGroupService) Normalize(nodePools []api.NodePool, clusterMeta *api.ClusterMeta) error
func (m *NodeGroupService) ExpandInstanceSelectorOptions(nodePools []api.NodePool, clusterAZs []string) error
func (m *NodeGroupService) ValidateLegacySubnetsForNodeGroups(spec *api.ClusterConfig, provider api.ClusterProvider) error

Also contains some stray validation logic that should/code be in nodegroup.go

func (m *NodeGroupService) ValidateExistingNodeGroupsForCompatibility(cfg *api.ClusterConfig, stackManager manager.StackManager) error

And a task related method :shakes-fist: :negative_squared_cross_mark: lets move this into pkg/action/nodegroup

func (m *NodeGroupService) DoAllNodegroupStackTasks(taskTree *tasks.TaskTree, region, name string) error

task.go

:panic: tasks shouldn't be here. Move to own packages/inside pkg/action/<resource> where appropriate

update.go

Methods for updating cluster settings and fetching information from the cluster (Getters in an update.go are a bit confusing, but seem vaguely related). Seems related to EKS so lets keep it here :heavy_check_mark:

func (c *ClusterProvider) GetCurrentClusterConfigForLogging(spec *api.ClusterConfig) (sets.String, sets.String, error)
func (c *ClusterProvider) UpdateClusterConfigForLogging(cfg *api.ClusterConfig) error
func (c *ClusterProvider) GetCurrentClusterVPCConfig(spec *api.ClusterConfig) (*ClusterVPCConfig, error)
func (c *ClusterProvider) UpdateClusterConfigForEndpoints(cfg *api.ClusterConfig) error
func (c *ClusterProvider) UpdatePublicAccessCIDRs(clusterConfig *api.ClusterConfig) error
func (c *ClusterProvider) EnableKMSEncryption(ctx context.Context, clusterConfig *api.ClusterConfig) error
func (c *ClusterProvider) UpdateClusterVersion(cfg *api.ClusterConfig) (*eks.Update, error)
func (c *ClusterProvider) UpdateClusterVersionBlocking(cfg *api.ClusterConfig) error

Refactors

As you can see from above a lot happens in here and a lot of it is :magic:. Tackling this will take a lot of worth. A first pass:

Once that is done we can split up the package as whole. The package is called eks and contains a lot of EKS related logic for querying nodegroups/clusters, but also contains and exposes all of the APIs for AWS in general. This means the package does too much.

nikimanoledaki commented 2 years ago

This is a good survey, and really any step towards refactoring this monster is a good start.

You set out some good initial steps that can be taken. One suggestion that I have on a first pass is to add those to the list and write specific action items (e.g. move x to y) so that we can act on them e.g. create new tickets.

aclevername commented 2 years ago

This is a good survey, and really any step towards refactoring this monster is a good start.

thanks!

You set out some good initial steps that can be taken. One suggestion that I have on a first pass is to adding those to the list and writing specific action items (e.g. move x to y) so that we can act on them e.g. create new tickets.

I've added some emojis to the different sections to try to make it clear what should be moved. I think rather than re-listing all the funcs to be moved, perhaps sub-tasks for tackling each file might be a easier way to split the work up?

Skarlso commented 2 years ago

What I would do and I think this is something that you are trying to do as well, is the following:

Doing this, should make mocking easy and specific. You don't have to deal with each and every provider on every turn, you could mock what you need specifically.

As you said, this is a lot of work. Figuring out which belongs where. I think you made a good effort of starting this work. :)

Himangini commented 2 years ago

Closing this ticket now. The next steps are to turn the proposal into actionable tickets