gardener / machine-controller-manager-provider-aws

Gardener machine controller manager provider for AWS
Apache License 2.0

Invalid MachineClass created during migration from AWSMachineClass #112

Closed mattburgess closed 1 year ago

mattburgess commented 1 year ago

What happened:

We have an AWSMachineClass and a MachineDeployment, both named deprecated-mcm-integration-test, which are applied to the cluster during our integration tests. Although MCM successfully launches instances, we see the following in the logs:

E0327 14:56:43.479695       1 machine.go:142] MachineClass.machine.sapcloud.io "deprecated-mcm-integration-test" is invalid: providerSpec: Invalid value: "null": providerSpec in body must be of type object: "null"

In addition, no MachineClass named deprecated-mcm-integration-test is created in the cluster, which we'd expect to see after a successful migration.

What you expected to happen:

We'd expect to see the AWSMachineClass successfully migrated to a MachineClass with no error logs.

We have --v=3 set to get as much logging as possible, but we don't see any logs about migration attempts. This hampers our ability to investigate the problem ourselves.

How to reproduce it (as minimally and precisely as possible):

apiVersion: machine.sapcloud.io/v1alpha1
kind: AWSMachineClass
metadata:
  name: deprecated-mcm-integration-test
  namespace: machine-controller-manager-int
nodeTemplate:
  capacity:
    cpu: 2
    gpu: 0
    memory: 8
  instanceType: m6i.large
  region: eu-west-1
  zone: eu-west-1a
provider:
  ami: ami-0c4200cdfb6cd3c5e
  blockDevices:
    - ebs:
        deleteOnTermination: true
        volumeSize: 50
        volumeType: gp2
        encrypted: true
      # /root is a special device name that AWS looks for for the root volume
      deviceName: /root
    - ebs:
        deleteOnTermination: true
        volumeSize: 64
        volumeType: gp2
        encrypted: true
      # must be /dev/sdf to match with the AMI block device for the log volume
      deviceName: /dev/sdf
    - ebs:
        deleteOnTermination: true
        volumeSize: 64
        volumeType: gp2
        encrypted: true
      deviceName: /dev/sdg # docker volume
    - ebs:
        deleteOnTermination: true
        volumeSize: 32
        volumeType: gp2
        encrypted: true
      deviceName: /dev/sdh # kubelet-volume
  iam:
    name: eu-west-1_dev_kubernetes-node-profile
  keyName: core
  machineType: m6i.large
  networkInterfaces:
    - securityGroupIDs:
        - sg-0eb0a652495874124
        - sg-94d2dff3
        - sg-09eec929ebe74d876
      subnetID: subnet-efae67b7
  region: eu-west-1
  secretRef:
    name: mcm-integration-test
    namespace: machine-controller-manager-int
  tags:
    Name: mcm-integration-test
    Role: "kubernetes-node"
    kubernetes.io/cluster/dev: shared
    kubernetes.io/role/kubernetes-node: ""
    provider: aws
    region: eu-west-1
---
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineDeployment
metadata:
  name: deprecated-mcm-integration-test
  namespace: machine-controller-manager-int
spec:
  minReadySeconds: 60
  replicas: 1
  selector:
    matchLabels:
      test-class: mcm-integration-test
  strategy:
    rollingUpdate:
      maxSurge: 2
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        test-class: mcm-integration-test
    spec:
      class:
        kind: AWSMachineClass
        name: deprecated-mcm-integration-test
      nodeTemplate:
        metadata:
          labels:
            test-class: mcm-integration-test
        spec:
          taints:
            - effect: NoSchedule
              key: test-class
              value: mcm-integration-test

Anything else we need to know:

We've manually applied a MachineClass using all the same parameters as the AWSMachineClass contains and that is correctly created in the cluster.

Environment:

Kubernetes: v1.22.17
MCM: 0.48.1
MCM-Provider-AWS: 0.17.0

himanshu-kun commented 1 year ago

I see that your AWSMachineClass uses the following field:

nodeTemplate:
  capacity:
    cpu: 2
    gpu: 0
    memory: 8
  instanceType: m6i.large
  region: eu-west-1
  zone: eu-west-1a

This field was introduced only in MachineClass, so it won't be carried over during migration if you specify it in an AWSMachineClass. I still have to investigate why the migration problem is happening, but I wanted to clarify this point.

Also, we will drop support for provider-specific machine classes starting with the next release, MCM v0.49.0, so you should switch to MachineClass now.

mattburgess commented 1 year ago

Thanks for taking a look at this, @himanshu-kun. We needed to add that nodeTemplate to the AWSMachineClass; otherwise cluster-autoscaler-0.19 and later couldn't scale a node group (MachineDeployment) from 0.

We're in the process of migrating over to MachineClass so that we can get onto a modern version of cluster-autoscaler which has some important fixes that we'd like.

himanshu-kun commented 1 year ago

otherwise cluster-autoscaler-0.19 and later couldn't scale a node group (MachineDeployment) from 0.

It can, actually, if you add this nodeTemplate to the MachineClass.
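For reference, a minimal sketch of what that might look like. It assumes the generic MachineClass layout with top-level nodeTemplate, provider, providerSpec and secretRef fields, and reuses values from the AWSMachineClass posted above. The resource name is hypothetical, and the providerSpec is abbreviated to one block device and one security group, so treat it as a starting point rather than a verified manifest.

```yaml
apiVersion: machine.sapcloud.io/v1alpha1
kind: MachineClass
metadata:
  name: mcm-integration-test   # hypothetical name, not taken from the issue
  namespace: machine-controller-manager-int
# nodeTemplate sits at the top level of the MachineClass; values copied from the issue
nodeTemplate:
  capacity:
    cpu: 2
    gpu: 0
    memory: 8
  instanceType: m6i.large
  region: eu-west-1
  zone: eu-west-1a
provider: AWS
# providerSpec carries the settings that lived in the AWSMachineClass (abbreviated here)
providerSpec:
  ami: ami-0c4200cdfb6cd3c5e
  blockDevices:
    - ebs:
        deleteOnTermination: true
        volumeSize: 50
        volumeType: gp2
        encrypted: true
      deviceName: /root
  iam:
    name: eu-west-1_dev_kubernetes-node-profile
  keyName: core
  machineType: m6i.large
  networkInterfaces:
    - securityGroupIDs:
        - sg-0eb0a652495874124
      subnetID: subnet-efae67b7
  region: eu-west-1
  tags:
    kubernetes.io/cluster/dev: shared
    kubernetes.io/role/kubernetes-node: ""
secretRef:
  name: mcm-integration-test
  namespace: machine-controller-manager-int
```

The MachineDeployment would then reference it with class.kind set to MachineClass and class.name set to the name chosen above.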

Given that you are already migrating to MachineClass, it would make sense to close the issue. Let me know if you have any concerns.

himanshu-kun commented 1 year ago

/ping @mattburgess

gardener-robot commented 1 year ago

@mattburgess ℹ️ please take some time to help himanshu-kun or redirect to someone else if you can't.

mattburgess commented 1 year ago

Sure, happy to close as our migration to MachineClasses is now underway.