CiscoDevNet / cloud-cml

Run Cisco Modeling Labs on cloud infrastructure
https://www.cisco.com/go/cml
Apache License 2.0

Existing VPC Support? #20

Open BobbyGR opened 4 weeks ago

BobbyGR commented 4 weeks ago

Describe the bug: We have an existing VPC, subnets, gateway, etc. How can I use this script?

To Reproduce Steps to reproduce the behavior:

  1. Predefined VPC subnet; config.yml is configured with the VPC ID/IGW

module.deploy.module.aws[0].aws_security_group.sg_tf: Modifications complete after 1s [id=sg-0a9sf092e03r]
╷
│ Error: creating EC2 Subnet: InvalidSubnet.Conflict: The CIDR '10.10.21.0/21' conflicts with another subnet
│   status code: 400, request id: XYC
│
│   with module.deploy.module.aws[0].aws_subnet.public_subnet,
│   on modules/deploy/aws/main.tf line 251, in resource "aws_subnet" "public_subnet":
│  251: resource "aws_subnet" "public_subnet" {
│
╵
broberts4@macosXYZ: cloud-cml

BobbyGR commented 4 weeks ago

It would seem that even when these are defined, it still wants to create the VPC? '# module.deploy.module.aws[0].aws_subnet.public_subnet will be created'

BobbyGR commented 3 weeks ago

hello?

rschmied commented 3 weeks ago

  # when specifying a VPC ID below then this prefix must exist on that VPC!
  public_vpc_ipv4_cidr: 10.0.0.0/16
  enable_ebs_encryption: false
  # leave empty to create a custom VPC / Internet gateway, or provide the IDs
  # of the VPC / gateway to use, they must exist and properly associated.
  # also: an IPv6 CIDR prefix must be associated with the specified VPC
  vpc_id: ""
  gw_id: ""

This is the part of the configuration that relates to the VPC and gateway. You need to set the VPC ID to the ID of the VPC that already exists.

You most likely want to also define a gateway ID that should be used on your VPC. Also take into account the address block that should be used. As the comment suggests, this CIDR block must be available on your existing VPC.

As mentioned in the README, you may need to adapt the code / HCL to your specific environment; there's likely no one-size-fits-all.

rschmied commented 3 weeks ago

But your specific error message said that the CIDR block defined in the config cannot be used because it is already used elsewhere. So, in this case, you can likely just specify a different CIDR block, which must exist on your VPC. Check what's available on the VPC and use a block from that range.

BobbyGR commented 3 weeks ago

Hi - all of this is set up, but it still tries to add it. With the 8, 0 arguments, when I give it a /21 (as in the example below) it adds 8 bits and tries to make a /29. If I define the info below, it should be skipping the creation, right?

aws:
  region: us-west-2
  availability_zone: us-west-2a
  bucket: devnetCML
  flavor: c5.2xlarge
  #flavor: m5zn.metal
  #flavor_compute: m5zn.metal
  flavor_compute: c5.2xlarge
  # This is the profile name for the S3 bucket "https://us-east-1.console.aws.amazon.com/iam/home?region=us-west-2#/roles"
  profile: s3-access-ec2
  # when specifying a VPC ID below then this prefix must exist on that VPC!
  public_vpc_ipv4_cidr: 10.209.104.0/21
  enable_ebs_encryption: false
  # leave empty to create a custom VPC / Internet gateway, or provide the IDs
  # of the VPC / gateway to use, they must exist and properly associated.
  # also: an IPv6 CIDR prefix must be associated with the specified VPC
  vpc_id: "vpc-XYZ"
  gw_id: "igw-XYZ"
  spot_instances:
    use_spot_for_controller: false
    use_spot_for_computes: false

  # module.deploy.module.aws[0].aws_subnet.public_subnet will be created
  + resource "aws_subnet" "public_subnet" {
      + arn                                            = (known after apply)
      + assign_ipv6_address_on_creation                = false
      + availability_zone                              = "us-west-2a"
      + availability_zone_id                           = (known after apply)
      + cidr_block                                     = "10.209.104.0/29"
      + enable_dns64                                   = false
      + enable_resource_name_dns_a_record_on_launch    = false
      + enable_resource_name_dns_aaaa_record_on_launch = false
      + id                                             = (known after apply)
      + ipv6_cidr_block_association_id                 = (known after apply)
      + ipv6_native                                    = false
      + map_public_ip_on_launch                        = true
      + owner_id                                       = (known after apply)
      + private_dns_hostname_type_on_launch            = (known after apply)
      + tags                                           = {
          + "Name" = "CML-public-XYZ"
        }
      + tags_all                                       = {
          + "Name" = "CML-public-XYZ"
        }
      + vpc_id                                         = "vpc-XYZ"
    }

BobbyGR commented 3 weeks ago

│ Error: creating EC2 Subnet: InvalidSubnet.Range: The CIDR '10.209.104.0/29' is invalid.

BobbyGR commented 3 weeks ago

Yes, correct. We already have all the subnets, route tables, and security groups predefined. We'd like to just add the EC2 instance only. Is there a way to have the script not create the subnets, route tables, and security groups, and just create the EC2 instance?

rschmied commented 3 weeks ago

You will need to adapt the provisioning to allow for that. The current implementation does not have a "one-size-fits-all" solution for every conceivable situation. You want to edit modules/deploy/aws/main.tf so that only required resources are created by removing the unneeded ones.

This requires identifying the existing subnet via a data source and referencing it from the network interface. You can further simplify the instance resource by removing the cluster parts (I've minimized it below). You also need to assign an IP to the interface, for example using an Elastic IP (aws_eip). Again, this depends on your specific environment.

resource "aws_network_interface" "pub_int_cml" {
  subnet_id       = data.aws_subnet.YOUR_EXISTING_ONE_HERE.id
  security_groups = [data.aws_security_group.YOUR_EXISTING_ONE_HERE.id]
}

resource "aws_instance" "cml_controller" {
  instance_type        = var.options.cfg.aws.flavor
  ami                  = data.aws_ami.ubuntu.id
  iam_instance_profile = var.options.cfg.aws.profile
  key_name             = var.options.cfg.common.key_name
  ebs_optimized        = true
  root_block_device {
    volume_size = var.options.cfg.common.disk_size
    volume_type = "gp3"
    encrypted   = var.options.cfg.aws.enable_ebs_encryption
  }
  network_interface {
    network_interface_id = aws_network_interface.pub_int_cml.id
    device_index         = 0
  }
  user_data = data.cloudinit_config.cml_controller.rendered
}
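
The data sources referenced in the snippet above are not shown there. A minimal sketch of how they and the Elastic IP might be declared follows. The IDs are placeholders, the resource name cml_controller_eip is made up for illustration, and the domain argument assumes AWS provider 5.x (older provider versions use vpc = true instead). The sketch relies on aws_network_interface.pub_int_cml from the snippet above.

```hcl
# Look up the pre-existing subnet by ID (placeholder value, replace with your own).
data "aws_subnet" "YOUR_EXISTING_ONE_HERE" {
  id = "subnet-XYZ"
}

# Look up the pre-existing security group by ID (placeholder value).
data "aws_security_group" "YOUR_EXISTING_ONE_HERE" {
  id = "sg-XYZ"
}

# Optional: attach an Elastic IP to the controller's interface.
# Hypothetical resource name; only useful if the subnet routes to an Internet gateway.
resource "aws_eip" "cml_controller_eip" {
  domain            = "vpc" # assumption: AWS provider 5.x
  network_interface = aws_network_interface.pub_int_cml.id
}
```

If the instance should only be reachable on its private address, the aws_eip resource can be left out entirely.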

orndor commented 2 weeks ago

Hi @rschmied,

I'll have to pile on with @BobbyGR here. I would guess the most common type of deployment for CML would be as BobbyGR describes above:

We like to just add the EC2 instance only. Is there a way to have the script not create the subnets, route tables, and security groups. And just only create the EC2 instance.

This was the functionality that was requested in my previous feature request. I know you've modified the deployment, attempting to solve that, but it's still not hitting the mark. I receive an error as follows:

│ Error: Error in function call
│
│   on modules/deploy/aws/main.tf line 253, in resource "aws_subnet" "public_subnet":
│  253:   cidr_block = cidrsubnet(var.options.cfg.aws.public_vpc_ipv4_cidr, 8, 0)
│     ├────────────────
│     │ while calling cidrsubnet(prefix, newbits, netnum)
│     │ var.options.cfg.aws.public_vpc_ipv4_cidr is "10.104.12.128/27"
│
│ Call to function "cidrsubnet" failed: insufficient address space to extend prefix of 27 by 8.

It appears the script is attempting to take the CIDR I provided and add 8 bits to it? I'm not sure why, if I provide the subnet in which this instance should land, it's trying to further divide that subnet into a smaller one. It appears that whatever is being called is using cidr.go: https://github.com/apparentlymart/go-cidr/blob/master/cidr/cidr.go, and taking a parent CIDR to create subnets within it.

I know you've heavily caveated usage of this script in the readme and stated one would need to adapt the provisioning on their own to suit their needs. But again, BobbyGR's and our use case is probably the most common one (deploying this as an instance in an existing VPC, within an existing subnet), so I would recommend a development path towards supporting this type of CML controller deployment.

rschmied commented 2 weeks ago

Regarding the error above: that's expected, as you've provided a /27 and it wants to cut off 8 bits from it, which does not work as you only have 5 host bits left. Besides, the minimum subnet size for AWS is a /28. To make it work, the provided CIDR block must be at least as large as a /20 (a prefix length of 20 or less), so that adding the 8 bits in the cidrsubnet function results in a /28 subnet or larger.
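
The prefix arithmetic can be checked in isolation. The following is a small standalone sketch, not code from the repository: it needs no provider, the output name is made up, and 10.209.96.0/20 is a hypothetical block chosen only for illustration. The other two inputs are the default from the config file and the /21 from this thread.

```hcl
# cidrsubnet(prefix, newbits, netnum) extends the prefix length by newbits
# and selects subnet number netnum within the parent block.
output "cidrsubnet_examples" {
  value = {
    # /16 + 8 bits = /24 (default public_vpc_ipv4_cidr)
    from_16 = cidrsubnet("10.0.0.0/16", 8, 0) # "10.0.0.0/24"

    # /20 + 8 bits = /28, the smallest subnet AWS accepts
    from_20 = cidrsubnet("10.209.96.0/20", 8, 0) # "10.209.96.0/28"

    # /21 + 8 bits = /29, which AWS rejects with InvalidSubnet.Range
    from_21 = cidrsubnet("10.209.104.0/21", 8, 0) # "10.209.104.0/29"
  }
}

# A /27 input cannot be evaluated at all: 27 + 8 exceeds the 32 bits of an
# IPv4 address, so Terraform fails with "insufficient address space to
# extend prefix of 27 by 8" before anything is sent to AWS.
```

Running terraform apply in an empty directory containing only this file, or evaluating the individual expressions in terraform console, should show the values noted in the comments.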

I've created PR #21 -- can you give this a try? See the PR text which has a bit of an explanation.

orndor commented 2 weeks ago

Thanks @rschmied. Interesting, so it is trying to make a new, smaller subnet out of the existing subnet. Is that a real use case, seen in actual deployments, that you've built this for? Isolating the CML controller in its own subnet does not seem the most efficient.

If this will remain in the code, maybe add a note to the deployment details specifying that if one provides their own subnet, it must be at least a /20.

I will give the PR a try and provide you any feedback. Thanks!

rschmied commented 2 weeks ago

Well, the idea is/was to create the entire infrastructure from scratch, including the VPC. With the VPC creation, a (private) CIDR block needs to be added and for cluster deployments, a couple of subnets are needed:

only the first one is needed in an all-in-one scenario. But, obviously, some subnets need to be carved out of the overall CIDR block.

With your scenario of an all-in-one provisioned into existing infra, I can totally see the (unneeded) complexity managing all the resources.

Would you ever need a cluster deployment? If so: How should that then work, given that there's additional resources needed like the transit gateway, NAT gateways, cluster subnet, ... Would you also want to provide all these additional resources up-front? Or have them created by the tooling? Obviously, the a-i-o approach is much easier, especially if we assume that there's no need to create any network resources.

orndor commented 2 weeks ago

Thanks for the additional background, @rschmied! Your approach probably makes deploying CML an 'easy button' for personal deployments by creating and deploying everything needed. But from an enterprise perspective, and folks with an Enterprise license of CML, it makes it the 'hard button', at least in our case.

My lens is through deploying this in an enterprise AWS environment, with governance controls and tooling already in place to deploy AWS environments with all of those components available, from which we drop or plug in services into those environments. For example, to run this script, I have to break our enterprise IAM model to get it to run (temporarily, of course). Then, if I want it to deploy all the resources via this tooling, I need to jump through additional hoops for approvals, find new IP space, provision and reserve it in IPAM, etc. Further, it doesn't deploy anything with our standardized naming scheme, so it breaks soft policy there. Overall, it just breaks the model.

With all that said, it is more difficult to have outside tooling, such as this tool, create all those things than to create them ourselves with our own internal tooling and drop in the thing we need. I don't foresee the need for clustering (I've been able to run 100 N9Kv's on an i3.metal simultaneously as a load test and there were still plenty of resources left), but if that need should ever arise, I would prefer to provide all those resources up front.

Has Cisco considered offering CML as an AMI?

rschmied commented 2 weeks ago

@orndor -- Yes. I've worked on creating an AMI. It's technically not super challenging to do. The problem is policy (as in your case, but from a different angle): there are many boxes to tick, many permissions to obtain, and a lot of corporate red tape when publishing software (in particular software with crypto routines in it) on a public cloud. I wish I had an "easy button" for this one.

BobbyGR commented 2 weeks ago

I was able to get this to work. A few things I'll need to get it to prod are just self-inflicted; having it create an SG rather than building one manually would be nice, but given the 'hey, it's existing' premise, that's understandable. Thanks for all of y'all's help. Looking forward to seeing this as an option rather than a patch.

BobbyGR commented 2 weeks ago

One thing that stops this from being flawless is the last step, which looks for state.

module.ready.data.cml2_system.state: Still reading... [10m0s elapsed]
╷
│ Error: CML2 Provider Error
│ 
│   with module.ready.data.cml2_system.state,

Everything is created correctly; what is it looking for here?

rschmied commented 2 weeks ago

The final step involves attempting to connect to the CML API to determine its "ready" status. This connection is initiated by a client running on the same computer as Terraform. Therefore, it's essential that your computer can reach the newly created AWS instance on TCP port 443 to communicate with CML.

Whether you can make a direct connection may depend on if you've assigned an Elastic IP to the instance. If not, you might need to use the instance's private IP address. If this is the case, you'll need to update the output in the configuration file. I've included a comment for guidance (see aws-mini/main.tf line 63 and aws-mini/output.tf).

You'll need to ensure that your computer (the one running Terraform) can establish this connection. If it cannot, you should remove the relevant sections from the top-level main.tf, specifically the last two blocks (provider "cml2" and module "ready"). Without these blocks, you won't receive a direct indication of when the system is ready. However, it should typically be up and running within about 10 minutes. This could be a good time to take a short break and enjoy a coffee. 😀

rschmied commented 2 weeks ago

I guess I should add that this is now down to security policies, firewall rules, routing and so on.

BobbyGR commented 1 day ago

So - I finally got this on metal and now I'm hitting issues where I can't run any nodes.

I'm getting 'cml-controller is not a simulator'

It looks like my AWS file doesn't have all of the cluster settings? So now I need to piece things together between the two to make sure the cluster settings are set up?

Running

sudo HOME=/var/local/virl2 /usr/local/bin/virl2-initial-setup.py --reconfigure

results in

No system user password set!
No controller user password set!
Non-interactive setup cannot proceed

What's weird is that everything is working as expected, except for the starting of nodes. This might be a challenge with getting this set up and working. Any insights here?

rschmied commented 15 hours ago

Hi, @BobbyGR -- my guess is that you haven't pulled the latest commit from the branch as that contains a change which specifically fixes what you describe (running nodes on the controller). https://github.com/CiscoDevNet/cloud-cml/pull/21/commits/7d013d4e12c77ccd213767bcdacc7b43809355ab

Running the initial setup script with 'reconfigure' does not work because the configuration file says it should run "non-interactive". Since it already ran via cloud-init, it removed all the passwords from the configuration file (for security reasons) but left the non-interactive flag in place. Maybe I could reset the flag to "interactive" so that reconfiguration is possible. However, I don't think that this should be required on a cloud instance, as the cloud instance should be configured correctly right from the get-go.