Diagnose deploy-bosh/deploy-staging-bosh failures

ghost commented 8 years ago

Determine why new builds are failing. Differences noted by bosh deploy:

Old versions: BOSH: 257.1 stemcell: bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3262 bosh-aws-cpi: 53

New versions: BOSH: 257.3 stemcell: bosh-aws-xen-hvm-ubuntu-trusty-go_agent/3262.2 bosh-aws-cpi: 54

ghost commented 8 years ago

Apparently, VMs using this stemcell are failing to boot, at least is us-east-1

https://cloudfoundry.slack.com/archives/bosh/p1467923069001921

@jmcarp and I observed that the System Log for the VMs aren't even displayed in the AWS Console; the VM shuts down and terminates before anything is logged.

ghost commented 8 years ago

I manually replaced our existing stemcell 3262.2 (built with Jenkins) with one that Pivotal built with concourse, and now bosh staging deploys.

I see some differences between the two stemcells and am working with the Pivotal guys to figure out what’s up.

Production bosh updated too.

ghost commented 8 years ago

Looking at the debug logs, I found the issue to be that bosh was using an invalid, locally-created ami (ami-0766de10) instead of the one specified in the bosh.io 3262.2 stemcell (ami-613ef00c).

Note the Description below:

AMI used when VM creation was failing:

aws> ec2 describe-images --filters "Name=image-id,Values=ami-0766de10" 
{
    "Images": [
        {
            "VirtualizationType": "him", 
            "Name": "BOSH-9f931b47-049e-40ab-ba9a-a93081e952ce", 
            "Hypervisor": "xen", 
            "ImageId": "ami-0766de10", 
            "State": "available", 
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda", 
                    "Ebs": {
                        "DeleteOnTermination": true, 
                        "SnapshotId": "snap-ec63530a", 
                        "VolumeSize": 3, 
                        "VolumeType": "standard", 
                        "Encrypted": false
                    }
                }
            ], 
            "Architecture": "x86_64", 
            "ImageLocation": "144433228153/BOSH-9f931b47-049e-40ab-ba9a-a93081e952ce", 
            "RootDeviceType": "abs", 
            "OwnerId": "144433228153", 
            "RootDeviceName": "/dev/xvda", 
            "CreationDate": "2016-06-30T19:37:18.000Z", 
            "Public": true, 
            "ImageType": "machine", 
            "Description": "Cloud.gov Bosh Lite Stemcell"
        }
    ]
}

AMI specified by the bosh.io 3262.2 stemcell:

aws> ec2 describe-images --filters "Name=image-id,Values=ami-613ef00c" 
{
    "Images": [
        {
            "VirtualizationType": "him", 
            "Name": "BOSH-cefe0a60-8cc1-4a64-8666-f9441518cea3", 
            "Hypervisor": "xen", 
            "SriovNetSupport": "simple", 
            "ImageId": "ami-613ef00c", 
            "State": "available", 
            "BlockDeviceMappings": [
                {
                    "DeviceName": "/dev/xvda", 
                    "Ebs": {
                        "DeleteOnTermination": true, 
                        "SnapshotId": "snap-ff3df4eb", 
                        "VolumeSize": 3, 
                        "VolumeType": "standard", 
                        "Encrypted": false
                    }
                }, 
                {
                    "DeviceName": "/dev/sdb", 
                    "VirtualName": "ephemeral0"
                }
            ], 
            "Architecture": "x86_64", 
            "ImageLocation": "138312800438/BOSH-cefe0a60-8cc1-4a64-8666-f9441518cea3", 
            "RootDeviceType": "abs", 
            "OwnerId": "138312800438", 
            "RootDeviceName": "/dev/xvda", 
            "CreationDate": "2016-06-28T23:41:03.000Z", 
            "Public": true, 
            "ImageType": "machine", 
            "Description": "bosh-aws-xen-ubuntu-trusty-go_agent 3262.2"
        }
    ]
}

ghost commented 8 years ago

The AMI id was found in the bosh task nnnn --debug logs.

rogeruiz commented 8 years ago

👀 https://www.pivotaltracker.com/n/projects/1133984/stories/124181901

ghost commented 8 years ago

This issue was due to cg-aws-light-stemcell-builder being broken. In addition, the deployed deploy-bosh pipeline is pulling from bosh.io, rather than pulling from our S3 bucket, the stemcell output location from cg-aws-light-stemcell-builder .

ghost commented 8 years ago

When the pipeline was updated to pull from our S3 bucket, concourse reported "error running command: missing path in request”. This was apparently due to worker caching and the resolution was to rename the resource.

According to this post in Slack in the cloudfoundry #bosh channel, possible solutions were to either rename the resource or recreate the worker:

https://cloudfoundry.slack.com/archives/bosh/p1467926131001983

jmcarp commented 8 years ago

PR against upstream here: https://github.com/cloudfoundry-incubator/aws-light-stemcell-builder/pull/4

cloud-gov / cg-atlas

Diagnose deploy-bosh/deploy-staging-bosh failures #105