cloudfoundry / docs-running-cf

A place for docs related to running, monitoring, and troubleshooting a Cloud Foundry deployment
Apache License 2.0

cf deploy of cf-deployment v15.1.0 failed: VM nats/${instanceId} failing, with --ps showing metrics-discovery-registrar failing #93

Closed junetam11 closed 3 years ago

junetam11 commented 3 years ago

I did the following:

  1. Deployed the BOSH director from a clone of the bosh-deployment repo (gh repo clone cloudfoundry/bosh-deployment), using bosh-lite on AWS. This succeeded, and the single instance launched with no errors.

$ bosh create-env bosh-deployment/bosh.yml --state=state.json --vars-store=creds.yml -o bosh-deployment/aws/cpi.yml -o bosh-deployment/bosh-lite.yml -o bosh-deployment/bosh-lite-runc.yml -o bosh-deployment/jumpbox-user.yml -o bosh-deployment/external-ip-with-registry-not-recommended.yml -v director_name=bosh-1 -v internal_cidr=10.0.0.0/24 -v internal_gw=10.0.0.1 -v internal_ip=10.0.0.6 -v access_key_id=${myKeyId} -v secret_access_key=${myKey} -v region=ap-southeast-1 -v az=ap-southeast-1c -v default_key_name=bosh -v default_security_groups=[bosh] --var-file private_key=${myPemPath} -v subnet_id=${myAwsSubnetId} -v external_ip=${aws_elasticip}

  1. Deployed Cloud Foundry from a clone of the cf-deployment repo (gh repo clone cloudfoundry/cf-deployment).

$ bosh -e bosh-lite -d cf deploy cf-deployment/cf-deployment.yml -o cf-deployment/operations/bosh-lite.yml --vars-store deployment-vars.yml -v system_domain=${aws_elasticip}.sslip.io

The Cloud Foundry deployment ran for about 50 minutes, creating many VMs, and then failed.

  1. bosh instances --ps reports most VM instances as running (green), except for the instance "nats/${vmId}", which is failing on the process "metrics-discovery-registrar".

$ bosh -e bosh-lite -d cf instances --ps

| Instance | Process | Process State | AZ | IPs |
|---|---|---|---|---|
| api/3e996907-6701-4891-9e5c-247fb665f021 | - | running | z1 | 10.244.0.133 |
| cc-worker/765b6ce4-f6fd-44f9-85b8-cca314a8f6e8 | - | running | z1 | 10.244.0.134 |
| credhub/2b5967cd-0ccd-4648-be47-aa3e1a876030 | - | running | z1 | 10.244.0.140 |
| database/b2cf85e1-ef8a-4225-a5a1-6fc76195fc60 | - | running | z1 | 10.244.0.129 |
| diego-api/e33a1c1c-69c4-4b15-82f5-1ebd709e339f | - | running | z1 | 10.244.0.130 |
| diego-cell/c5b5d0cf-b427-45cd-bceb-4afc4b575002 | - | running | z1 | 10.244.0.138 |
| doppler/203f3eec-4200-459b-95dd-fbc1301876c3 | - | running | z1 | 10.244.0.137 |
| log-api/d407c05b-708b-4f8b-915e-edcf95fa01e6 | - | running | z1 | 10.244.0.139 |
| nats/894d2ca9-e18a-43b2-a1ea-049d7c5dc669 | - | failing | z1 | 10.244.0.128 |
| ~ | loggr-forwarder-agent | running | - | - |
| ~ | loggr-syslog-agent | running | - | - |
| ~ | loggregator_agent | running | - | - |
| ~ | metrics-agent | running | - | - |
| ~ | metrics-discovery-registrar | unknown | - | - |
| ~ | nats | running | - | - |
| ~ | nats-tls | running | - | - |
| ~ | prom_scraper | running | - | - |
| rotate-cc-database-key/b4cfe754-accd-42b6-b1ac-a387d20f46b4 | - | - | z1 | - |
| router/1144e8db-dd07-4ee3-96f6-b3feae1c8a7d | - | running | z1 | 10.244.0.34 |
| scheduler/b07d9ce3-0b5a-4db6-bc5d-7fd0761d7644 | - | running | z1 | 10.244.0.135 |
| singleton-blobstore/7d1eeed4-51e1-47f3-989f-7dd20adf581a | - | running | z1 | 10.244.0.132 |
| smoke-tests/585412a3-ac1c-4a42-b7ea-a303ca00ecf5 | - | - | z1 | - |
| tcp-router/5f150b72-54cd-4cee-8735-ce588c517ceb | - | running | z1 | 10.244.0.136 |
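With this many rows, the unhealthy entries are easy to miss. A small filter along the following lines may help pick them out. This is only a sketch: the function name `non_running` is hypothetical, and it assumes the text output has whitespace-separated columns in the order shown above (instance, process, process state), with `~` marking a process row that belongs to the preceding instance.

```shell
# Hypothetical helper: reads `bosh instances --ps` text output on stdin and
# prints only the instances and processes whose state is not "running".
# Assumes columns are: instance, process, process state, AZ, IPs.
non_running() {
  awk '
    $1 == "Instance" { next }   # skip a header row, if one is present
    NF < 3           { next }   # skip blank or malformed lines
    $1 != "~" {                 # instance row: remember which instance we are in
      inst = $1
      if ($3 != "running" && $3 != "-") print inst, "(instance)", $3
      next
    }
    $3 != "running" { print inst, $2, $3 }   # process row under that instance
  '
}
```

For example, `bosh -e bosh-lite -d cf instances --ps | non_running` would be expected to list the nats instance and its metrics-discovery-registrar process for the output above. Errand instances with a state of `-` are skipped.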


  1. I SSHed into the problematic nats/${vmId} VM and checked the monit status. I tried restarting all processes with monit restart all, but "metrics-discovery-registrar" still failed.

nats/894d2ca9-e18a-43b2-a1ea-049d7c5dc669:/root# monit restart all
nats/894d2ca9-e18a-43b2-a1ea-049d7c5dc669:/root# monit summary
The Monit daemon 5.2.5 uptime: 1h 9m

Process 'nats' running
Process 'nats-tls' running
Process 'loggregator_agent' running
Process 'loggr-forwarder-agent' running
Process 'loggr-syslog-agent' running
Process 'prom_scraper' running
Process 'metrics-discovery-registrar' not monitored
Process 'metrics-agent' initializing
System 'system_localhost' initializing
nats/894d2ca9-e18a-43b2-a1ea-049d7c5dc669:/root#
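The monit summary shows which process is unhealthy but not why. BOSH jobs conventionally write their logs under /var/vcap/sys/log/<job-name>/ on the VM, so the tail of those files is a likely next place to look. The helper below is a minimal sketch under that assumption: `show_job_logs` and the `LOG_ROOT` override are hypothetical names, and the actual log file names for metrics-discovery-registrar are not confirmed here.

```shell
# Hypothetical helper: print the last N lines of every *.log file for a BOSH job.
# LOG_ROOT defaults to the conventional BOSH log location; override it if needed.
show_job_logs() {
  job="$1"
  lines="${2:-50}"
  root="${LOG_ROOT:-/var/vcap/sys/log}"
  dir="$root/$job"

  if [ ! -d "$dir" ]; then
    echo "no log directory: $dir" >&2
    return 1
  fi

  for f in "$dir"/*.log; do
    [ -e "$f" ] || continue     # the glob matched nothing
    echo "==> $f <=="
    tail -n "$lines" "$f"
  done
}
```

On the nats VM this could be run as `show_job_logs metrics-discovery-registrar 100`.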

  1. In the AWS security group's inbound rules, I added Custom TCP rules for ports 4222, 4223, and 4224, just to be sure I wasn't missing anything. I retried the steps above and hit the same issue: nats/${vmId} failing on the same process, "metrics-discovery-registrar".


Any help would be welcome.

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this. Unfortunately, the Pivotal Tracker project is private so you may be unable to view the contents of the story.

The labels on this GitHub issue will be updated when the story is started.

mlimonczenko commented 3 years ago

Hello @junetam11,

We were unable to prioritize this request at the time the issue was filed.

If this issue is still relevant, submit a new pull request (preferred) or a new GitHub issue.

I am closing this request. Thank you so much for your contribution.