cloudfoundry / cli

The official command line client for Cloud Foundry
https://docs.cloudfoundry.org/cf-cli
Apache License 2.0

cf cli panic #509

Closed. mavenraven closed this issue 9 years ago.

mavenraven commented 9 years ago
Aww shucks.

Something completely unexpected happened. This is a bug in cf.
Please file this bug : https://github.com/cloudfoundry/cli/issues
Tell us that you ran this command:

    cf start 7299014a-12b4-4c36-4419-b7daa7124395

using this version of the CLI:

    6.11.3-cebadc9

and that this error occurred:

    runtime error: index out of range

and this stack trace:

    goroutine 1 [running]:
main.generateBacktrace(0xc21000a000, 0x3)
    /Users/pivotal/go-agent/pipelines/Mac-OSX-Unit-Tests/src/github.com/cloudfoundry/cli/main/main.go:177 +0xa9
main.handlePanics(0x10c45d8, 0xc2100685a0)
    /Users/pivotal/go-agent/pipelines/Mac-OSX-Unit-Tests/src/github.com/cloudfoundry/cli/main/main.go:161 +0x123
runtime.panic(0x570540, 0xe7c917)
    /usr/local/go/src/pkg/runtime/panic.c:248 +0x106
github.com/cloudfoundry/cli/cf/api.(*logNoaaRepository).GetContainerMetrics(0xc2100bcd20, 0xc21044b1e0, 0x24, 0x0, 0x0, ...)
    /Users/pivotal/go-agent/pipelines/Mac-OSX-Unit-Tests/src/github.com/cloudfoundry/cli/tmp/cli_gopath/src/github.com/cloudfoundry/cli/cf/api/logs_noaa.go:68 +0x275
github.com/cloudfoundry/cli/cf/commands/application.(*ShowApp).ShowApp(0xc2103400c0, 0xc21044b1e0, 0x24, 0xc21044b270, 0x24, ...)
    /Users/pivotal/go-agent/pipelines/Mac-OSX-Unit-Tests/src/github.com/cloudfoundry/cli/tmp/cli_gopath/src/github.com/cloudfoundry/cli/cf/commands/application/app.go:112 +0x683
github.com/cloudfoun
cf-gitbot commented 9 years ago

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/98672332.

goehmen commented 9 years ago

@mavenraven it looks like a component of logging is not running properly in your env. It could be Doppler, but I am not sure. It would be the component responsible for providing the metric that the CLI expects to consume when the start command runs.
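
If it helps, a rough way to check whether the logging components are up (a sketch, assuming a standard BOSH deployment; job names vary between cf-release versions):

# From the machine where the bosh CLI is targeted at your deployment:
bosh vms          # the doppler / loggregator_trafficcontroller instances should show "running"

# Or, on the logging VM itself:
monit summary     # every process should report "running"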

santuari commented 9 years ago

We have installed CF v212 on top of OpenStack 6.0. Cloud Foundry was installed with bosh-init. All the VMs are up and running, and bosh cloudcheck reports 0 errors. We are not able to push applications (see the log).

Could this be the same bug?

simonleung8 commented 9 years ago

@santuari, the panic you are getting looks unrelated to this bug, at least from what I can see in the log. You are using v6.11.0; try updating to the latest CLI, as we have fixed a few bugs related to getting app logs.
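
For reference, a quick way to confirm which build is installed after updating (the exact output format varies by version):

cf --version
# prints something like: cf version 6.x.y-abc1234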

santuari commented 9 years ago

@simonleung8 I have updated the CF CLI: pushing an application now fails with a different error, and after that the cf apps command also fails (see log). Thank you very much.

jpalermo commented 9 years ago

Hi @santuari

The "Staging error: no available stagers" error can happen if you do not have your DEA capacity configured in your deployment manifest. You can see the template here, and the description of what the properties do here. The specific properties you should check are disk_mb and memory_mb.

The second error, when doing a cf apps, looks like it might be happening when the API machine tries to contact the Health Manager to check on instance statuses. To make the request, the API must be able to access http://hm9000.172.16.0.191.xip.io.

If the API does have access, looking at the cloud_controller logs for more details on the error would probably help.
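
If it helps, a rough sketch of those two checks from the api VM (the log path assumes the standard BOSH job layout and may differ in your deployment):

# Is the HM9000 endpoint reachable from the api VM at all?
curl -v http://hm9000.172.16.0.191.xip.io

# Then look for related errors in the Cloud Controller logs
grep -i hm9000 /var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log | tail -n 20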

santuari commented 9 years ago

Hi @jpalermo, I am trying to debug the problem this week. disk_mb and memory_mb are correctly configured in the manifest, so I do not think the problem is related to DEA resources. In the CC, when I push a new application, I see an error related to the creation of the routes (see here), and the routes created for the applications seem to be orphaned. I do not see any error in the DEA logs. I am currently re-deploying the whole Cloud Foundry installation without changing anything, to be sure it is not a problem with the deployment. If the new deployment does not fix the problem, I will also destroy bosh-init, update the stemcell, bosh-cpi, cf-release, ... to the latest versions, and deploy Cloud Foundry again. Do you have additional hints before destroying bosh-init? Thank you,

jpalermo commented 9 years ago

That error does not look like it is happening during route creation; it is happening during DEA placement of the staging task. The error occurs when it cannot find a DEA to stage the app on, either because there are none, or because the ones that are there do not have enough memory or disk to stage the app.

Since you said you checked disk_mb and memory_mb, I'd guess that the API instance is having trouble registering the DEA instance. This is probably due to some sort of connectivity failure; they communicate over the NATS message bus.

Is it possible the network connection between internal components could be getting blocked?
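
One quick way to rule out a blocked connection is to test the NATS port directly (a sketch; 4222 is the default NATS port, and the address should come from the nats section of your manifest):

# From a DEA or API VM, check that the NATS port is reachable
nc -zv REPLACE_WITH_NATS_SERVER_ADDRESS 4222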

santuari commented 9 years ago

@jpalermo: Thank you for the answer. It would be very strange if the connection were being blocked; I configured OpenStack to permit all traffic. I am now deploying the latest version of CF and will see whether I face the same issues.

santuari commented 9 years ago

@jpalermo: I have deployed the new, updated Cloud Foundry, but I get the same errors. I tried communicating on a random port between different CF VMs using the nc command, and it works. I really do not know how to solve this issue.

jpalermo commented 9 years ago

Hmm, ok, let's try to see if the API instance is getting the correct NATS messages. You'll need to SSH onto an API instance and then run the commands below. This should subscribe you to all NATS messages.

Look for the dea.advertise messages. If you see them (and they should happen every 5-10 seconds or so), they should include stacks, available_memory and available_disk attributes in the message. Let me know what you see there.

# Set the correct ruby binary on the path
source /var/vcap/jobs/cloud_controller_ng/bin/ruby_version.sh

# Find a nats server address and port
grep -A3 nats /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml

# Subscribe to all nats messages
GEM_PATH=/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.1.0 /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.1.0/bin/nats-sub ">" -s REPLACE_WITH_NATS_SERVER_ADDRESS_AND_PORT

santuari commented 9 years ago

@jpalermo: I can see: [#13] Received on [dea.advertise] : '{"id":"0-df755921117f4d97bb8a39aff924f453","stacks":["cflinuxfs2"],"available_memory":24000,"available_disk":40000,"app_id_to_count":{},"placement_properties":{"zone":"z1"}}'

In the consul_agent.stdout.log (of all the VMs, I think) I see the following error: [ERR] agent: failed to sync remote state: No known Consul servers. I also see the same error in bosh-lite, but everything is working fine there.

Now I am getting a different error when I push applications: FAILED StagingError. Full trace here.

jpalermo commented 9 years ago

So this time it looks like it was able to find the DEA to stage the app on, but staging failed for some reason.

Normally the reason for the failure would show up, but for some reason the request to the logging endpoint to get the staging logs returned a 401 error.

The logging team says that looking at the logs for the loggregator_trafficcontroller instance should help diagnose why the logging endpoint returned a 401.
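
For example, something along these lines on the loggregator_trafficcontroller VM (the exact log file names depend on the release version):

# Tail the traffic controller logs while re-running cf push
tail -f /var/vcap/sys/log/loggregator_trafficcontroller/*.log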

santuari commented 9 years ago

This is the log of the loggregator_trafficcontroller when I issue the cf push command. I also link my cf-stub.yml, which I used to generate the manifest. I tried changing the network and adding DNS so the VMs can reach the internet, but the changes did not fix the problem. @jpalermo: thank you very much for your help.

jpalermo commented 9 years ago

Sorry, I forgot to mention: the consul_agent errors are expected. There are no Consul servers enabled by default, but neither is there anything that requires them by default. The agents will log errors, but that is ok.

The loggregator error looks like it is trying to validate the SSL cert on the ha_proxy instance, which I assume is a self-signed cert. You'll need to add this to your stub to disable SSL verification:

properties:
  ssl:
    skip_cert_verify: true
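
After updating the stub you will need to regenerate the manifest and redeploy, roughly like this (the script name and arguments assume the cf-release tooling of that era; adjust to however you generated the manifest originally):

# From the cf-release directory: regenerate the manifest from the stub
./scripts/generate_deployment_manifest openstack cf-stub.yml > cf-deployment.yml

# Point bosh at the new manifest and redeploy
bosh deployment cf-deployment.yml
bosh deploy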

santuari commented 9 years ago

@jpalermo: thank you! Now the infrastructure is working.