We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/98672332.
@mavenraven it looks like a component of logging is not running properly in your env. Could be doppler, but not sure. It would be the component that is responsible for providing the metric that the CLI expects to consume when the start command runs.
We have installed CF v212 on top of OpenStack 6.0. Cloud Foundry has been installed with bosh-init. All the VMs are up and running and `bosh cloudcheck` reports 0 errors. We are not able to push applications (see the log).
Could it be the same bug?
@santuari, the panic you are getting looks unrelated to this bug, at least from what I could see in the log. You are using v6.11.0; try updating to the latest CLI, as we have fixed a few bugs related to getting app logs.
@simonleung8 I have updated the CF CLI: pushing an application fails with a different error, and after that the `cf apps` command fails (see log). Thank you very much.
Hi @santuari
The "Staging error: no available stagers" error can happen if you do not have your DEA capacity configured in your deployment manifest. You can see the template here, and the description of what the properties do here. Specific properties you should check are `disk_mb` and `memory_mb`.
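For reference, this is a rough sketch of how that capacity might be set in the manifest. The property names `dea_next.memory_mb` and `dea_next.disk_mb` are my assumption based on the dea_next job template mentioned above, so verify them against the spec for your cf-release version:

properties:
  dea_next:
    # Assumed property names; check the dea_next job spec in your cf-release
    memory_mb: 4096    # memory the DEA advertises as available for staging/app instances
    disk_mb: 16384     # disk the DEA advertises as available for staging/app instances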
The second error, when doing a `cf apps`, looks like it might be happening when the API machine tries to contact the Health Manager to check on instance statuses. To make the request, the API must be able to access http://hm9000.172.16.0.191.xip.io.
If the API does have access, looking at the cloud_controller logs for more details on the error would probably help.
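A quick way to check both from the API VM might be something like the sketch below; the log path is an assumption based on the usual BOSH job log location, so adjust it to your deployment:

# From the API (cloud_controller) VM, check that the HM9000 endpoint is reachable
curl -v http://hm9000.172.16.0.191.xip.io

# Then look for related errors in the Cloud Controller logs (assumed standard BOSH log path)
tail -n 200 /var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.log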
Hi @jpalermo,
I am trying to debug the problem this week.
`disk_mb` and `memory_mb` are correctly configured in the manifest. I do not think the problem is related to DEA resources.
In the CC, when I push a new application, I see an error related to the creation of the routes (see here). Also, the routes created for the applications seem to be orphaned. I do not see any errors in the DEA logs.
I am currently re-deploying the whole Cloud Foundry installation without changing anything, to be sure it is not a problem with the deployment.
If the new deployment does not fix the problem, I will also destroy bosh-init and update the stemcell, bosh-cpi, cf-release... to the latest versions and deploy Cloud Foundry again.
Do you have any additional hints before I destroy bosh-init?
Thank you,
That error does not look like it happens during route creation; it happens during DEA placement of the staging task. The error occurs when no DEA can be found to stage the app on, either because there are none, or because the ones that are there do not have enough memory or disk to stage the app.
Since you said you checked the disk_mb and memory_mb, I'd guess that the API instance is having trouble registering the DEA instance. This is probably due to some sort of connectivity failure, they communicate over the NATS message bus.
Is it possible the network connection between internal components could be getting blocked?
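A simple connectivity check would be to verify that the DEA VMs can reach the NATS server. This is only a sketch; 4222 is the default NATS port, so substitute the address and port your manifest actually uses:

# From a DEA VM, check TCP connectivity to the NATS server (default port 4222)
nc -vz REPLACE_WITH_NATS_SERVER_ADDRESS 4222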
@jpalermo: Thank you for the answer. It would be very strange if the connection were getting blocked; I configured OpenStack to permit all traffic. I am trying to deploy the latest version of CF and will see if I face the same issues.
@jpalermo: I have updated Cloud Foundry, but I get the same errors.
I tried to communicate on a random port between different CF VMs using the `nc` command, and it works.
I really do not know how to solve this issue.
Hmm, ok, let's try to see if the API instance is getting the correct NATS messages. You'll need to SSH onto an API instance and then run the commands below; this should subscribe you to all NATS messages.
Look for the `dea.advertise` messages. If you see them (and they should happen every 5-10 seconds or so), they should include `stacks`, `available_memory`, and `available_disk` attributes in the message. Let me know what you see there.
# Set the correct ruby binary on the path
source /var/vcap/jobs/cloud_controller_ng/bin/ruby_version.sh
# Find a nats server address and port
grep -A3 nats /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml
# Subscribe to all nats messages
GEM_PATH=/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.1.0 /var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/2.1.0/bin/nats-sub ">" -s REPLACE_WITH_NATS_SERVER_ADDRESS_AND_PORT
@jpalermo: I can see:
[#13] Received on [dea.advertise] : '{"id":"0-df755921117f4d97bb8a39aff924f453","stacks":["cflinuxfs2"],"available_memory":24000,"available_disk":40000,"app_id_to_count":{},"placement_properties":{"zone":"z1"}}'
In the `consul_agent.stdout.log` (of all the VMs, I think) I see the following error:
[ERR] agent: failed to sync remote state: No known Consul servers
I also have the same error in bosh-lite, but there everything works fine.
Now I am getting a different error when I push applications:
FAILED StagingError
Full trace here.
So this time it looks like it was able to find a DEA to stage the app on, but staging failed for some reason.
Normally the reason for the failure would show up, but for some reason the request to the logging endpoint to get the staging logs returned a 401 error.
The logging team says that looking at the logs for the loggregator_trafficcontroller instance should help diagnose why the logging endpoint returned a 401.
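If it helps, a minimal way to look at those logs, assuming the standard BOSH job log location and the old bosh CLI job/index syntax (adjust the job name and index to your manifest):

# SSH onto the loggregator_trafficcontroller instance
bosh ssh loggregator_trafficcontroller_z1 0

# Tail its logs while reproducing the cf push (assumed standard BOSH log path)
tail -f /var/vcap/sys/log/loggregator_trafficcontroller/*.log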
This is the log of the loggregator_trafficcontroller when I issue the `cf push` command.
I am also linking my cf-stub.yml used to generate the manifest. I tried changing the network and adding DNS to reach the internet, but the changes did not fix the problem.
@jpalermo: thank you very much for your help.
Sorry, I forgot to mention: the consul_agent errors are expected. There are no consul servers enabled by default, but nothing requires them by default either. The agents will log errors, but that is ok.
The loggregator error looks like it is trying to validate the SSL cert on the ha_proxy instance, which I assume is a self-signed cert. You'll need to add this to your stub to disable SSL verification:
properties:
  ssl:
    skip_cert_verify: true
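After adding that to the stub, you would regenerate the manifest and redeploy for the change to take effect; something along these lines, assuming the generate_deployment_manifest script from cf-release and the old bosh CLI (file names are examples):

# Regenerate the deployment manifest from the updated stub (run from the cf-release directory)
./scripts/generate_deployment_manifest openstack cf-stub.yml > cf-deployment.yml

# Point bosh at the new manifest and redeploy
bosh deployment cf-deployment.yml
bosh deploy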
@jpalermo: thank you! Now the infrastructure is working.