cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0
201 stars 212 forks source link

Failed to start application if diego is enabled. #86

Closed samdai closed 9 years ago

samdai commented 9 years ago

I followed the readme of https://github.com/cloudfoundry-incubator/diego-release/ to deploy diego to bosh-lite. When I push a java application to Diego. If I don't enable diego, the application can be started, for example "cf push test -p ./test.war", but if I enable diego, can't start the application. my command is as following:

  1. cf push test -p ./test.war --no-start
  2. cf enable-diego test
  3. cf start test The error message is as following: Starting app test in org diego / space diego as admin... Creating container Failed to create container FAILED StagingError TIP: use 'cf logs test --recent' for more information

    The environment is as following: The version of my machine's operation system is OS X Yosemite 10.10.5. The version of cf-release is V220. The version of diego-release is v0.1434.0

cf-gitbot commented 9 years ago

We have created an issue in Pivotal Tracker to manage this. You can view the current status of your issue at: https://www.pivotaltracker.com/story/show/105728076.

wdxxs2z commented 9 years ago

Hi @samdai

You can try,use "cf push app_name -p app_path -b java_buildpack --no-start",then enable diego and start.

Or,you can make yourself dockerfile and push it to docker hub then use "cf docker-push app_name user/app:tag ".

emalm commented 9 years ago

Hi, @samdai,

Based on the Failed to create container error message, it looks like you are running into an issue just trying to create a container on the Diego cell in your BOSH-Lite deployment. If the problem persists, could you extract some of the logs from the cell right after one of the failures? Running bosh logs cell_z1 0 should capture all of the relevant log files.

Thanks, Eric, CF Runtime Diego PM

samdai commented 9 years ago

wdxxs2z: I have tried "cf push app_name -p app_path -b java_buildpack --no-start", got the same error. Eric: the result of bosh logs cell_z1 0 is as below: Acting as user 'admin' on deployment 'cf-warden-diego' on 'Bosh Lite Director' Error 70001: `cf-warden-diego/cell_z1/0' doesn't exist

emalm commented 9 years ago

Hi, @samdai,

Ok, sounds like there's some issue with your Diego BOSH deployment. Could you provide the output of bosh vms from your BOSH-Lite? If you run bosh cloudcheck, it may also find and allow you to correct any missing VMs in your deployment.

Thanks, Eric

samdai commented 9 years ago

Hi Eric, The output of 'bosh vms' is as below: Acting as user 'admin' on 'Bosh Lite Director' Deployment `cf-warden'

Director task 17

Task 17 done

+------------------------------------+---------+---------------+--------------+ | Job/index | State | Resource Pool | IPs | +------------------------------------+---------+---------------+--------------+ | api_z1/0 | running | large_z1 | 10.244.0.134 | | consul_z1/0 | running | small_z1 | 10.244.0.54 | | doppler_z1/0 | running | medium_z1 | 10.244.0.142 | | etcd_z1/0 | running | medium_z1 | 10.244.0.42 | | ha_proxy_z1/0 | running | router_z1 | 10.244.0.34 | | hm9000_z1/0 | running | medium_z1 | 10.244.0.138 | | loggregator_trafficcontroller_z1/0 | running | small_z1 | 10.244.0.146 | | nats_z1/0 | running | medium_z1 | 10.244.0.6 | | postgres_z1/0 | running | medium_z1 | 10.244.0.30 | | router_z1/0 | running | router_z1 | 10.244.0.22 | | runner_z1/0 | running | runner_z1 | 10.244.0.26 | | uaa_z1/0 | running | medium_z1 | 10.244.0.130 | +------------------------------------+---------+---------------+--------------+

VMs total: 12 Deployment `cf-warden-diego'

Director task 18

Task 18 done

+----------------+---------+---------------+--------------+ | Job/index | State | Resource Pool | IPs | +----------------+---------+---------------+--------------+ | colocated_z1/0 | running | colocated_z1 | 10.244.16.46 | +----------------+---------+---------------+--------------+

VMs total: 1

I also ran 'bosh cloudcheck", hasn't found problem, the result is as below:

Acting as user 'admin' on deployment 'cf-warden-diego' on 'Bosh Lite Director' Performing cloud check...

Director task 19 Started scanning 1 vms Started scanning 1 vms > Checking VM states. Done (00:00:00) Started scanning 1 vms > 1 OK, 0 unresponsive, 0 missing, 0 unbound, 0 out of sync. Done (00:00:00) Done scanning 1 vms (00:00:00)

Started scanning 1 persistent disks Started scanning 1 persistent disks > Looking for inactive disks. Done (00:00:00) Started scanning 1 persistent disks > 1 OK, 0 missing, 0 inactive, 0 mount-info mismatch. Done (00:00:00) Done scanning 1 persistent disks (00:00:00)

Task 19 done

Started 2015-10-17 01:31:30 UTC Finished 2015-10-17 01:31:30 UTC Duration 00:00:00

Scan is complete, checking if any problems found... No problems found

emalm commented 9 years ago

Hi, @samdai,

I see, you've deployed Diego with all the services collocated on one VM, so you've probably generated your manifest using the colocated instance count stub at https://github.com/cloudfoundry-incubator/diego-release/blob/develop/manifest-generation/bosh-lite-stubs/colocated-instance-count-overrides.yml. Since you're deploying to BOSH-Lite, where there isn't a strong cost to deploying more 'VMs', I would instead recommend you deploy with services grouped in separate VMs, as is provided by the 'standard' instance count stub at https://github.com/cloudfoundry-incubator/diego-release/blob/develop/manifest-generation/bosh-lite-stubs/instance-count-overrides.yml. This is also the stub used by default in the generate-bosh-lite-manifests script in this repository.

That said, I think I've found why you're failing to create a container: the list of job templates used on the 'colocated' VM is https://github.com/cloudfoundry-incubator/diego-release/blob/develop/manifest-generation/diego.yml#L828-L858, and it omits the 'rootfses' job template that's included on the 'cell' VM type in https://github.com/cloudfoundry-incubator/diego-release/blob/develop/manifest-generation/diego.yml#L786-L787. As a result, when you push the WAR file as a buildpack-based CF app, the Diego components on that 'colocated' VM can't find the rootfs for the cflinuxfs2 stack, and so they fail to create the container.

If you'd like to try fixing your current deployment, you can try adding that 'rootfses' job template to the list of job templates for the 'colocated_z1' job in your current deployment manifest. I've noticed at least one other problem with that colocated VM type, and in general it's not something we're investing in testing in an automated way, so I've created https://www.pivotaltracker.com/story/show/105911952 to remove it from the manifest-generation templates.

Thanks, Eric

samdai commented 9 years ago

Hi Eric, I changed to use the 'standard' instance count stub, but still failed to start application. When I tried

  1. cf push test -p ./test.war --no-start
  2. cf enable-diego test
  3. cf start test The error message is as following:
Starting app test in org diego / space diego as admin...
Creating container
Successfully created container
Downloading app package...
Downloaded app package (885.7K)
No buildpack specified; fetching standard buildpacks to detect and build your application.
Downloading buildpacks (staticfile_buildpack, java_buildpack, ruby_buildpack, nodejs_buildpack, go_buildpack, python_buildpack, php_buildpack, binary_buildpack)...
Downloading python_buildpack...
...
Downloading java_buildpack...
Downloaded java_buildpack
...
Downloaded buildpacks
Staging...
Staging failed

When I execute bosh logs cell_z1 0, the error message in garden.stderr.log is

df: Warning: cannot read table of mounted file systems: No such file or directory

When I execute cf push test -p ./test.war, can start application, but diego is not enabled. the message is as below:

Starting app test in org diego / space diego as admin...
-----> Downloaded app package (888K)
-----> Java Buildpack Version: v3.2 | https://github.com/cloudfoundry/java-buildpack.git#3b68024
-----> Downloading Open Jdk JRE 1.8.0_60 from https://download.run.pivotal.io/openjdk/trusty/x86_64/openjdk-1.8.0_60.tar.gz (1m 6s)
       Expanding Open Jdk JRE to .java-buildpack/open_jdk_jre (0.8s)
-----> Downloading Open JDK Like Memory Calculator 2.0.0_RELEASE from https://download.run.pivotal.io/memory-calculator/trusty/x86_64/memory-calculator-2.0.0_RELEASE.tar.gz (7.3s)
       Memory Settings: -Xms160M -XX:MetaspaceSize=64M -Xss853K -Xmx160M -XX:MaxMetaspaceSize=64M
-----> Downloading Tomcat Instance 8.0.27 from https://download.run.pivotal.io/tomcat/tomcat-8.0.27.tar.gz (10.8s)
       Expanding Tomcat to .java-buildpack/tomcat (0.1s)
-----> Downloading Tomcat Lifecycle Support 2.4.0_RELEASE from https://download.run.pivotal.io/tomcat-lifecycle-support/tomcat-lifecycle-support-2.4.0_RELEASE.jar (0.3s)
-----> Downloading Tomcat Logging Support 2.4.0_RELEASE from https://download.run.pivotal.io/tomcat-logging-support/tomcat-logging-support-2.4.0_RELEASE.jar (0.2s)
-----> Downloading Tomcat Access Logging Support 2.4.0_RELEASE from https://download.run.pivotal.io/tomcat-access-logging-support/tomcat-access-logging-support-2.4.0_RELEASE.jar (0.2s)

-----> Uploading droplet (52M)

1 of 1 instances running

App started

OK

App test was started using this command `CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-2.0.0_RELEASE -memorySizes=metaspace:64m.. -memoryWeights=heap:75,metaspace:10,native:10,stack:5 -memoryInitials=heap:100%,metaspace:100% -totMemory=$MEMORY_LIMIT) &&  JAVA_HOME=$PWD/.java-buildpack/open_jdk_jre JAVA_OPTS="-Djava.io.tmpdir=$TMPDIR -XX:OnOutOfMemoryError=$PWD/.java-buildpack/open_jdk_jre/bin/killjava.sh $CALCULATED_MEMORY -Daccess.logging.enabled=false -Dhttp.port=$PORT" $PWD/.java-buildpack/tomcat/bin/catalina.sh run`

Showing health and status for app test in org diego / space diego as admin...
OK

requested state: started
instances: 1/1
usage: 256M x 1 instances
urls: test.bosh-lite.com
last uploaded: Mon Oct 19 00:37:35 UTC 2015
stack: cflinuxfs2
buildpack: java-buildpack=v3.2-https://github.com/cloudfoundry/java-buildpack.git#3b68024 open-jdk-like-jre=1.8.0_60 open-jdk-like-memory-calculator=2.0.0_RELEASE tomcat-access-logging-support=2.4.0_RELEASE tomcat-instance=8.0.27 tomcat-lifecycle-support=2.4.0_RE...

     state     since                    cpu    memory           disk      details   
#0   running   2015-10-19 08:43:55 AM   0.0%   142.4M of 256M   0 of 1G 
Amit-PivotalLabs commented 9 years ago

Couple questions/suggestions:

  1. Can you try pushing the app with the "-b java_buildpack" flag? This will make the trial-and-error cycle a bit faster, and eliminate several possible sources of error (e.g. it eliminates the possibility that the error is in other buildpacks, or in the detect phase of the java_buildpack).
  2. How much time do you typically see between "Staging..." and "Staging failed"?
  3. Immediately after it fails, can you do "cf logs test --recent" to get more detailed logs from the failing staging process?
  4. Immediately after it fails, fetch "bosh logs cell_z1 0". There are many files there in addition to garden logs. I suspect there will be error information around the time of the staging failure in many log files, and each one may have multiple error lines in addition to the "df" line you showed. For example there might be something indicating what file/directory it was looking for that led to the "No such file or directory" error. Can you look for more logs in more files relevant to the failure, and timestamped around the time of the failure, so we can get more contextual information?

On Sun, Oct 18, 2015 at 6:00 PM, sam notifications@github.com wrote:

Hi Eric, I changed to use the 'standard' instance count stub, but still failed to start application. When I tried

  1. cf push test -p ./test.war --no-start
  2. cf enable-diego test
  3. cf start test The error message is as following:

Starting app test in org diego / space diego as admin... Creating container Successfully created container Downloading app package... Downloaded app package (885.7K) No buildpack specified; fetching standard buildpacks to detect and build your application. Downloading buildpacks (staticfile_buildpack, java_buildpack, ruby_buildpack, nodejs_buildpack, go_buildpack, python_buildpack, php_buildpack, binary_buildpack)... Downloading python_buildpack... ... Downloading java_buildpack... Downloaded java_buildpack ... Downloaded buildpacks Staging... Staging failed

When I execute bosh logs cell_z1 0, the error message in garden.stderr.log is

df: Warning: cannot read table of mounted file systems: No such file or directory

When I execute cf push test -p ./test.war, can start application, but diego is not enabled. the message is as below:

Starting app test in org diego / space diego as admin... -----> Downloaded app package (888K) -----> Java Buildpack Version: v3.2 | https://github.com/cloudfoundry/java-buildpack.git#3b68024 -----> Downloading Open Jdk JRE 1.8.0_60 from https://download.run.pivotal.io/openjdk/trusty/x86_64/openjdk-1.8.0_60.tar.gz (1m 6s) Expanding Open Jdk JRE to .java-buildpack/open_jdk_jre (0.8s) -----> Downloading Open JDK Like Memory Calculator 2.0.0_RELEASE from https://download.run.pivotal.io/memory-calculator/trusty/x86_64/memory-calculator-2.0.0_RELEASE.tar.gz (7.3s) Memory Settings: -Xms160M -XX:MetaspaceSize=64M -Xss853K -Xmx160M -XX:MaxMetaspaceSize=64M -----> Downloading Tomcat Instance 8.0.27 from https://download.run.pivotal.io/tomcat/tomcat-8.0.27.tar.gz (10.8s) Expanding Tomcat to .java-buildpack/tomcat (0.1s) -----> Downloading Tomcat Lifecycle Support 2.4.0_RELEASE from https://download.run.pivotal.io/tomcat-lifecycle-support/tomcat-lifecycle-support-2.4.0_RELEASE.jar (0.3s) -----> Downloading Tomcat Logging Support 2.4.0_RELEASE from https://download.run.pivotal.io/tomcat-logging-support/tomcat-logging-support-2.4.0_RELEASE.jar (0.2s) -----> Downloading Tomcat Access Logging Support 2.4.0_RELEASE from https://download.run.pivotal.io/tomcat-access-logging-support/tomcat-access-logging-support-2.4.0_RELEASE.jar (0.2s)

-----> Uploading droplet (52M)

1 of 1 instances running

App started

OK

App test was started using this command CALCULATED_MEMORY=$($PWD/.java-buildpack/open_jdk_jre/bin/java-buildpack-memory-calculator-2.0.0_RELEASE -memorySizes=metaspace:64m.. -memoryWeights=heap:75,metaspace:10,native:10,stack:5 -memoryInitials=heap:100%,metaspace:100% -totMemory=$MEMORY_LIMIT) && JAVA_HOME=$PWD/.java-buildpack/open_jdk_jre JAVA_OPTS="-Djava.io.tmpdir=$TMPDIR -XX:OnOutOfMemoryError=$PWD/.java-buildpack/open_jdk_jre/bin/killjava.sh $CALCULATED_MEMORY -Daccess.logging.enabled=false -Dhttp.port=$PORT" $PWD/.java-buildpack/tomcat/bin/catalina.sh run

Showing health and status for app test in org diego / space diego as admin... OK

requested state: started instances: 1/1 usage: 256M x 1 instances urls: test.bosh-lite.com last uploaded: Mon Oct 19 00:37:35 UTC 2015 stack: cflinuxfs2 buildpack: java-buildpack=v3.2-https://github.com/cloudfoundry/java-buildpack.git#3b68024 open-jdk-like-jre=1.8.0_60 open-jdk-like-memory-calculator=2.0.0_RELEASE tomcat-access-logging-support=2.4.0_RELEASE tomcat-instance=8.0.27 tomcat-lifecycle-support=2.4.0_RE...

 state     since                    cpu    memory           disk      details

0 running 2015-10-19 08:43:55 AM 0.0% 142.4M of 256M 0 of 1G

— Reply to this email directly or view it on GitHub https://github.com/cloudfoundry-incubator/diego-release/issues/86#issuecomment-149070457 .

emalm commented 9 years ago

Hi, @samdai,

Thanks for the additional output! As Amit suggests, it would be interesting to see if there are other errors from the same time period in the rep and garden logs (that would be the only other component on the cell that I would expect to yield useful information). Those errors would likely be found in the 'garden/garden.stdout.log' and 'rep/rep.stdout.log' log files. Timestamps from the cf logs output for your app also may make it easier to find something relevant in the rep and garden logs about the failure, and the veritas tool at https://github.com/pivotal-cf-experimental/veritas is also useful for filtering and pretty-printing the logs.

Also, what's the version of your BOSH-Lite vagrant box, and what is the value of the garden.mount_btrfs_loopback property in your Diego deployment manifest? The df warning message makes me suspect you might be seeing an unexpected interaction between the BOSH-Lite box and how garden-linux sets up storage for the container layers.

Thanks, Eric

samdai commented 9 years ago

Hi Amit: Answer for your questions: (1) I have tried pushing the app with the "-b java_buildpack" flag, go the same error:

Staging failed

FAILED
StagingError

(2)About 2 seconds to see between "Staging..." and "Staging failed", and 9 minutes to see between "Staging failed" and "FAILED"

(3)The result of cf logs test --recent immediately after it fails is as below:

2015-10-19T14:16:26.31+0800 [API/0]      OUT Created app with guid 655af96c-3f9b-494e-8654-1fc2d0cdb1cc
2015-10-19T14:16:26.40+0800 [API/0]      OUT Updated app with guid 655af96c-3f9b-494e-8654-1fc2d0cdb1cc ({"route"=>"b3ef9e45-e6a7-4210-ad1a-dc2d05939b8e"})
2015-10-19T14:16:38.24+0800 [API/0]      OUT Updated app with guid 655af96c-3f9b-494e-8654-1fc2d0cdb1cc ({"diego"=>true})
2015-10-19T14:16:42.05+0800 [API/0]      OUT Updated app with guid 655af96c-3f9b-494e-8654-1fc2d0cdb1cc ({"state"=>"STARTED"})
2015-10-19T14:16:42.07+0800 [STG/0]      OUT Creating container
2015-10-19T14:16:54.42+0800 [STG/0]      OUT Successfully created container
2015-10-19T14:16:54.42+0800 [STG/0]      OUT Downloading app package...
2015-10-19T14:16:54.50+0800 [STG/0]      OUT Downloaded app package (885.7K)
2015-10-19T14:16:54.50+0800 [STG/0]      OUT Downloading buildpacks (java_buildpack)...
2015-10-19T14:16:54.50+0800 [STG/0]      OUT Downloading java_buildpack...
2015-10-19T14:16:54.54+0800 [STG/0]      OUT Downloaded java_buildpack
2015-10-19T14:16:54.61+0800 [STG/0]      OUT Downloaded buildpacks
2015-10-19T14:16:54.61+0800 [STG/0]      OUT Staging...
2015-10-19T14:16:54.62+0800 [STG/0]      ERR Staging failed
2015-10-19T14:23:43.59+0800 [API/0]      ERR Failed to stage application: staging failed

(4)After execute bosh logs cell_z1 0, only found the following related message in 'rep/rep.stdout.log' log file:

{"timestamp":"1445235628.224967241","source":"rep","message":"rep.container-metrics-reporter.depot-client.get-all-metrics.infos.failed-bulk-info","log_level":2,"data":{"error":"json: cannot unmarshal string into Go value of type uint32","handles":["655af96c-3f9b-494e-8654-1fc2d0cdb1cc-430ecf0b8e5c44ce87baf9dd2ca93028"],"session":"4.1.1742.1"}}
{"timestamp":"1445235628.224999189","source":"rep","message":"rep.container-metrics-reporter.depot-client.get-all-metrics.failed-to-list-containers","log_level":2,"data":{"error":"json: cannot unmarshal string into Go value of type uint32","session":"4.1.1742"}}

And the version of my vagrant is 1.7.4. in the file manifest-generation/bosh-lite-stubs/property-overrides.yml, the value of property garden.mount_btrfs_loopback is as following:

  garden:
    listen_network: tcp
    listen_address: 0.0.0.0:7777
    mount_btrfs_loopback: false
    disk_quota_enabled: false
emalm commented 9 years ago

Hi, @samdai,

Thanks for providing the extra information, especially the logs from the rep. Which version of the garden-linux release do you have deployed? You should be able to see it listed when you run bosh deployments. I suspect that you have something later than version 0.307.0 deployed, as there was a breaking change to the Garden API introduced in 0.308.0 and later that would be causing those error messages. If that's the case, could you please upload version 0.307.0 as directed at http://bosh.io/releases/github.com/cloudfoundry-incubator/garden-linux-release?version=0.307.0, specify that version in the 'releases' section of your Diego deployment manifest, and redeploy Diego?

Best, Eric

samdai commented 9 years ago

Hi Eric, The version of my garden-linux release is 0.308.0, after I changed to version 0.307.0 , can start application successfully. Thank you very much for your help. also very thank wdxxs2z and Amit fo your help.

emalm commented 9 years ago

Hi, @samdai,

Great, glad to hear that the correct release resolved the problem! For each final release of Diego, we do publish the versions of the garden-linux and now etcd releases that are known to be compatible with it, but that can be difficult to find or to enforce in your deployment manifest. We'll be working on ways to make that information more discoverable and to automate uploading and deploying the correct combination of compatible releases.

Thanks again, Eric