cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0
201 stars 212 forks source link

[Executor] Send error to app logs if starting a container fails #892

Closed klapkov closed 6 months ago

klapkov commented 8 months ago

Send error to app logs if starting a container fails

Summary

Recently we noticed that when a container fails to start because it exceeds the env var size limit, the error is not logged in the app logs.

Staging app and tracing logs...                                                                                                                                         
   Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c creating container for instance cbe0160f-28b3-47b7-8982-8f123d6515c7                                                       
   Security group rules were updated                                                                                                                                    
   Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c successfully created container for instance cbe0160f-28b3-47b7-8982-8f123d6515c7                                           
StagingError - Staging error: staging failed                                                                                                                            
FAILED                                                                                                                                                                  
[root@iacbox-v659]:/users/i583051/landscape-lod-aws-0103/tools/overloader/mentos-cola$ cf logs --recent mentos-cola1                                                    
Retrieving logs for app mentos-cola1 in org batman / space ss as provisioned_user_cf_admin...                                                                           

   2024-01-16T22:49:02.49+0000 [API/0] OUT Added process: "web"                                                                                                         
   2024-01-16T22:49:02.51+0000 [API/0] OUT Created app with guid fae58650-79e9-4c5a-9ccc-f9dfbe5cbcdd                                                                   
   2024-01-16T22:49:02.56+0000 [API/0] OUT Applied manifest to app with guid fae58650-79e9-4c5a-9ccc-f9dfbe5cbcdd (---                                                  
   2024-01-16T22:49:02.56+0000 [API/0] OUT applications:                                                                                                                
   2024-01-16T22:49:02.57+0000 [API/0] OUT - name: mentos-cola1                                                                                                         
   2024-01-16T22:49:02.57+0000 [API/0] OUT disk-quota: 4G                                                                                                               
   2024-01-16T22:49:02.57+0000 [API/0] OUT memory: 500M                                                                                                                 
   2024-01-16T22:49:02.57+0000 [API/0] OUT random-route: true                                                                                                           
   2024-01-16T22:49:02.57+0000 [API/0] OUT buildpack: https://github.com/cloudfoundry/go-buildpack.git                                                                  
   2024-01-16T22:49:02.57+0000 [API/0] OUT env: "[PRIVATE DATA HIDDEN]"                                                                                                 
   2024-01-16T22:49:02.57+0000 [API/0] OUT )                                                                                                                            
   2024-01-16T22:49:03.55+0000 [API/0] OUT Scaling process: "web"                                                                                                       
   2024-01-16T22:49:03.60+0000 [API/0] OUT Updated app with guid fae58650-79e9-4c5a-9ccc-f9dfbe5cbcdd ({"lifecycle"=>{"data"=>{"buildpacks"=>["https://github.com/cloudf
oundry/go-buildpack.git"]}}})                                                                                                                                           
   2024-01-16T22:49:03.75+0000 [API/0] OUT Updated app with guid fae58650-79e9-4c5a-9ccc-f9dfbe5cbcdd ({"environment_variables"=>"[PRIVATE DATA HIDDEN]"})              
   2024-01-16T22:49:05.92+0000 [API/0] OUT Uploading app package for app with guid fae58650-79e9-4c5a-9ccc-f9dfbe5cbcdd                                                 
   2024-01-16T22:49:12.22+0000 [API/0] OUT Creating build for app with guid fae58650-79e9-4c5a-9ccc-f9dfbe5cbcdd                                                        
   2024-01-16T22:49:12.70+0000 [STG/0] OUT Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c creating container for instance cbe0160f-28b3-47b7-8982-8f123d6515c7               
   2024-01-16T22:49:13.20+0000 [STG/0] OUT Security group rules were updated                                                                                            
   2024-01-16T22:49:13.24+0000 [STG/0] OUT Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c successfully created container for instance cbe0160f-28b3-47b7-8982-8f123d6515c7   
   2024-01-16T22:49:13.60+0000 [STG/0] OUT Downloading app package...                                                                                                   
   2024-01-16T22:49:13.66+0000 [STG/0] OUT Downloaded app package (1.1K)                                                                                                
   2024-01-16T22:49:14.39+0000 [STG/0] OUT Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c stopping instance cbe0160f-28b3-47b7-8982-8f123d6515c7                             
   2024-01-16T22:49:14.39+0000 [STG/0] OUT Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c destroying container for instance cbe0160f-28b3-47b7-8982-8f123d6515c7             
   2024-01-16T22:49:14.45+0000 [API/0] ERR Failed to stage build: staging failed                                                                                        
   2024-01-16T22:49:14.95+0000 [STG/0] OUT Cell fcb7c7db-8a2c-4216-95d3-a7c6500c8e9c successfully destroyed container for instance cbe0160f-28b3-47b7-8982-8f123d6515c7

The only error is ERR Failed to stage build: staging failed. This is not ideal, because you cannot know what went wrong until you go digging trough the rep and garden logs for the container. So having a more informative error sent to the app logs would help.

Diego repo

https://github.com/cloudfoundry/executor

Describe alternatives you've considered (optional)

PR

winkingturtle-vmw commented 8 months ago

@klapkov Thanks for this proposal. I've deployed your changes, but not seeing the error show up in cf logs. Are you seeing an error show up? Can you please shared your reproduction steps?

PlamenDoychev commented 7 months ago

Dear Colleagues,

The exact variable size is 131072 (bytes). It is a kernel limitation: [https://github.com/torvalds/linux/blob/18d46e76d7c2eedd8577fae67e3f1d4db25018b0/include/uapi/linux/binfmts.h#L9]

If we exceed this limitation a non user friendly message appears.

Regards, Plamen Doychev

klapkov commented 7 months ago

Hello @winkingturtle-vmw, sorry for the late response. What I did is take one sample app and I generated a large file, larger than the limitation. I attached it to the app as an env variable and did cf push.

winkingturtle-vmw commented 7 months ago

That's what I did too, but didn't see any errors. Can you please provide detailed step-by-step instruction with output snippet for what to expect instead.

Additionally, my understanding from @PlamenDoychev's comment is that this change will not make any difference regardless. Please correct me if that's wrong.

klapkov commented 7 months ago

Hello @winkingturtle-vmw,

Again, sorry for the delayed response. Maybe there are some differences between our setups. If the SuppressLogOutput property is set to true, the log will not be sent. From what I understand, this is being set in the RunStep in the cloud_controller, but I am not sure when it is set to true and to false.

Here is how I reproduce the issue.

  1. Take a sample application.
  2. Add env variable in the manifest - I generated a very large string that is larger than 131kb and directly added it to the manifest.
  3. Execute cf push
  4. The staging task is scheduled and eventually it fails with StagingError - Staging error: staging failed
  5. Executing cf logs --recent gives me this:
Retrieving logs for app mentos-cola in org batman / space ss as provisioned_user_cf_admin...

   2024-02-27T22:14:53.95+0000 [API/0] OUT Added process: "web"
   2024-02-27T22:14:53.97+0000 [API/0] OUT Created app with guid cf07fe68-288a-4f00-8e01-06a3ffdf76d5
   2024-02-27T22:14:53.99+0000 [API/0] OUT Applied manifest to app with guid cf07fe68-288a-4f00-8e01-06a3ffdf76d5 (---
   2024-02-27T22:14:53.99+0000 [API/0] OUT applications:
   2024-02-27T22:14:53.99+0000 [API/0] OUT - name: mentos-cola
   2024-02-27T22:14:53.99+0000 [API/0] OUT disk-quota: 4G
   2024-02-27T22:14:53.99+0000 [API/0] OUT memory: 500M
   2024-02-27T22:14:53.99+0000 [API/0] OUT random-route: true
   2024-02-27T22:14:53.99+0000 [API/0] OUT buildpack: https://github.com/cloudfoundry/go-buildpack.git
   2024-02-27T22:14:53.99+0000 [API/0] OUT env: "[PRIVATE DATA HIDDEN]"
   2024-02-27T22:14:53.99+0000 [API/0] OUT )
   2024-02-27T22:14:54.50+0000 [API/0] OUT Scaling process: "web"
   2024-02-27T22:14:54.55+0000 [API/0] OUT Updated app with guid cf07fe68-288a-4f00-8e01-06a3ffdf76d5 ({"lifecycle"=>{"data"=>{"buildpacks"=>["https://github.com/cloudfoundry/go-buildpack.git"]}}})
   2024-02-27T22:14:54.67+0000 [API/0] OUT Updated app with guid cf07fe68-288a-4f00-8e01-06a3ffdf76d5 ({"environment_variables"=>"[PRIVATE DATA HIDDEN]"})
   2024-02-27T22:14:57.55+0000 [API/0] OUT Uploading app package for app with guid cf07fe68-288a-4f00-8e01-06a3ffdf76d5
   2024-02-27T22:15:03.74+0000 [API/2] OUT Creating build for app with guid cf07fe68-288a-4f00-8e01-06a3ffdf76d5
   2024-02-27T22:15:04.27+0000 [STG/0] OUT Cell bd020a9f-d12b-47bd-ac25-0a3afdc63eed creating container for instance 75baa4d9-c702-4a80-bbe2-165306f5ed8e
   2024-02-27T22:15:04.54+0000 [STG/0] OUT Cell bd020a9f-d12b-47bd-ac25-0a3afdc63eed successfully created container for instance 75baa4d9-c702-4a80-bbe2-165306f5ed8e
   2024-02-27T22:15:05.14+0000 [STG/0] OUT Downloading app package...
   2024-02-27T22:15:05.19+0000 [STG/0] OUT Downloaded app package (118.1K)
   2024-02-27T22:15:05.27+0000 [STG/0] OUT failed-creating-process: runc exec: exit status 255: exec failed: unable to start container process: exec /tmp/lifecycle/builder: argument list too long
   2024-02-27T22:15:05.87+0000 [STG/0] OUT Cell bd020a9f-d12b-47bd-ac25-0a3afdc63eed stopping instance 75baa4d9-c702-4a80-bbe2-165306f5ed8e
   2024-02-27T22:15:05.87+0000 [STG/0] OUT Cell bd020a9f-d12b-47bd-ac25-0a3afdc63eed destroying container for instance 75baa4d9-c702-4a80-bbe2-165306f5ed8e
   2024-02-27T22:15:05.94+0000 [API/2] ERR Failed to stage build: staging failed
   2024-02-27T22:15:06.42+0000 [STG/0] OUT Cell bd020a9f-d12b-47bd-ac25-0a3afdc63eed successfully destroyed container for instance 75baa4d9-c702-4a80-bbe2-165306f5ed8e

Maybe you can try removing the the check for the step.model.SuppressLogOutput and aways send logs. Maybe it is worth discussing if it should be like this anyway.

Regarding Plamen's comment, yes this will not make a difference.

PlamenDoychev commented 7 months ago

Hello @winkingturtle-vmw,

Do you have any feedback on this? Did you manage to reproduce it?

winkingturtle-vmw commented 7 months ago

@PlamenDoychev @klapkov This is the command head -c 45000 <(base64 /dev/urandom ) I am running for generating a large text. I've noticed that when I bump the number from 45000 to 50000, the application is not getting pushed. I am not getting a staging error, this is failing for me at CF API/cloud_controller level.

My follow up questions:

klapkov commented 7 months ago

Hello @winkingturtle-vmw,

I generate a large string with this script:

import random
import string

# Define the desired size in bytes (1 KB = 1024 bytes)
desired_size_kb = 140
desired_size_bytes = desired_size_kb * 1024

# Generate random letters to form the string
random_string = ''.join(random.choice(string.ascii_letters) for _ in range(desired_size_bytes))

# Print the length of the string to verify its size
print("Length of the random string (in bytes):", len(random_string))

# Write the random string to a file
with open('random_string.txt', 'w') as file:
    file.write(random_string)

print("Random string saved to 'random_string.txt'.")

Regarding the real life reason for this, we have received quite a lot of cases where someone exceeds this limit. I do not remember what exactly were those env variables. Debugging this was harder, since there was no app log that indicated that the limit was exceeded. So you would need to dive in the logs of the hosting cell and find the error from the rep.

I wonder if step.model.SuppressLogOutput is set to true on your side. As I mentioned, this is set by the cloud controller and I am not sure when is it set to true and when to false. On our side, seems lime it is aways set to false. This is the only potential thing I see that could cause the error to be hidden.

PlamenDoychev commented 7 months ago

Hi @winkingturtle-vmw,

The use case for this improvement is the following. On our platform we have applications which consumes ~30 services provided by the platform as well. Last year we had an initiative to secure the communication between the provided applications and services using best option possible. Some services implemented mTLS, some regular credential rotation with "blue-green" credentials, etc., hence a lot of x509 certificates and other credentials appeared in the VCAP_SERVICES env var as service credentials are injected there. The problem is that sometimes we receive requests from customers saying that their apps stop working (not being able to stage) out of the sudden without any changes into the code. While debugging such requests we noticed that actually some sort of cred rotation has happened or a new certificate has appeared which lead to exhaustion of the VCAP_SERVICE env var limit (which is by the way documented here: https://docs.cloudfoundry.org/devguide/deploy-apps/environment-variable.html#VCAP-SERVICES). We have an idea and already working on RFC for long term solution for this problem, but until we trigger a discussion about the future we would like at lest to make the app developers live easier by providing them a meaningful error message.

winkingturtle-vmw commented 7 months ago

@PlamenDoychev thanks for explanation of how this issue has come about. It makes more sense now why we are talking about this issue.

As for reproduction, since we seems to have different setup for our load-balancers and CF, I decided to shortcut CF and instead write a diego task to see how this will fix the issue.

I used @klapkov's script to generate a long string (140Kb) and created the following using cfdot create-task @task.json

{
  "task_guid": "sleeping-beauty",
  "domain": "my-tasks",
  "rootfs": "preloaded:cflinuxfs4",
  "action": {
    "run": {
      "path": "/bin/bash",
      "args": [
        "-c",
        "echo sleeping-beauty.starting; sleep 60; echo sleeping-beauty.finished | tee /tmp/result"
      ],
      "user": "vcap",
      "suppress_log_output": false
    }
  },
  "env": [
    {
      "name": "MY_VAR",
      "value": "<my long string>"
    }
  ],
  "disk_mb": 128,
  "memory_mb": 1024,
  "cpu_weight": 50,
  "privileged": false,
  "log_source": "sleeping-beauty",
  "log_guid": "abc-123",
  "result_file": "/tmp/result",
  "legacy_download_user": "vcap"
}

I then installed log-cache-cli and ran cf tail "abc-123" -f. I am still not seeing failed-creating-process when I fetch the logs. Can you please modify or change the above task.json into something that will produce your expected error message.

klapkov commented 6 months ago

Hello @winkingturtle-vmw

I tried running the same task as you on our side, but something in garden and silk brakes and I do not even reach the point where we start the container. I have attached a staging task that resulted in the error. I hope you can see something useful there.

winkingturtle-vmw commented 6 months ago

@klapkov I went ahead and modified a staging task with the same long environment variable instead and seeing the argument list too long error regardless of this change. Are you seeing the same thing?

without-executor-changes
klapkov commented 6 months ago

@winkingturtle-vmw Here are the logs I see:

Retrieving logs for app mentos-cola in org batman / space ss as provisioned_user_cf_admin...

   2024-03-11T15:25:10.36+0000 [API/0] OUT Applied manifest to app with guid 8723d6f7-bb77-41f9-b5e6-59a95c7ca3e3 (---
   2024-03-11T15:25:10.36+0000 [API/0] OUT applications:
   2024-03-11T15:25:10.37+0000 [API/0] OUT - name: mentos-cola
   2024-03-11T15:25:10.37+0000 [API/0] OUT disk-quota: 4G
   2024-03-11T15:25:10.37+0000 [API/0] OUT memory: 500M
   2024-03-11T15:25:10.37+0000 [API/0] OUT random-route: true
   2024-03-11T15:25:10.37+0000 [API/0] OUT buildpack: https://github.com/cloudfoundry/go-buildpack.git
   2024-03-11T15:25:10.37+0000 [API/0] OUT env: "[PRIVATE DATA HIDDEN]"
   2024-03-11T15:25:10.37+0000 [API/0] OUT )
   2024-03-11T15:25:10.84+0000 [API/0] OUT Scaling process: "web"
   2024-03-11T15:25:10.90+0000 [API/0] OUT Updated app with guid 8723d6f7-bb77-41f9-b5e6-59a95c7ca3e3 ({"lifecycle"=>{"data"=>{"buildpacks"=>["https://github.com/cloudfoundry/go-buildpack.git"]}}})
   2024-03-11T15:25:10.94+0000 [API/0] OUT Updated app with guid 8723d6f7-bb77-41f9-b5e6-59a95c7ca3e3 ({"environment_variables"=>"[PRIVATE DATA HIDDEN]"})
   2024-03-11T15:25:15.03+0000 [API/0] OUT Uploading app package for app with guid 8723d6f7-bb77-41f9-b5e6-59a95c7ca3e3
   2024-03-11T15:25:18.21+0000 [API/2] OUT Creating build for app with guid 8723d6f7-bb77-41f9-b5e6-59a95c7ca3e3
   2024-03-11T15:25:18.64+0000 [STG/0] OUT Cell 959b73ca-8746-4670-9446-3ecf518f884c creating container for instance b2256166-4635-487f-8777-b461c54a6678
   2024-03-11T15:25:19.25+0000 [STG/0] OUT Cell 959b73ca-8746-4670-9446-3ecf518f884c successfully created container for instance b2256166-4635-487f-8777-b461c54a6678
   2024-03-11T15:25:19.61+0000 [STG/0] OUT Downloading app package...
   2024-03-11T15:25:19.61+0000 [STG/0] OUT Downloading build artifacts cache...
   2024-03-11T15:25:19.68+0000 [STG/0] OUT Downloaded app package (1M)
   2024-03-11T15:25:22.78+0000 [STG/0] OUT Downloaded build artifacts cache (98.9M)
   2024-03-11T15:25:22.85+0000 [STG/0] OUT failed-creating-process: runc exec: exit status 255: exec failed: unable to start container process: exec /tmp/lifecycle/builder: argument list too long
   2024-03-11T15:25:23.23+0000 [STG/0] OUT Cell 959b73ca-8746-4670-9446-3ecf518f884c stopping instance b2256166-4635-487f-8777-b461c54a6678
   2024-03-11T15:25:23.23+0000 [STG/0] OUT Cell 959b73ca-8746-4670-9446-3ecf518f884c destroying container for instance b2256166-4635-487f-8777-b461c54a6678
   2024-03-11T15:25:23.30+0000 [API/0] ERR Failed to stage build: staging failed
   2024-03-11T15:25:24.21+0000 [STG/0] OUT Cell 959b73ca-8746-4670-9446-3ecf518f884c successfully destroyed container for instance b2256166-4635-487f-8777-b461c54a6678
winkingturtle-vmw commented 6 months ago

Merged executor PR. It will be released in the next version of diego-release.

Screenshot 2024-03-11 at 2 45 44 PM