department-of-veterans-affairs / notification-api

Notification API

BUG: Twistlock instance stuck at InProgress and failing status checks #1378

Closed cris-oddball closed 1 year ago

cris-oddball commented 1 year ago

Description

The twistlock instance is responding to ssm commands with InProgress and then failing to run the scan. The instance itself also eventually reports that the 'Instance reachability check failed'. Typically this has been resolved by stopping the instance, starting the instance, waiting for the 2 status checks to pass, then re-starting the Twistlock action.

This morning, the Twistlock instance was restarted twice after a failure last night, the scan was rerun twice, and the result is:

```
Waiter CommandExecuted failed: Max attempts exceeded. Previously accepted state: For expression "Status" we matched expected path: "InProgress"
Waiter CommandExecuted failed: Max attempts exceeded. Previously accepted state: For expression "Status" we matched expected path: "InProgress"
Waiter CommandExecuted failed: Max attempts exceeded. Previously accepted state: For expression "Status" we matched expected path: "InProgress"
Waiter CommandExecuted failed: Max attempts exceeded. Previously accepted state: For expression "Status" we matched expected path: "InProgress"
Waiter CommandExecuted failed: Max attempts exceeded. Previously accepted state: For expression "Status" we matched expected path: "InProgress"
"InProgress"
""
Error: Process completed with exit code 255.
```

Eventually the instance failed its reachability status check (screenshot: Screen Shot 2023-07-19 at 10.50.37 AM.jpg).

Steps to Reproduce

  1. Stop the utility-twistlock-scanner instance
  2. When fully stopped, start the utility-twistlock-scanner instance
  3. Wait for the instance to report Running and for the two status checks to pass (steps 1-3 are sketched as CLI commands after this list)
  4. Re-run this Twistlock job
  5. Observe that the action fails with `Waiter CommandExecuted failed`, never reaching the expected status and exiting with error code 255 after 5 attempts
  6. Wait some period of time and observe that the reachability status check fails
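
For convenience, steps 1-3 can be scripted with the AWS CLI. This is a sketch only; the instance ID below is a placeholder, not the real utility-twistlock-scanner ID.

```
# Placeholder instance ID for the utility-twistlock-scanner EC2.
INSTANCE_ID=i-0123456789abcdef0

# Stop the instance and wait until it is fully stopped.
aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"

# Start it again and wait until both status checks report OK.
aws ec2 start-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID"
```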

Workaround

Is there something we can do to work around this issue in the meantime? No - there is no workaround at present.

Impact/Urgency

High - Twistlock scans our app container for security vulnerabilities.

Expected Behavior

QA Considerations

Additional Info & Resources

There are CloudWatch logs for the Twistlock EC2 which should be reviewed to find any smoking guns. From the errors.log:

```
2023-07-19 13:41:15 ERROR [New @ errors.go.59] [ssm-agent-worker] [MessageService] [Association] [EngineProcessor] [OutOfProcExecuter] [43fb267c-f4d1-4f2d-959b-041a43de4339.2023-07-18T14-07-46.567Z] received timedout signal!
2023-07-19 13:41:15 ERROR [messaging @ master.go.131] [ssm-agent-worker] [MessageService] [Association] [EngineProcessor] [OutOfProcExecuter] [43fb267c-f4d1-4f2d-959b-041a43de4339.2023-07-18T14-07-46.567Z] messaging worker encountered error: document worker timed out
```

From the failed action on the Perform Twistlock scan, before it starts printing out the WaiterCommands:

```
Run INSTANCE_ID=$(aws ssm get-parameter --name /utility/twistlock/instance | jq '.Parameter.Value' -r)
INSTANCE_ID=$(aws ssm get-parameter --name /utility/twistlock/instance | jq '.Parameter.Value' -r)
COMMAND_ID=$(aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --instance-ids $INSTANCE_ID \
  --parameters '{"commands":[ "#!/bin/bash", 1>/dev/null", "exit $STATUS_CODE" ]}' \
  --output text --query "Command.CommandId")
n=0
until [ "$n" -ge 5 ]
do
  aws ssm wait command-executed --command-id $COMMAND_ID --instance-id $INSTANCE_ID && break
  n=$((n+1))
done
RESULT=$(aws ssm list-command-invocations \
  --command-id $COMMAND_ID \
  --details \
  --query "CommandInvocations[].CommandPlugins[].{Status:Status,StatusDetails:StatusDetails,ResponseCode:ResponseCode,Output:Output}")
echo `jq '.[].Status' <<< $RESULT`
printf '%b\n' "$(jq '.[].Output' <<< $RESULT)"
exit `jq '.[].ResponseCode' <<< $RESULT`
shell: /usr/bin/bash --noprofile --norc -e -o pipefail {0}
env:
```

[AWS Understanding Command Statuses](https://docs.aws.amazon.com/systems-manager/latest/userguide/monitor-commands.html)

[AWS Send Command output](https://docs.aws.amazon.com/cli/latest/reference/ssm/send-command.html#output) - info on timeouts - we never reach a timeout but might be useful

[AWS Troubleshooting SSM Agent](https://docs.aws.amazon.com/systems-manager/latest/userguide/troubleshooting-ssm-agent.html)
ldraney commented 1 year ago

From Twistlock system log:

image.png

cris-oddball commented 1 year ago

Yesterday, Lucas terminated the instance and re-deployed (since this EC2 uses Terraform). The instance went into a failed status check today after the GitHub Action successfully ran. Restarting the instance - storage is 36 GiB, so I will look for any logs that can be removed if able to connect.

cris-oddball commented 1 year ago

Questions:

Now that you have an idea of the new scanning alternatives available for Amazon ECR, let’s address the questions of coverage and costs. At the moment, Amazon ECR provides CVE scanning for OS packages for most common Linux distributions, including Debian, Ubuntu, and Amazon Linux. Please refer to the supported OS and programming languages documentation for an up-to-date listing.

Basic scanning is available with no additional charge to Amazon ECR customers. Each container image may be scanned once per 24 hours with basic scanning. This limit includes the initial scan on push, if enabled, and any manual scans.

Enhanced scanning is provided and billed through Amazon Inspector. All accounts new to Amazon Inspector are eligible for a 15-day trial to evaluate the service and estimate its cost. During the trial, all eligible container images pushed to Amazon ECR are continually scanned at no cost.

Pricing for enhanced scanning has two tiers. The cost for container images scanned initially on-push to Amazon ECR is $0.09 per image scan. Each re-scan for a container image in Amazon ECR configured for continuous scanning is $0.01 per image re-scan. Whenever Amazon Inspector adds a new CVE to its database, all eligible container images in your configured Amazon ECR repositories are automatically re-scanned.

cc @ldraney @mjones-oddball @k-macmillan

cris-oddball commented 1 year ago

Twistlock instance came up with 2/2 checks passed.

cris-oddball commented 1 year ago

Twistlock scan completed on a deploy to Perf and status checks are still 2/2. Stable for the moment, but we need a plan here.

cris-oddball commented 1 year ago

TL fell over again on the next scan. Restarting to find logs and start removing them.

cris-oddball commented 1 year ago

Some logs in here are candidates for deletion, specifically the messages one:

```
-rw-------. 1 root root 5556423 Jul 20 18:22 messages
```

Tailing this log yielded the same info that is in CloudWatch.

cris-oddball commented 1 year ago

from TL error logs

```
2023-07-20 18:39:09 ERROR [func1 @ datachannel.go.491] [ssm-session-worker] [OITECHSTODDC0-0da6ef9472ac3ee51] [DataBackend] [pluginName=Standard_Stream] Unable to send stream data message: Can't send message: Connection is closed.
```

cris-oddball commented 1 year ago

@ldraney with @k-macmillan's green light, I deleted the /var/log/messages file.

We need to run a cron job to delete that file once a week until we come up with a long-term solution.

Twistlock scanned a few times without falling over, so we are good until this log fills up again.

cris-oddball commented 1 year ago

On two Twistlock scans today, the server failed a status check (the scans were also run at the same time, in different environments).

Stopping and restarting to see how big the messages log is and to re-run those scans.

cris-oddball commented 1 year ago

Messages grew overnight:

```
-rw-------.  1 root   root            4458611 Jul 21 16:45 messages
```

cris-oddball commented 1 year ago

Documentation on AWS CodeScan and Amazon Inspector

cc @k-macmillan @mjones-oddball @ldraney

cris-oddball commented 1 year ago

Twistlock has fallen over again, restarting and deleting the messages file when it comes back up.

cris-oddball commented 1 year ago

```
-rw-------. 1 root root 6060854 Jul 21 20:29 messages
```

k-macmillan commented 1 year ago

That's only 5-6mb. I wouldn't expect issues? Is there anything interesting in there?

cris-oddball commented 1 year ago

@k-macmillan the output is all piped to CloudWatch, and there are some interesting things in there. This is a guess, but whatever is writing to that file may be struggling to open it to append another line. That is not that big of a file as far as storage is concerned, but memory might be limiting the write operation, which is why most log files are configured to roll over to a new file at a certain size (say, 1 MB).
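
For illustration, a size-based rollover like the one described above could be done with logrotate. This is a hypothetical drop-in, not the configuration on this instance; the path and thresholds are assumptions.

```
# Hypothetical logrotate drop-in that rotates /var/log/messages by size instead of
# letting it grow unbounded; size and retention values are assumptions.
sudo tee /etc/logrotate.d/messages-by-size > /dev/null <<'EOF'
/var/log/messages {
    size 50M
    rotate 4
    compress
    missingok
    notifempty
}
EOF
```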

cris-oddball commented 1 year ago

Twistlock failed 1/2 status checks over the weekend. Restarting.

ldraney commented 1 year ago

Just waiting for Twistlock to restart: image.png

ldraney commented 1 year ago

Cron job set up to delete twistlock logs every 24 hours:

sudo crontab -e

image.png
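
The actual entry is only visible in the screenshot above; as a sketch, a daily cleanup entry could look like the following (the target path and schedule are assumptions based on this thread).

```
# Hypothetical root crontab entry: empty /var/log/messages every day at midnight.
# Added via `sudo crontab -e`; truncating (rather than deleting) keeps the file handle valid.
0 0 * * * truncate -s 0 /var/log/messages
```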

ldraney commented 1 year ago

I tested running Twistlock and resource utilization was non-critical. Here is a video of running Twistlock with top open, using this script:

```
#!/bin/bash

# Registry, repository, and tag of the image to scan.
ECR_REGISTRY="registry"
IMAGE_REPOSITORY="repo"
IMAGE_TAG="tag"

# Log in to ECR, pull the image to scan, then log out.
aws --region us-gov-west-1 ecr get-login-password | docker login --username AWS --password-stdin "$ECR_REGISTRY" 2>/dev/null
docker pull "$ECR_REGISTRY"/"$IMAGE_REPOSITORY":"$IMAGE_TAG" 1>/dev/null
docker logout "$ECR_REGISTRY"

# Fetch the Twistlock credential from SSM and run the scan.
export TWISTLOCK_PASSWORD=$(aws --region us-gov-west-1 ssm get-parameter --name /utility/twistlock/vanotify-ci-user-password --with-decryption | jq '.Parameter.Value' -r)
twistcli images scan --project vanotify --address https://twistlock.devops.va.gov --user vanotify-ci-user "$ECR_REGISTRY"/"$IMAGE_REPOSITORY":"$IMAGE_TAG"
STATUS_CODE=$?

# Clean up local images and propagate the scan's exit status.
docker image prune -a -f 1>/dev/null
exit $STATUS_CODE
```

So it doesn't seem necessary to upgrade the RAM, at least.

ldraney commented 1 year ago

Twistlock running successfully: image.png

ldraney commented 1 year ago

I'm trying to connect to the instance but the screen is now blank: image.png

ldraney commented 1 year ago

I'm stopping the instance so I can restart it, get in, and run:

df -h

to see how much disk space we are utilizing.

ldraney commented 1 year ago

df -h gives us: image.png

And I tested Twistlock again and saw that free memory drops to 49 MB out of 2000 MB (so we are at roughly 98% memory utilization). Increasing the EC2 size will very likely make the difference at this point.

cris-oddball commented 1 year ago

Twistlock System Requirements (geared toward the main server that twistcli connects to; it does not specifically speak to how an instance running twistcli should be sized)

Twistlock documentation on twistcli

ldraney commented 1 year ago

I'm increasing to t3.medium: image.png

ldraney commented 1 year ago

Also, since t3 is a burstable instance type that uses credits, it may be that we simply ran out of credits, which would severely limit the twistlock performance; this would also explain why it often works.
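
If credit exhaustion is a suspect, the CPUCreditBalance CloudWatch metric can confirm it. This is a sketch; the instance ID is a placeholder and the query assumes GNU date for the -d option.

```
# Check the CPU credit balance for the Twistlock instance over the last 6 hours.
INSTANCE_ID=i-0123456789abcdef0   # placeholder, not the real instance ID
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value="$INSTANCE_ID" \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```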

ldraney commented 1 year ago

Latest video shows t3.medium at 75% usage

ldraney commented 1 year ago

So the cause is likely an EC2 that is too small for Twistlock:

I gathered the following data in the t3.medium:

  1. Not Under Load:

    • Total Memory: 3977792 KiB
    • Used Memory: 958796 KiB
    • Total Used Memory = Used Memory + Buffers/Cache = 958796 KiB + 789184 KiB = 1747980 KiB
    • Percentage of memory used = (Total Used Memory / Total Memory) × 100 = (1747980 / 3977792) × 100 ≈ 43.94%
  2. Under Load / running Twistlock:

    • Total Memory: 3977792 KiB
    • Used Memory: 1134212 KiB
    • Total Used Memory = Used Memory + Buffers/Cache = 1134212 KiB + 1991764 KiB = 3125976 KiB
    • Percentage of memory used = (Total Used Memory / Total Memory) × 100 = (3125976 / 3977792) × 100 ≈ 78.59%

Therefore, the memory utilization when not under load is roughly 43.94%, and under load, it is approximately 78.59%.
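
For reproducibility, these percentages can be approximated straight from free(1), treating "Used + Buffers/Cache" as Total minus Free, as in the figures above. A minimal sketch:

```
# Approximate "Used + Buffers/Cache" as Total - Free and report it as a percentage.
free -k | awk '/^Mem:/ { printf "memory in use: %.2f%%\n", ($2 - $4) / $2 * 100 }'
```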

Additional Test - Running Twistlock Twice Simultaneously

The test of running Twistlock twice simultaneously shows that memory utilization spikes to approximately 97.37%.

Please, let me know if you need further information or testing.

ldraney commented 1 year ago

Re-deployed the EC2 as a t3.medium and redid the cron job: image.png Opening the PR now.

ldraney commented 1 year ago

I'm not sure what to put on this follow-on spike ticket for twistlock. Cris started a doc and I added to the end. It sounds like we'll finish that doc and send it to Mel and Bev. I’m waiting to confirm this course of action with Cris and whoever before closing this ticket. Meanwhile, I think the best course of action is to leave Twistlock until our better solution is approved and available.

mjones-oddball commented 1 year ago

@ldraney Can you make sure @k-macmillan approves before any plans go to me or Bev to approve? Is it a cost thing that we'll need to approve?

ldraney commented 1 year ago

From Kyle:

Lucas, compare the cost of running a t3.medium 24/7/365 vs the advanced scan costs. Try and grab how many scans we do per week and multiply it by 52 (a week will be more accurate than a day). TBH engineer costs to maintain Twistlock are likely greater than any cost for scanning.
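
A back-of-envelope sketch of that comparison, with placeholder values for the hourly rate and weekly scan count; only the $0.09 per-scan price comes from the pricing notes earlier in this thread.

```
# Hypothetical yearly cost comparison; hourly rate and scan count are assumptions.
T3_MEDIUM_HOURLY=0.0416      # assumed on-demand $/hr; check current GovCloud pricing
SCANS_PER_WEEK=20            # assumed; replace with the real count from CI history
SCAN_PRICE=0.09              # Inspector enhanced scanning, per initial image scan on push

ec2_yearly=$(echo "$T3_MEDIUM_HOURLY * 24 * 365" | bc -l)
scan_yearly=$(echo "$SCANS_PER_WEEK * 52 * $SCAN_PRICE" | bc -l)
printf 'EC2 t3.medium 24/7/365: $%.2f/yr\nEnhanced scans:         $%.2f/yr\n' "$ec2_yearly" "$scan_yearly"
```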

cris-oddball commented 1 year ago

Documentation for replacing Twistlock

cris-oddball commented 1 year ago

@ldraney to add costs to document and create new ticket, then we can close this one out.

cris-oddball commented 1 year ago

Twistlock failed 1/2 status checks again today and failed to run a scan. Stopping, restarting, and logging in to check things.

cris-oddball commented 1 year ago

Restarted; it came back up, I re-tried the scan, and that worked.

cris-oddball commented 1 year ago

From Lucas

Well, in the last fifteen minutes we got 1k errors saying the VAEC network is unreachable. Twistlock is set up on the AMI, and VAEC may have set up their own server. The way Twistlock works is it collects information about the packages and binaries in the container image and sends this data to the Prisma Cloud Console for analysis. I'm guessing VAEC set up their own server for security reasons. So, Twistlock may be failing due to reasons outside our control. Another reason to move away from it, if I'm right.

Logs are now in DD; some patterns indicate that the failure is on a heartbeat to Logstash.

```
Jul 27 HH:mm:ss ip-<redacted>: yyyy-MM-ddTHH:mm:ss.SSSXXX#011INFO#011[publisher_pipeline_output]#011pipeline/output.go:145#011Attempting to reconnect to backoff(async(tcp://aws-logstash-west.<redacted>)) with [1-210] reconnect attempt(s)

Jul 27 HH:mm:ss ip-<redacted>: yyyy-MM-ddTHH:mm:ss.SSSXXX#011ERROR#011[publisher_pipeline_output]#011pipeline/output.go:154#011Failed to connect to backoff(async(tcp://aws-logstash-west.<redacted>)): lookup aws-logstash-west.ecs.vaec.va.gov on 10.247.142.130:53: dial udp 10.247.142.130:53: connect: network is unreachable

Jul 27 ip-<redacted> kernel: audit: audit_backlog=8193 > audit_backlog_limit=8192
```

We are reaching some limit on auditing.

cris-oddball commented 1 year ago

Info on audit logs https://docs.paloaltonetworks.com/prisma/prisma-cloud/prisma-cloud-admin-compute/audit/logging

Unable to connect to session manager, restarting.

cris-oddball commented 1 year ago

Lucas quadrupled the audit_backlog_limit.
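
The exact change isn't captured here; a sketch of how quadrupling the limit from 8192 to 32768 might look (the value and the rules-file path are assumptions about this instance).

```
# Raise the audit backlog limit for the running kernel (takes effect immediately, not persistent).
sudo auditctl -b 32768

# Persist across reboots by appending to the audit rules file; the existing -b line
# could instead be edited in place (path shown is the common default).
echo '-b 32768' | sudo tee -a /etc/audit/rules.d/audit.rules
```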

cris-oddball commented 1 year ago

Twistlock fell over again today.

cris-oddball commented 1 year ago

ROI document for switching to AWS Inspector

cris-oddball commented 1 year ago

Story created for implementing AWS Inspector

cris-oddball commented 1 year ago

Next steps: