Closed by cris-oddball 1 year ago
From Twistlock system log:
Yesterday, Lucas terminated the instance and re-deployed (since this EC2 is managed with Terraform). The instance went into a failed status check today after the GitHub Action ran successfully. Restarting the instance - storage is 36 GiB, so I will look for any logs that can be removed if I'm able to connect.
Questions:
Coverage and pricing
Now that you have an idea of the new scanning alternatives available for Amazon ECR, let’s address the questions of coverage and costs. At the moment, Amazon ECR provides CVE scanning for OS packages for most common Linux distributions, including Debian, Ubuntu, and Amazon Linux. Please refer to the supported OS and programming languages documentation for an up-to-date listing.
Basic scanning is available with no additional charge to Amazon ECR customers. Each container image may be scanned once per 24 hours with basic scanning. This limit includes the initial scan on push, if enabled, and any manual scans.
Enhanced scanning is provided and billed through Amazon Inspector. All accounts new to Amazon Inspector are eligible for a 15-day trial to evaluate the service and estimate its cost. During the trial, all eligible container images pushed to Amazon ECR are continually scanned at no cost.
Pricing for enhanced scanning has two tiers. The cost for container images scanned initially on-push to Amazon ECR is $0.09 per image scan. Each re-scan of a container image in Amazon ECR configured for continuous scanning is $0.01 per image re-scan. Whenever Amazon Inspector adds a new CVE to its database, all eligible container images in your configured Amazon ECR repositories are automatically re-scanned.
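As an illustrative example (not our actual scan volume): an image scanned once on push and re-scanned ten times would cost $0.09 + 10 × $0.01 = $0.19 under this model.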
cc @ldraney @mjones-oddball @k-macmillan
Twistlock instance came up with 2/2 checks passed.
Twistlock scan completed on a deploy to Perf and status checks are still 2/2. Stable for the moment, but we need a plan here.
TL fell over again on the next scan. Restarting to find logs and start removing them.
There are some logs in here that are candidates for deletion, specifically the messages one:
-rw-------. 1 root root 5556423 Jul 20 18:22 messages
Tailing this log yielded the same info that is in CloudWatch.
From the TL error logs:
2023-07-20 18:39:09 ERROR [func1 @ datachannel.go.491] [ssm-session-worker] [OITECHSTODDC0-0da6ef9472ac3ee51] [DataBackend] [pluginName=Standard_Stream] Unable to send stream data message: Can't send message: Connection is closed.
@ldraney, with @k-macmillan's green light, I deleted the /var/log/messages file.
We need to run a cron job to delete that file once a week until we come up with a long-term solution.
Twistlock scanned a few times without falling over, so we are good until this log fills up again.
On two Twistlock scans today, the server failed a status check (the scans were also run at the same time, in different environments).
Stopping and restarting to see how big the messages log is and to re-run those scans.
Messages grew overnight:
-rw-------. 1 root root 4458611 Jul 21 16:45 messages
Documentation on AWS CodeScan and Amazon Inspector
cc @k-macmillan @mjones-oddball @ldraney
Twistlock has fallen over again, restarting and deleting the messages file when it comes back up.
-rw-------. 1 root root 6060854 Jul 21 20:29 messages
That's only 5-6 MB. I wouldn't expect issues? Is there anything interesting in there?
@k-macmillan the output is all piped to CloudWatch, and there are some interesting things in there. My guess is that whatever is writing to that file is struggling to open it to add another line. The file is not that big as far as storage is concerned, but memory might be limiting the write operation, which is why most log files are configured to roll over to a new file at a certain size (say, 1 MB).
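A longer-term fix than deleting the file by hand would be size-based rotation. A minimal sketch of a logrotate drop-in (the path, size, and retention values here are assumptions, not what is on the instance today):

```
# Hypothetical /etc/logrotate.d/messages-size drop-in -- values are assumptions.
/var/log/messages {
    size 1M          # rotate once the file exceeds 1 MB
    rotate 4         # keep four rotated copies
    compress
    copytruncate     # truncate in place so the writer keeps a valid file handle
    missingok
    notifempty
}
```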
Twistlock failed 1/2 status checks over the weekend. Restarting.
Just waiting for Twistlock to restart:
Cron job set up to delete twistlock logs every 24 hours:
sudo crontab -e
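For reference, a sketch of what the root crontab entry could look like; the schedule and the choice of truncate over rm are assumptions:

```
# m h dom mon dow  command
# Truncate /var/log/messages daily at midnight rather than deleting it,
# so whatever is writing to it keeps a valid file handle (assumed target path).
0 0 * * * /usr/bin/truncate -s 0 /var/log/messages
```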
I tested running twistlock and resource utilization was non-critical.
Here is a video of running twistlock with top
Twistlock was run with this script:
#!/bin/bash
# Registry, repository, and tag values below are placeholders.
ECR_REGISTRY="registry"
IMAGE_REPOSITORY="repo"
IMAGE_TAG="tag"
# Log in to ECR and pull the image to be scanned.
aws --region us-gov-west-1 ecr get-login-password | docker login --username AWS --password-stdin "$ECR_REGISTRY" 2>/dev/null
docker pull "$ECR_REGISTRY"/"$IMAGE_REPOSITORY":"$IMAGE_TAG" 1>/dev/null
docker logout "$ECR_REGISTRY"
# Fetch the Twistlock credential from SSM Parameter Store and run the scan.
export TWISTLOCK_PASSWORD=$(aws --region us-gov-west-1 ssm get-parameter --name /utility/twistlock/vanotify-ci-user-password --with-decryption | jq '.Parameter.Value' -r)
twistcli images scan --project vanotify --address https://twistlock.devops.va.gov --user vanotify-ci-user "$ECR_REGISTRY"/"$IMAGE_REPOSITORY":"$IMAGE_TAG"
STATUS_CODE=$?
# Clean up pulled images and propagate the scan's exit code.
docker image prune -a -f 1>/dev/null
exit $STATUS_CODE
So it doesn't seem necessary to upgrade RAM, at least.
Twistlock running successfully:
I'm trying to connect to the instance but the screen is now blank:
I'm stopping the instance to restart and get in and try a:
df -h
to see how much disk space we are utilizing.
df -h gives us:
And I tested Twistlock again and saw that free memory drops to 49 MB out of 2000 MB (so we are at roughly 98% memory utilization). Increasing the EC2 size will very likely make a difference at this point.
Twistlock System Requirements (geared toward the main server that twistcli connects to; it does not specifically speak to how an instance running twistcli should be sized)
Twistlock documentation on twistcli
Also, since t3 is a burstable instance type that uses credits, it may be that we simply ran out of credits, which would severely limit the twistlock performance; this would also explain why it often works.
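If credit exhaustion is the suspicion, the instance's CPUCreditBalance metric in CloudWatch should confirm it. A sketch (the instance ID and time window are placeholders, and GNU date is assumed):

```
# Pull the t3 CPU credit balance for the last 6 hours at 5-minute resolution.
aws --region us-gov-west-1 cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUCreditBalance \
  --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
  --start-time "$(date -u -d '6 hours ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average
```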
Latest video shows t3.medium at 75% usage
So the cause is likely an EC2 instance that is too small for Twistlock:
I gathered the following data in the t3.medium:
Not Under Load:
Under Load / running Twistlock:
Therefore, the memory utilization when not under load is roughly 43.94%, and under load, it is approximately 78.59%.
Additional Test - Running Twistlock Twice Simultaneously
The test of running Twistlock twice simultaneously shows that memory utilization spikes to approximately 97.37%.
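For anyone re-running these measurements, one way to capture the utilization numbers directly on the instance (a sketch, assuming procps `free` is available; run it before, during, and while two scans overlap):

```
# Print memory utilization as used/total reported by `free`.
free -m | awk '/^Mem:/ {printf "memory utilization: %.2f%%\n", $3/$2*100}'
```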
Please let me know if you need further information or testing.
Re-deployed the t3.medium EC2 and redid the cron job; opening the PR now.
I'm not sure what to put on this follow-on spike ticket for twistlock. Cris started a doc and I added to the end. It sounds like we'll finish that doc and send it to Mel and Bev. I’m waiting to confirm this course of action with Cris and whoever before closing this ticket. Meanwhile, I think the best course of action is to leave Twistlock until our better solution is approved and available.
@ldraney Can you make sure @k-macmillan approves before any plans go to me or Bev to approve? Is it a cost thing that we'll need to approve?
From Kyle:
Lucas, compare the cost of running a t3.medium 24/7/365 vs. the advanced scan costs. Try to grab how many scans we do per week and multiply it by 52 (a week will be more accurate than a day). TBH, engineer costs to maintain Twistlock are likely greater than any cost for scanning.
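A back-of-the-envelope sketch of that comparison, using the per-scan prices quoted above; the hourly rate and scan counts here are placeholders, not measured values:

```
#!/bin/bash
# Placeholder inputs -- substitute real numbers before drawing conclusions.
EC2_HOURLY_RATE=0.05    # assumed t3.medium on-demand rate in USD/hr (check GovCloud pricing)
SCANS_PER_WEEK=20       # assumed initial on-push scans per week ($0.09 each)
RESCANS_PER_WEEK=50     # assumed continuous re-scans per week ($0.01 each)

ec2_annual=$(echo "$EC2_HOURLY_RATE * 24 * 365" | bc -l)
inspector_annual=$(echo "($SCANS_PER_WEEK * 0.09 + $RESCANS_PER_WEEK * 0.01) * 52" | bc -l)

printf 'EC2 t3.medium 24/7/365:      $%.2f/yr\n' "$ec2_annual"
printf 'Inspector enhanced scanning: $%.2f/yr\n' "$inspector_annual"
```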
@ldraney to add costs to document and create new ticket, then we can close this one out.
Twistlock failed 1/2 status checks again today and failed to run a scan. Stopping, restarting, and logging in to check things.
Restarted; it came back up, I re-tried the scan, and it worked.
From Lucas
Well, in the last fifteen minutes we got 1k errors saying the VAEC network is unreachable. Twistlock is set up on the AMI, and VAEC may have set up their own server. The way Twistlock works is that it collects information about the packages and binaries in the container image and sends this data to the Prisma Cloud Console for analysis. I'm guessing VAEC set up their own server for security reasons. So Twistlock may be failing for reasons outside our control - another reason to move away from it, if I'm right.
Logs are now in DD; some patterns indicate that the failure is on a heartbeat to Logstash:
Jul 27 HH:mm:ss ip-<redacted> * : yyyy-MM-ddTHH:mm:ss.SSSXXX #011INFO#011[publisher_pipeline_output]#011pipeline/output.go:145#011Attempting to reconnect to backoff(async(tcp://aws-logstash-west.<redacted>)) with [1-210] reconnect attempt(s)
Jul 27 HH:mm:ss ip-<redacted> * : yyyy-MM-ddTHH:mm:ss.SSSXXX #011ERROR#011[publisher_pipeline_output]#011pipeline/output.go:154#011Failed to connect to backoff(async(tcp://aws-logstash-west.<redacted>)): lookup aws-logstash-west.ecs.vaec.va.gov on 10.247.142.130:53: dial udp 10.247.142.130:53: connect: network is unreachable
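To confirm whether name resolution and the Logstash endpoint are actually reachable from the instance, a quick check like this could help (the hostname and resolver come from the error above; port 5044 is an assumption about the Beats/Logstash listener):

```
# Can we reach the VPC resolver, and does the logstash hostname resolve?
dig @10.247.142.130 aws-logstash-west.ecs.vaec.va.gov +time=3 +tries=1
# Is the Logstash TCP port reachable? (5044 is an assumed port)
nc -vz -w 3 aws-logstash-west.ecs.vaec.va.gov 5044
```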
Jul 27 ip-<redacted> kernel: audit: audit_backlog=8193 > audit_backlog_limit=8192
We are reaching some limit on auditing.
Info on audit logs https://docs.paloaltonetworks.com/prisma/prisma-cloud/prisma-cloud-admin-compute/audit/logging
Unable to connect to session manager, restarting.
Lucas quadrupled the audit_backlog_limit
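For reference, one way that change could be applied (32768 is four times the default 8192 seen in the kernel message above; whether it was set at runtime, persisted in rules, or both is an assumption):

```
# Raise the audit backlog limit at runtime.
sudo auditctl -b 32768

# Persist it across reboots via the audit rules (file location may differ by distro).
echo '-b 32768' | sudo tee /etc/audit/rules.d/99-backlog.rules
sudo augenrules --load
```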
Twistlock fell over again today.
ROI document for switching to AWS Inspector
Story created for implementing AWS Inspector
Next steps:
Description
The Twistlock instance is responding to SSM commands with InProgress and then failing to run the scan. The instance itself also eventually reports that the 'Instance reachability check failed'. Typically this has been resolved by stopping the instance, starting it again, waiting for the 2 status checks to pass, then re-running the Twistlock action.
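A sketch of that stop/start/wait cycle with the AWS CLI (the real instance ID lives in the /utility/twistlock/instance SSM parameter referenced by the workflow):

```
# Stop the instance, wait, start it again, and wait for both status checks to pass.
INSTANCE_ID=$(aws ssm get-parameter --name /utility/twistlock/instance | jq '.Parameter.Value' -r)
aws ec2 stop-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-stopped --instance-ids "$INSTANCE_ID"
aws ec2 start-instances --instance-ids "$INSTANCE_ID"
aws ec2 wait instance-status-ok --instance-ids "$INSTANCE_ID"
# Then re-run the "Perform Twistlock scan" GitHub Action.
```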
This morning, the Twistlock instance was restarted twice after a failure last night, the scan was rerun twice, and the result is:
Eventually the instance failed its reachability status check.
Steps to Reproduce
Workaround
Is there something we can do to work around this issue in the meantime? No - there is no workaround at present.
Impact/Urgency
High - Twistlock scans our app container for security vulnerabilities.
Expected Behavior
/var/log/messages is cleaned up daily
QA Considerations
Additional Info & Resources
There are Cloudwatch logs for the Twistlock EC2 which should be reviewed to find any smoking guns. From the errors.log:
```
2023-07-19 13:41:15 ERROR [New @ errors.go.59] [ssm-agent-worker] [MessageService] [Association] [EngineProcessor] [OutOfProcExecuter] [43fb267c-f4d1-4f2d-959b-041a43de4339.2023-07-18T14-07-46.567Z] received timedout signal!
2023-07-19 13:41:15 ERROR [messaging @ master.go.131] [ssm-agent-worker] [MessageService] [Association] [EngineProcessor] [OutOfProcExecuter] [43fb267c-f4d1-4f2d-959b-041a43de4339.2023-07-18T14-07-46.567Z] messaging worker encountered error: document worker timed out
```
From the failed action on the Perform Twistlock scan, before it starts printing out the WaiterCommands:
```
Run INSTANCE_ID=$(aws ssm get-parameter --name /utility/twistlock/instance | jq '.Parameter.Value' -r)
INSTANCE_ID=$(aws ssm get-parameter --name /utility/twistlock/instance | jq '.Parameter.Value' -r)
COMMAND_ID=$(aws ssm send-command \
  --document-name "AWS-RunShellScript" \
  --instance-ids $INSTANCE_ID \
  --parameters '{"commands":[ "#!/bin/bash",
```