Closed StevenBarre closed 1 year ago
According to https://access.redhat.com/support/cases/#/case/03278768/ (under the advsol.mcs user), we can EITHER have logs get forwarded to CloudWatch OR we can install the usual EFK stack.
According to Julian:
Each team can view and add to the existing logs, but they can’t remove what’s there. So each team is responsible for their own logging and has tools at the AWS Account level for managing logs.
On top of that, CPF also retains the logs over in an immutable bucket in the ASEA core account areas away from where teams can see or modify them. From there we ship down to the SecOps Hadoop SIEM on-prem.
So it seems like CloudWatch would be the preferred option as it ensures proper log retention and leveraging of the SEA platform. However, as the Ops team would own the AWS Account, product teams would be unable to see the CloudWatch logs.
As of OCP 4.10 on ROSA, we can deploy both the CloudWatch and EFK logging stacks. This was tested successfully on rosa-lab under "Test Logging Operator and AWS CloudWatch on the Government ROSA Cluster" (#3096).
One caveat to running both stacks is cost. Logging is resource-intensive by nature, and because CloudWatch and EFK would collect the same logs, running both roughly doubles the logging cost.
In our rosa-lab test, AWS Cost Explorer shows CloudWatch costing $40-$80 per day.
CloudWatch cost trend:
Note: Log forwarding from the ROSA cluster to CloudWatch began on November 16. This fee is a pay-as-you-go charge, so if CloudWatch receives more logs or users start querying via CloudWatch etc, there will be additional charges. For more details on the fee, please click here.
On the other hand, EFK does not affect the CloudWatch cost described above. Its cost shows up under other service charges, such as the ROSA service and EC2 instances:
EFK was started on Nov 25 (red arrow). The daily cost jumped by about $60/day, mainly in EC2 and ROSA charges due to EFK's resource requests.
This EFK deployment is very small and will not be used in a Prod environment. Therefore, if production resource requirements were applied, the cost would be much higher than this chart.
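To put the daily figures above on a monthly scale, here is a rough back-of-the-envelope sketch. It assumes a flat 30-day month and that the observed rosa-lab daily costs stay constant, neither of which is guaranteed (CloudWatch is pay-as-you-go, and a production-sized EFK would cost more). The variable names are illustrative only.

```python
# Rough monthly projection from the daily figures quoted in this thread.
# Assumptions: flat 30-day month, daily costs stay at the observed lab levels.
DAYS_PER_MONTH = 30

cloudwatch_daily = (40, 80)  # USD/day, observed range on rosa-lab
efk_daily_increase = 60      # USD/day, approximate jump after EFK was started


def monthly(daily_usd, days=DAYS_PER_MONTH):
    """Scale a daily cost to a monthly cost."""
    return daily_usd * days


cloudwatch_monthly = tuple(monthly(d) for d in cloudwatch_daily)
efk_monthly = monthly(efk_daily_increase)
both_monthly = tuple(c + efk_monthly for c in cloudwatch_monthly)

print(f"CloudWatch only: ${cloudwatch_monthly[0]}-${cloudwatch_monthly[1]} per month")
print(f"EFK only (lab size): about ${efk_monthly} per month")
print(f"Both stacks: ${both_monthly[0]}-${both_monthly[1]} per month")
```

Under these assumptions, the small lab-sized EFK alone lands in the same range as CloudWatch, so running both roughly doubles the bill.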
According to the document from RH, the infra node where EFK is deployed is r5.xlarge, which has only 4 vCPU and 32 GiB memory for PROD:
Consulting Engagement Report: BC Gov - Managed Cloud OpenShift Pathfinder
4.1.8. Cluster Sizing
• Two clusters will be deployed one lab cluster and a prod type cluster.
• The Lab cluster will have up to 4 workers with 8 CPU cores, m6i.2xlarge EC2 instances.
• The Prod type cluster will have up to 4 workers with 32 CPU cores, m6i.8xlarge EC2 instances.
• The 3 master nodes are m5.2xlarge EC2 instances and the 2 infra nodes are r5.xlarge EC2 instances which is not configurable
This infra node size needs to be expanded in order to get the same level of EFK logging as current mid-size OCP clusters such as GOLD and GOLD DR. These have 32 CPU cores and 251 GiB memory, which is roughly equivalent to an r6in.8xlarge with 32 vCPUs and 256 GiB memory on AWS. At a minimum, the infra nodes should be the same size as a prod worker node (m6i.8xlarge).
This hourly cost covers only the EC2 instance; other charges, such as data transfer and ROSA service fees, will be added to the total cost. To check pricing, see the link below:
For reference, the following resources are configured for EFK.
GOLD, GOLDDR
cpuLimit: 4
memLimit: 16Gi
cpuReq: 1
memReq: 16Gi
SILVER
cpuLimit: 16
memLimit: 64Gi
cpuReq: 8
memReq: 64Gi
ROSA-LAB
cpuLimit: 1
memLimit: 3Gi
cpuReq: 500m
memReq: 3Gi
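As a quick sanity check on the sizing discussion, the sketch below compares the EFK resource requests listed above against the instance types mentioned in this thread. It is only an illustration under several assumptions: the listed values are treated as the requests of a single pod, node capacity is taken as the raw instance size (ignoring system-reserved resources and other infra workloads), and the 128 GiB figure for m6i.8xlarge comes from the AWS instance specs rather than from this thread. The helper names are hypothetical.

```python
# Sketch: does a single EFK pod's request fit on a given node type?
# Assumptions: values are per pod; node capacity is the raw instance size.

def parse_cpu(value: str) -> float:
    """Convert a Kubernetes CPU quantity to cores ('500m' -> 0.5, '8' -> 8.0)."""
    if value.endswith("m"):
        return int(value[:-1]) / 1000
    return float(value)


def parse_mem_gib(value: str) -> float:
    """Convert a memory quantity to GiB. Only the 'Gi' suffix is handled here."""
    if not value.endswith("Gi"):
        raise ValueError(f"unsupported memory quantity: {value}")
    return float(value[:-2])


# Requests copied from the configuration listed above.
EFK_REQUESTS = {
    "GOLD/GOLDDR": {"cpuReq": "1", "memReq": "16Gi"},
    "SILVER": {"cpuReq": "8", "memReq": "64Gi"},
    "ROSA-LAB": {"cpuReq": "500m", "memReq": "3Gi"},
}

# (vCPU, memory in GiB) per instance type.
NODES = {
    "r5.xlarge": (4, 32),
    "m6i.8xlarge": (32, 128),  # memory taken from AWS specs, not this thread
    "r6in.8xlarge": (32, 256),
}


def fits(cluster: str, node: str) -> bool:
    """Return True if the cluster's EFK request fits within the node's raw capacity."""
    req = EFK_REQUESTS[cluster]
    cpu, mem = NODES[node]
    return parse_cpu(req["cpuReq"]) <= cpu and parse_mem_gib(req["memReq"]) <= mem


for cluster in EFK_REQUESTS:
    for node in NODES:
        print(f"{cluster:12} on {node:13}: {'fits' if fits(cluster, node) else 'does not fit'}")
```

With these numbers, a SILVER-sized request does not fit on an r5.xlarge infra node, while the ROSA-LAB request does, which is consistent with the point that the infra nodes would need to grow for production-level EFK.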
Another point to consider is operating cost.
As on the hybrid OCP clusters (Silver, Gold, etc.), an EFK deployment requires the cluster administrator to patch and upgrade the EFK operators. AWS CloudWatch does not require such work, as Amazon does it for you once you have configured the Cluster Log Forwarder for CloudWatch Logs.
Summary of findings so far:
We probably should not run both logging stacks at the same time, as that wastes cluster resources and money. If we can resolve item 3 (product teams would be unable to see the CloudWatch logs), CloudWatch would be the best solution.
Waiting for review.
Very thorough @tmorik ! Looks good to me. We'll wait for Milo to get back from vacation and hand it over to him to review.
Passed info to Milo, closing
Describe the issue
Identify gaps where the deployment into a public cloud based cluster changes the way we can supply logging services.
What is the Value/Impact?
App logs available
What is the plan? How will this get completed?
Review ROSA and classic cluster logging, document differences, propose solutions.
Identify any dependencies
Definition of done
A plan on how to close any gaps we can, or documentation of those we can't