ROSA: Identify differences in Logging

StevenBarre commented 2 years ago

Describe the issue: Identify gaps where deployment into a public-cloud-based cluster changes the way we can supply logging services.

What is the Value/Impact? App logs available.

What is the plan? How will this get completed? Review ROSA and classic cluster logging, document differences, propose solutions.

Identify any dependencies

Definition of done: A plan on how to close any gaps we can, or documentation of those we can't.

StevenBarre commented 2 years ago

According to https://access.redhat.com/support/cases/#/case/03278768/ (under the advsol.mcs user), we can EITHER forward logs to CloudWatch OR install the usual EFK stack.

According to Julian:

Each team can view and add to the existing logs, but they can’t remove what’s there. So each team is responsible for their own logging and has tools at the AWS Account level for managing logs.

On top of that, CPF also retains the logs over in an immutable bucket in the ASEA core account areas away from where teams can see or modify them. From there we ship down to the SecOps Hadoop SIEM on-prem.

So CloudWatch seems to be the preferred option, as it ensures proper log retention and leverages the SEA platform. However, since the Ops team would own the AWS Account, product teams would be unable to see the CloudWatch logs.

tmorik commented 1 year ago

As of OCP 4.10 on ROSA, we can deploy both CloudWatch log forwarding and the EFK logging stack. This was tested successfully on rosa-lab under "Test Logging Operator and AWS CloudWatch on the Government ROSA Cluster" #3096.

One caveat to using both logging solutions is cost. Logging is resource-intensive by nature, and because both solutions would collect the same logs, running CloudWatch and EFK together roughly doubles the running cost.

In our rosa-lab test, AWS Cost Explorer shows CloudWatch costing $40-$80 per day.

CloudWatch cost trend:

Note: Log forwarding from the ROSA cluster to CloudWatch began on November 16. This is a pay-as-you-go charge, so if CloudWatch receives more logs, or users start querying via CloudWatch, there will be additional charges. For more details on the fees, see CloudWatch Pricing.

tmorik commented 1 year ago

EFK, on the other hand, does not affect the CloudWatch cost described above, but it adds to other service charges, such as the ROSA service and EC2 instances:

EFK was started on Nov 25th (red arrow). The daily cost jumped about $60, mainly EC2 and ROSA charges due to EFK's resource requests.

This EFK deployment is very small and would not be used in a Prod environment. If production resource requirements were applied, the cost would be much higher than this chart shows.

According to the document from RH, the infra nodes where EFK is deployed are r5.xlarge, which has only 4 vCPUs and 32 GiB memory, even for PROD:
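To put the daily figures above on a monthly footing, here is a rough sketch in Python. It uses only the rosa-lab numbers quoted in this thread and makes two assumptions that are not from the charts: a 30-day month, and that the ~$60/day jump is EFK's incremental cost. The names `project` and `monthly` are illustrative only.

```python
# Daily figures quoted in this thread (USD/day), observed on rosa-lab only.
CLOUDWATCH_PER_DAY = (40.0, 80.0)  # low, high range from AWS Cost Explorer
EFK_PER_DAY = 60.0  # assumption: the ~$60/day jump is EFK's incremental cost

DAYS = 30  # assumption: a 30-day billing month


def monthly(per_day: float, days: int = DAYS) -> float:
    """Scale a flat daily cost to a number of days."""
    return per_day * days


def project(days: int = DAYS) -> dict:
    """Return (low, high) cost estimates for each logging option."""
    cw_low, cw_high = (monthly(x, days) for x in CLOUDWATCH_PER_DAY)
    efk = monthly(EFK_PER_DAY, days)
    return {
        "cloudwatch_only": (cw_low, cw_high),
        "efk_only": (efk, efk),
        "both": (cw_low + efk, cw_high + efk),
    }


if __name__ == "__main__":
    for option, (low, high) in project().items():
        print(f"{option:16s} ${low:,.0f} - ${high:,.0f} per {DAYS} days")
```

On these lab numbers, running both comes to roughly $3,000-$4,200 per 30 days. These are lab-scale figures, so production costs would differ.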

Consulting Engagement Report: BC Gov - Managed Cloud OpenShift Pathfinder

4.1.8. Cluster Sizing

• Two clusters will be deployed one lab cluster and a prod type cluster.
• The Lab cluster will have up to 4 workers with 8 CPU cores, m6i.2xlarge EC2 instances.
• The Prod type cluster will have up to 4 workers with 32 CPU cores, m6i.8xlarge EC2 instances.
• The 3 master nodes are m5.2xlarge EC2 instances and the 2 infra nodes are r5.xlarge EC2 instances which is not configurable

The infra node size needs to be increased to get the same level of EFK logging as current mid-size OCP clusters such as GOLD and GOLD DR. Their nodes have 32 CPU cores and 251 GiB memory, which is roughly equivalent to an r6in.8xlarge (32 vCPUs and 256 GiB memory) in AWS. At a minimum, the infra nodes should be the same size as a prod worker node (m6i.8xlarge).

This hourly cost covers only the EC2 instance; other charges, such as data transfer and ROSA service fees, will be added to the total cost. To check pricing, see the link below:

For reference, the following resources are configured for EFK:

GOLD, GOLDDR

  cpuLimit: 4
  memLimit: 16Gi
  cpuReq: 1
  memReq: 16Gi

SILVER

  cpuLimit: 16
  memLimit: 64Gi
  cpuReq: 8
  memReq: 64Gi

ROSA-LAB

  cpuLimit: 1
  memLimit: 3Gi
  cpuReq: 500m
  memReq: 3Gi
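As a rough sanity check on the sizing concern, the sketch below compares each profile's requests against one r5.xlarge infra node (4 vCPUs, 32 GiB, as quoted above). It assumes the values are per-pod requests, which this thread does not state, and it compares against raw node capacity, ignoring what the system and other infra workloads reserve. The quantity parser is deliberately simplified and only handles the `m` and `Gi` suffixes used here.

```python
# EFK resource requests listed in this thread, keyed by cluster.
PROFILES = {
    "GOLD/GOLDDR": {"cpuReq": "1", "memReq": "16Gi"},
    "SILVER": {"cpuReq": "8", "memReq": "64Gi"},
    "ROSA-LAB": {"cpuReq": "500m", "memReq": "3Gi"},
}

# r5.xlarge raw capacity as quoted in the thread (not allocatable capacity).
R5_XLARGE = {"vcpu": 4.0, "mem_gib": 32.0}


def parse_cpu(quantity: str) -> float:
    """Convert a CPU quantity such as '500m' or '8' to cores."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


def parse_mem_gib(quantity: str) -> float:
    """Convert a memory quantity such as '16Gi' to GiB (only 'Gi' supported)."""
    if not quantity.endswith("Gi"):
        raise ValueError(f"unsupported memory unit: {quantity}")
    return float(quantity[:-2])


def fits(profile: dict, node: dict = R5_XLARGE) -> bool:
    """True if one pod's CPU and memory requests fit within the node's raw capacity."""
    return (
        parse_cpu(profile["cpuReq"]) <= node["vcpu"]
        and parse_mem_gib(profile["memReq"]) <= node["mem_gib"]
    )


if __name__ == "__main__":
    for name, profile in PROFILES.items():
        verdict = "fits" if fits(profile) else "does not fit"
        print(f"{name}: {verdict} on one r5.xlarge")
```

Under these assumptions, the SILVER requests (8 CPU, 64Gi) exceed a single r5.xlarge on both CPU and memory, while the GOLD and ROSA-LAB requests fit.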

tmorik commented 1 year ago

Another point we should consider is operating cost.

As with the hybrid OCP clusters (Silver, Gold, etc.), an EFK deployment requires the cluster administrator to patch and upgrade the EFK operators. AWS CloudWatch requires no such work: Amazon handles it once you have configured the Cluster Log Forwarder for CloudWatch Logs.

tmorik commented 1 year ago

Summary of findings so far:

1. AWS CloudWatch is preferable for logging as it ensures proper log retention and leverages the SEA platform.
2. AWS CloudWatch logging is pay-as-you-go, which means that as logging increases, so does the cost.
3. Product Teams would be unable to see the CloudWatch logs.
4. From OCP 4.10, the EFK logging operator can coexist with CloudWatch.
5. Using EFK logging, Product Teams will be able to see the logs.
6. EFK logging requires large EC2 instances for infra nodes and will be costly.
7. EFK logging requires manual updates/patching, but CloudWatch does not.

We should probably not run both logging solutions at the same time, as it wastes cluster resources and money. If we can resolve point 3 (Product Teams would be unable to see the CloudWatch logs), CloudWatch would be the best solution.

tmorik commented 1 year ago

A Word doc is being drafted here (advsol Teams): https://advsolcan.sharepoint.com/:w:/r/sites/ManagedContainerServicesMCS/Shared%20Documents/ROSA%20or%20ARO/Differences%20between%20EFK%20and%20CloudWatch%20logging.docx?d=w3fbf44ba3113485d962b3cc4c2eb09bd&csf=1&web=1&e=Ej6cYF&isSPOFile=1

tmorik commented 1 year ago

Waiting for review.

StevenBarre commented 1 year ago

Very thorough, @tmorik! Looks good to me. We'll wait for Milo to get back from vacation and hand it over to him to review.

StevenBarre commented 1 year ago

Passed info to Milo, closing

BCDevOps / developer-experience

ROSA: Identify differences in Logging #3157
