IBM / core-dump-handler

Save core dumps from a Kubernetes Service or RedHat OpenShift to an S3 protocol compatible object store
https://ibm.github.io/core-dump-handler/
MIT License

Compatibility with Oracle Kubernetes Engine. #146

Open shb-mll opened 1 year ago

shb-mll commented 1 year ago

Hi Team,

I want to install the core dump handler on an OKE cluster (the nodes are on v1.24.1 with Oracle-Linux-8.7). The compatibility list at https://github.com/IBM/core-dump-handler#kubernetes-service-compatibility doesn't list Oracle Linux as a supported product. However, could you confirm whether this can be deployed on OKE? If yes, could you provide the below?

No9 commented 1 year ago

Hey @shb-mll Unfortunately I don't have an OKE account, so I can't provide supporting configuration or the configuration to send core dumps to OCI buckets.

That said, if you want to try to install it and report errors or questions in this thread, I am happy to troubleshoot with you as much as I can.

Also happy for you to land any changes you discover as a PR.

shb-mll commented 1 year ago

I installed core-dump-handler on OKE with the below values for the daemonset. The segfaulter test confirms the cores were collected at /var/mnt/core-dump-handler/cores on the node.

daemonset:
  crioEndpoint: "unix:///var/run/crio/crio.sock"
  hostContainerRuntimeEndpoint: "/run/crio/crio.sock"
  mountContainerRuntimeEndpoint: true
  extraEnvVars: |-
    - name: S3_ENDPOINT
      value: "https://{bucketnamespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com"
  s3BucketName: "BUCKETNAME"
  s3Region: "us-ashburn-1"
  s3Secret: "XXX"
  s3AccessKey: "XXX"

OCI is Amazon S3 API compatible (link). As outlined in the prerequisite section of that article, I have configured the below to set up access from the Amazon S3 compatible client (core-dump-handler) to Object Storage:

a. designated a compartment for the Amazon S3 Compatibility API
b. created a customer secret key
c. set the S3_ENDPOINT

However, I see the below error during upload to the OCI bucket.

[2023-05-20T09:20:52Z INFO  core_dump_agent] Executing Agent with location : /var/mnt/core-dump-handler/cores
[2023-05-20T09:20:52Z INFO  core_dump_agent] Setting s3 endpoint location to: https://{bucketnamespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com
[2023-05-20T09:20:52Z INFO  core_dump_agent] Dir Content ["/var/mnt/core-dump-handler/cores/858c15b9-8ef2-47b5-97ac-6ce2febb272a-dump-1684532650-segfaulter-segfaulter-1-4.zip"]
[2023-05-20T09:20:52Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/858c15b9-8ef2-47b5-97ac-6ce2febb272a-dump-1684532650-segfaulter-segfaulter-1-4.zip
[2023-05-20T09:20:52Z INFO  core_dump_agent] zip size is 29662
[2023-05-20T09:20:52Z ERROR core_dump_agent] Upload Failed Got HTTP 403 with content '<?xml version="1.0" encoding="UTF-8"?><Error><Message>The required information to complete authentication was not provided.</Message><Code>SignatureDoesNotMatch</Code></Error>'
[2023-05-20T09:20:52Z INFO  core_dump_agent] INotify Starting...
[2023-05-20T09:20:52Z INFO  core_dump_agent] INotify Initialised...
[2023-05-20T09:20:52Z INFO  core_dump_agent] INotify watching : /var/mnt/core-dump-handler/cores
shb-mll commented 1 year ago

OK, it appears there was some issue with the node where the coredump was collected. I removed the node, ran the segfaulter test again, and it completed with a successful upload to the OCI bucket.

However, storing the customer secret key in a Kubernetes secret is not the optimal way; I need to find a better way to authenticate to the OCI bucket.

No9 commented 1 year ago

Hey @shb-mll This is excellent progress, thanks for the update. In terms of a different way to manage access, you may want to investigate whether OKE workload identities are integrated with OCI buckets. There may be a similar pattern to the AWS Security Token Service available: https://github.com/IBM/core-dump-handler/blob/main/charts/core-dump-handler/values.aws.sts.yaml
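
For reference, the AWS pattern boils down to annotating the chart's service account with an IAM role that is allowed to write to the bucket. A rough, illustrative sketch of that idea (this is the generic EKS IRSA pattern, not the actual contents of values.aws.sts.yaml; the value names and ARN are placeholders):

serviceAccount:
  annotations:
    # IAM role the uploader pods assume via the web identity token (placeholder ARN)
    eks.amazonaws.com/role-arn: "arn:aws:iam::123456789012:role/core-dump-uploader"

The question is whether OKE offers an equivalent way to map a Kubernetes service account to an OCI identity.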

shb-mll commented 1 year ago

Hey @No9, yeah I checked that; however, the workload identity implementation is different in OKE. As of today there is no concept of annotating a k8s service account with an OCI IAM account. With the workload identity feature the OKE service account does act as an identity, but the authorisation is handled in the application code. So to use this, one would need to update the core-dump-handler code to use OkeWorkloadIdentityAuthenticationDetailsProvider and also provide the OCI resource to access.

Currently workload identity is only supported in the Go and Java SDKs: https://docs.oracle.com/en-us/iaas/Content/ContEng/Tasks/contenggrantingworkloadaccesstoresources.htm#:~:text=The%20following%20OCI,v2.54.0%20(and%20later). So I am not sure if it will work in Rust.

No9 commented 1 year ago

OK, can you explain a bit more about what is meant by:

"storing the customer secret key in a Kubernetes secret is not the optimal way"?

If you are looking to provide the core dump to an external user, you may want to look at building a post processor using one of these two options:

  1. Disable the built-in uploader and provide your own: https://github.com/IBM/core-dump-handler/blob/main/FAQ.md#how-should-i-integrate-my-own-uploader

  2. Enable the event system and implement an external service. By setting the daemonset.eventDirectory and composer.coreEvents options in the chart, an extra file is generated alongside each core dump that can be used for post-processing. This lets you still use the upload feature, but you may want to move the core dump to a location outside of the core environment. A minimal values sketch is shown below.
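
For option 2, something along these lines in the chart values should enable the event files (a minimal sketch; the eventDirectory path here is illustrative, so check the chart's values.yaml for the exact defaults):

daemonset:
  # Host directory where the event files are written (illustrative path)
  eventDirectory: "/var/mnt/core-dump-handler/events"
composer:
  # Generate an additional event file alongside each core dump zip
  coreEvents: true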

shb-mll commented 1 year ago

The customer secret key is created per user in OCI. This customer secret key is an Access Key/Secret Key pair used to access object storage in OCI via the Amazon S3 compatible API.

Going by the default setup of core-dump-handler, the keys are base64 encoded and stored in the secret s3config, which is not that secure.

Thanks for the suggestions, I will check if it's possible to implement the two options in my setup.

About option 2: is there more information on how exactly the additional file can be used for post-processing? Could you share some example setups if available?

shb-mll commented 11 months ago

Hi @No9 Due to some requirements I had to downgrade my worker nodes to Oracle Linux 7 (earlier I was using Oracle Linux 8). I made some changes to the daemonset values (listed below), and with these changes I can see the coredump is generated by the composer; however, it runs into an error at the upload step. Could you assist with this error message?

[2023-09-15T19:28:15Z INFO  core_dump_agent] Uploading: /var/mnt/core-dump-handler/cores/xxx-xxx-xxx-xx-xxx-dump-xxx-segfaulter-segfaulter-1-4.zip
[2023-09-15T19:28:15Z INFO  core_dump_agent] zip size is 29397
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidHeaderValue', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rust-s3-0.31.0/src/request_trait.rs:434:65
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

My current settings for daemonset and composer:

daemonset:
  crioEndpoint: "unix:///var/run/crio/crio.sock"
  hostContainerRuntimeEndpoint: "/run/crio/crio.sock"
  mountContainerRuntimeEndpoint: true
  vendor: rhel7
  extraEnvVars: |-
    - name: S3_ENDPOINT
      value: "https://{namespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com"
composer:
  logLevel: "Debug"
No9 commented 11 months ago

Hey @shb-mll This error is being thrown by the rust-s3 library:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: InvalidHeaderValue', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/rust-s3-0.31.0/src/request_trait.rs:434:65

https://github.com/durch/rust-s3/blob/7fdb685d71385152198f906068f15faaabd28592/s3/src/error.rs#L39

Looks like the Oracle object storage API isn't compatible with that library.

Just double checking, but in your daemonset config have you replaced {namespace} with the actual namespace?

If you have configured the namespace properly, then can I suggest you raise an issue (or provide a fix) in the rust-s3 library, and we can catch it by bumping the dependency.

[Edit] Or you can provide your own uploader, as discussed previously.

shb-mll commented 11 months ago

@No9 Yes, I provided the namespace value for S3_ENDPOINT.

Also, this worked on Oracle Linux 8 when I tested earlier on 20th May (screenshot below).

Also, there has been no update to the Oracle compatibility API since 2017: https://docs.oracle.com/en-us/iaas/releasenotes/changes/0045f4a2-9afa-4f68-86b6-59dd70052ca8/

No9 commented 11 months ago

Thanks for the update. As the last release for this project was in January, the service was working in May, and the error is based on the HTTP response from the object storage service, it does point to the issue being caused by downstream (i.e. object storage) config issues or changes.

We don't have an Oracle Cloud account to validate or debug, so any further investigation would need to come from your side.

Can I suggest, as a next step, that you reproduce the issue by creating a standalone app that just uses the same version of the rust-s3 library (0.31.0)? This will really help with triage. A rough sketch of such a reproduction is below.
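
Something along these lines should be enough to exercise the same upload path in isolation (an untested sketch, assuming the rust-s3 0.31 async API and a tokio runtime; the endpoint, bucket name, and keys are placeholders, and the exact return type of put_object can differ between rust-s3 versions):

// Cargo.toml (assumed): rust-s3 = "=0.31.0", tokio = { version = "1", features = ["full"] }
use s3::bucket::Bucket;
use s3::creds::Credentials;
use s3::Region;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same custom region/endpoint pair the agent builds from S3_REGION / S3_ENDPOINT.
    let region = Region::Custom {
        region: "us-ashburn-1".to_string(),
        endpoint: "https://{bucketnamespace}.compat.objectstorage.us-ashburn-1.oraclecloud.com".to_string(),
    };

    // Access key / secret key from the OCI customer secret key pair (placeholders).
    let credentials = Credentials::new(Some("ACCESS_KEY"), Some("SECRET_KEY"), None, None, None)?;
    let bucket = Bucket::new("BUCKETNAME", region, credentials)?;

    // Upload a small payload; if the InvalidHeaderValue panic comes from rust-s3,
    // it should reproduce here on the same Oracle Linux 7 node.
    let response = bucket.put_object("/repro-test.zip", b"hello from the repro app").await?;
    println!("put_object response: {:?}", response);
    Ok(())
}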