Ramen catalog fails to report healthy in drenv, potentially due to olm installation differences

ShyamsundarR commented 1 year ago

This problem was reported earlier by @nirs: the method for installing the ramen catalog and bundles via OLM on a minikube cluster, as described here, does not work.

Subsequent testing with and without drenv resulted in the following conclusion:

- In a vanilla minikube cluster, if the steps are followed as laid out AND olm is installed using operator-sdk, the ramen bundle gets installed and the operator starts running.
- In drenv, if the steps are followed, the pod created for the catalog source in the ramen-system namespace crashes with errors like: Error: open db-118615996: permission denied
  - This leaves the CatalogSource unhealthy with a TRANSIENT_FAILURE, so the Subscription never resolves to fetch and install the bundle.
- In the same drenv-created cluster, if operator-sdk is used to uninstall and then reinstall olm, the same steps start working.
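For anyone reproducing this, here is a minimal sketch for checking whether the catalog sources in a namespace are healthy. It assumes `kubectl` points at the cluster under test and reads the `status.connectionState.lastObservedState` field that OLM sets on a CatalogSource. The function names `catalog_states` and `all_catalogs_ready` are hypothetical helpers, not part of ramen or drenv.

```shell
# Print "<name> <state>" for every CatalogSource in the given namespace.
# The state is what OLM last observed, e.g. READY or TRANSIENT_FAILURE.
catalog_states() {
    kubectl get catalogsource -n "$1" \
        -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.connectionState.lastObservedState}{"\n"}{end}'
}

# Read "<name> <state>" lines from stdin.
# Succeed only if at least one catalog is listed and all of them are READY.
all_catalogs_ready() {
    seen=1
    while read -r name state; do
        [ -n "$name" ] || continue
        seen=0
        [ "$state" = "READY" ] || return 1
    done
    return "$seen"
}
```

Typical use would be `catalog_states ramen-system | all_catalogs_ready`, which fails while the catalog pod is crashing and the state stays at TRANSIENT_FAILURE.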

The issue seems to be either the version of olm installed by drenv (0.22) or the way it is installed (although the steps seem to follow the upstream olm install procedure as laid out). This needs further investigation and a fix, in case operator-sdk is not going to be used to install olm.

Another alternative could be to install using the install script provided as part of the olm releases, and check that our catalog works. This also seems to be less work on our end than installing the various manifests one after the other.

ShyamsundarR commented 1 year ago

> Another alternative could be to install using the install script provided as part of the olm releases, and check that our catalog works. This also seems to be less work on our end than installing the various manifests one after the other.

I tried the above method: with version 0.22.0 it still failed, and with version 0.23.1 it worked as expected. For now we should move to 0.23.1 (or use operator-sdk to install the latest version, which is usually a bad idea anyway) to overcome this issue.
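A minimal sketch of installing a specific olm release with the release install script, so the version is easy to change. The URL pattern is the one used by the upstream olm releases; the function names `olm_install_url` and `install_olm` are hypothetical, and this is not how drenv installs olm today.

```shell
# Build the install script URL for an olm release tag such as v0.23.1.
olm_install_url() {
    echo "https://github.com/operator-framework/operator-lifecycle-manager/releases/download/$1/install.sh"
}

# Download the install script for the given release tag and run it.
# The script takes the release tag as its first argument.
install_olm() {
    curl -sL "$(olm_install_url "$1")" | bash -s "$1"
}
```

With this, `install_olm v0.23.1` would install the version that worked in the test above, assuming the cluster is reachable with the current kubectl context.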

A deeper analysis may throw up what the actual problem is/was, but the above should be enough to make forward progress with bundles in the e2e system.

nirs commented 1 year ago

@Shwetha-Acharya do you want to take this issue? This should be a trivial change and a good learning task.

To test this, build the ramen bundle and install it in the clusters as described in the install guide.

ShyamsundarR commented 1 year ago

After PR #729 was merged, the bundles now work with olm version 0.22, the version installed by drenv. I suspected the opm versions in use, so updating opm has potentially helped.

So we do not need to change versions unless it becomes necessary. Feel free to close this issue if needed.

nirs commented 1 year ago

Nice! But do we have any reason to pin version 0.22?

I think it is better to always use the latest release. This way, if a new release breaks us, the tests will discover this early, hopefully before users experience the breakage.

ShyamsundarR commented 1 year ago

> Nice! But do we have any reason to pin version 0.22?

Not necessary.

> I think it is better to always use the latest release. This way, if a new release breaks us, the tests will discover this early, hopefully before users experience the breakage.

We should pin it to a released version during the course of development, so we do not have to deal with instability from our dependencies.

Closer to a ramen release, we can move to the latest released version to ensure non-breakage.

nirs commented 1 year ago

Updating dependencies right before a release is too risky. I think it will be safer to update our dependencies when we start a new development cycle, for example after releasing an upstream version. This way we know that the released version was tested with certain dependencies during development.

For the next release I think it should be good enough to upgrade olm now since we don't have any upstream users yet.

nirs commented 1 year ago

I think before we upgrade olm we need to understand why we don't use one of the official ways to install olm:

Using operator-sdk https://olm.operatorframework.io/docs/getting-started/

Using the install script:

Install Operator Lifecycle Manager (OLM), a tool to help manage the Operators running on your cluster.

$ curl -sL https://github.com/operator-framework/operator-lifecycle-manager/releases/download/v0.24.0/install.sh | bash -s v0.24.0

This is part of the instructions for installing an operator, shown when clicking the "Install" button in operatorhub.io, for example in https://operatorhub.io/operator/minio-operator.

Then either change our olm installation, or document why we cannot use one of the official ways.

nirs commented 1 year ago

Before we change the olm install, we need an olm self test (olm/test).

The test should install an example operator that is quick to install and check that the operator is deployed properly.

It should pass with the current code, based on @ShyamsundarR's report, and with the new olm deploy code and olm version.
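A rough sketch of the check such a self test could end with: poll the example operator's ClusterServiceVersion until OLM reports the `Succeeded` phase. The function name `wait_for_csv`, the retry count, and the 2 second interval are assumptions for illustration; the actual olm/test may be structured differently.

```shell
# Poll a ClusterServiceVersion until it reaches the Succeeded phase.
# Arguments: namespace, CSV name, number of retries (default 150, 2 seconds apart).
wait_for_csv() {
    csv_namespace=$1
    csv_name=$2
    csv_retries=${3:-150}
    csv_phase=""
    while [ "$csv_retries" -gt 0 ]; do
        # The CSV may not exist yet right after creating the Subscription,
        # so a failing lookup is treated as "not ready" and retried.
        csv_phase=$(kubectl get csv "$csv_name" -n "$csv_namespace" \
            -o jsonpath='{.status.phase}' 2>/dev/null) || true
        [ "$csv_phase" = "Succeeded" ] && return 0
        csv_retries=$((csv_retries - 1))
        sleep 2
    done
    echo "csv $csv_name in $csv_namespace did not succeed (last phase: ${csv_phase:-unknown})" >&2
    return 1
}
```

The test would create the Subscription for the example operator first, then call something like `wait_for_csv operators <csv-name>`, where the namespace and CSV name depend on the operator chosen.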

RamenDR / ramen

Ramen catalog fails to report healthy in drenv, potentially due to olm installation differences #745