Not able to deploy Xgboost model using Xgboost Prepackaged Inference Server through Seldon Operator

altruistcoder commented 2 years ago

Describe the bug

Hello,

I have been working with Seldon for quite some time and have been able to deploy multiple models using the different pre-packaged inference servers provided by Seldon. For the past two days, however, I have been facing a problem with Seldon deployments on my OpenShift cluster. I am trying to create a SeldonDeployment instance through the Seldon Operator to deploy an XGBoost model using the XGBoost pre-packaged inference server, but I get one of the two errors below every time I try to create the object:

Error "failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.seldon-operator.svc:443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": EOF" for field "undefined".

Error "failed calling webhook "v1alpha2.vseldondeployment.kb.io": Post "https://seldon-webhook-service.seldon-operator.svc:443/validate-machinelearning-seldon-io-v1alpha2-seldondeployment?timeout=30s": no endpoints available for service "seldon-webhook-service"" for field "undefined".

I have also observed that when I try to create this object, the seldon-controller-manager pod goes into the OOMKilled state and then restarts by itself. However, I cannot see any identifiable errors in the pod's logs.
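For readers hitting the same symptom, one way to confirm the OOMKilled restarts is to inspect the pod's container statuses. The following is a minimal sketch, assuming the pod JSON was saved from something like `oc get pod <pod-name> -n seldon-operator -o json`. The sample status and the container name `manager` are hypothetical and only exercise the function.

```python
import json


def oom_killed_containers(pod: dict) -> list:
    """Return names of containers whose current or last termination reason is OOMKilled."""
    names = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # A container that was killed and restarted reports the kill under "lastState";
        # one that is still down reports it under "state".
        for key in ("state", "lastState"):
            terminated = status.get(key, {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                names.append(status["name"])
                break
    return names


# Hypothetical, trimmed-down pod object (not taken from the reporter's cluster).
sample_pod = json.loads("""
{
  "status": {
    "containerStatuses": [
      {
        "name": "manager",
        "restartCount": 3,
        "state": {"running": {}},
        "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}
      }
    ]
  }
}
""")

print(oom_killed_containers(sample_pod))
```

If the controller container shows up in the result, the restarts are memory-related rather than caused by an error that would appear in the logs.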

Also, I am able to deploy models using the TensorFlow and SKLearn pre-packaged inference servers.

Can you please help me resolve this issue as soon as possible? I am not able to deploy my required models because of it.

To reproduce

Deploy any XGBoost model using the XGBoost pre-packaged inference server.
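The reporter's manifest is not shown in the thread. As an illustration only, a minimal SeldonDeployment for the XGBoost server might look like the sketch below, which builds the object in Python and prints it as JSON. The deployment name, the predictor and graph names, and the `gs://seldon-models/xgboost/iris` model URI are assumptions chosen for the example, and the `v1alpha2` API version is picked to match the webhook named in the errors above.

```python
import json


def xgboost_seldon_deployment(name: str, model_uri: str) -> dict:
    """Build a minimal SeldonDeployment that uses the XGBoost pre-packaged server."""
    return {
        "apiVersion": "machinelearning.seldon.io/v1alpha2",
        "kind": "SeldonDeployment",
        "metadata": {"name": name},
        "spec": {
            "name": name,
            "predictors": [
                {
                    "name": "default",
                    "replicas": 1,
                    "graph": {
                        "name": "classifier",
                        # Selects the XGBoost pre-packaged inference server.
                        "implementation": "XGBOOST_SERVER",
                        "modelUri": model_uri,
                        "children": [],
                    },
                }
            ],
        },
    }


# Illustrative values only; substitute your own name and model location.
manifest = xgboost_seldon_deployment("xgboost-iris", "gs://seldon-models/xgboost/iris")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `oc apply -f -` in the target namespace to attempt the deployment.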

Expected behaviour

The XGBoost model should deploy successfully in the OpenShift cluster.

Environment

Cloud Provider: Openshift
Openshift Cluster Version: 4.8.36

ukclivecox commented 2 years ago

You may want to look at updating the resources for the controller: see https://github.com/SeldonIO/seldon-core/blob/7a94dbfe354cef2ada1b6a563f7acf66328463df/helm-charts/seldon-core-operator/values.yaml#L81-L85

altruistcoder commented 2 years ago

Hello @cliveseldon,

Updating the resources does solve the problem, but I am having difficulty making this change permanent.

The controller is installed using the OpenShift Seldon Operator, not Helm, so I am not sure how to increase its resources permanently. If I try to change the seldon-controller-manager Deployment to increase the resources, it shows me this error:

So, can you please tell me how I can make this change permanent by changing it directly in the operator itself?

ukclivecox commented 2 years ago

Can you make a change to the resources in the OpenShift operator for Seldon? @RafalSkolasinski

RafalSkolasinski commented 2 years ago

@altruistcoder On OpenShift you should be able to edit the CSV of the Seldon Operator directly. I believe this is the section you are after: https://github.com/redhat-openshift-ecosystem/certified-operators/blob/main/operators/seldon-operator-certified/1.14.1/manifests/seldon-operator-certified.clusterserviceversion.yaml#L544-L550
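As a rough sketch of what such an edit could look like when applied as a JSON patch instead of by hand: the snippet below builds patch operations against the controller's memory settings under the standard CSV path `spec.install.spec.deployments[].spec.template.spec.containers[].resources`. The deployment and container indexes (0, 0) and the `1Gi` / `512Mi` values are assumptions for illustration, so check them against your own CSV first.

```python
import json


def build_resource_patch(deployment_index: int, container_index: int,
                         memory_limit: str, memory_request: str) -> list:
    """Build JSON Patch operations that replace a CSV container's memory limit and request."""
    base = (
        f"/spec/install/spec/deployments/{deployment_index}"
        f"/spec/template/spec/containers/{container_index}/resources"
    )
    # "replace" assumes both fields already exist in the CSV.
    return [
        {"op": "replace", "path": f"{base}/limits/memory", "value": memory_limit},
        {"op": "replace", "path": f"{base}/requests/memory", "value": memory_request},
    ]


# Illustrative indexes and sizes only; they are not taken from the thread.
patch = build_resource_patch(0, 0, "1Gi", "512Mi")
print(json.dumps(patch))
```

The printed patch could then be passed to `oc patch csv <csv-name> -n <namespace> --type=json -p '<patch>'`, where `<csv-name>` and `<namespace>` are placeholders for your installation.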

altruistcoder commented 2 years ago

@cliveseldon @RafalSkolasinski Yes, I can see this configuration in the CSV of the Seldon Operator in my cluster, at lines 681-687.

But we have many models that are already deployed using the Seldon Operator and are used frequently across multiple namespaces in our cluster. Will this change affect any existing or running model deployments? Will it affect any webhooks or change any other default setting in a way that might cause an issue?

If yes, what can I do to avoid that? If no, can you please confirm that, since the Operator was installed cluster-wide, I should make the change to the CSV in the "seldon-operator" namespace only?

ukclivecox commented 2 years ago

This should not affect the models, but as said, you should test on your dev cluster to confirm.
