canonical / seldon-core-operator

Seldon Core Operator
Apache License 2.0

Finalise work done for Seldon Metrics Discovery during Observability Workshop #68

Open i-chvets opened 1 year ago

i-chvets commented 1 year ago

Finalise work done for Seldon Metrics Discovery during Observability Workshop

Work items are tracked in https://warthogs.atlassian.net/browse/KF-829
Branch: https://github.com/canonical/seldon-core-operator/tree/kf-829-gh68-feat-metrics-discovery
Prometheus deployment: https://github.com/canonical/prometheus-k8s-operator

Design

Failure alerts are implemented through integration with the Prometheus charm from the Canonical Observability Stack (COS). For metrics provided by models, targets can change from model to model and from deployment to deployment, so the Metrics Endpoint Observer provided by COS is integrated. Updates to targets are handled by the Metrics Endpoint Observer and relayed to Prometheus by the Seldon Core Operator charm.
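To make the "targets change per model" point concrete, here is a minimal sketch of how the current set of model endpoints could be turned into a scrape job list. The function name `build_scrape_jobs` is hypothetical, and the job shape (`metrics_path` plus `static_configs` with `targets`) is assumed to follow the usual Prometheus static configuration layout; how the charm actually hands this list to the COS library is not shown here.

```python
from typing import Dict, Iterable, List


def build_scrape_jobs(targets: Iterable[str], metrics_path: str = "/prometheus") -> List[Dict]:
    """Build one scrape job covering all currently known model targets.

    `targets` are "host:port" strings. Duplicates are dropped and the
    result is sorted so that an unchanged target set yields an identical
    job list, which makes it easy to detect real changes.
    """
    unique_targets = sorted(set(targets))
    if not unique_targets:
        # No deployed models: nothing for Prometheus to scrape.
        return []
    return [
        {
            "metrics_path": metrics_path,
            "static_configs": [{"targets": unique_targets}],
        }
    ]


# Example using the pod IP and metrics port from the testing notes.
jobs = build_scrape_jobs(["10.1.59.82:6000"])
print(jobs)
```

Each time the observer reports a change, the charm would rebuild this list and pass the new version on to Prometheus.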

Testing

App                        Version  Status  Scale  Charm           Channel  Rev  Address         Exposed  Message
prometheus-k8s             2.33.5   active      1  prometheus-k8s  stable    79  10.152.183.239  no
seldon-controller-manager           active      1  seldon-core                0  10.152.183.182  no

Unit                         Workload  Agent  Address     Ports  Message
prometheus-k8s/0             active    idle   10.1.59.80
seldon-controller-manager/0  active    idle   10.1.59.79


Deploy model with custom metrics:

microk8s.kubectl -n test apply -f examples/echo-metrics-v1.yaml

Get IP address of model classifier and use it for prediction request:

microk8s.kubectl -n test get svc | grep echo-metrics-default-classifier
echo-metrics-default-classifier ClusterIP 10.152.183.34 9000/TCP,9500/TCP 25m

Request prediction using IP address of model classifier:

for i in $(seq 1 10); do sleep 0.1 && \
  curl -v -s -H "Content-Type: application/json" \
  -d '{"data": {"ndarray":[[1.0, 2.0, 5.0]]}}' \
  http://<classifier-IP>:9000/predict > /dev/null ; \
done
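The same request loop can be sketched in Python using only the standard library, which may be easier to reuse in an integration test. The function names are hypothetical. The host is whatever ClusterIP the previous step returned, and the payload and port are taken from the curl command above.

```python
import json
import time
import urllib.request


def build_predict_request(host: str, port: int = 9000) -> urllib.request.Request:
    """Build the POST request that the curl command above sends."""
    payload = {"data": {"ndarray": [[1.0, 2.0, 5.0]]}}
    return urllib.request.Request(
        url=f"http://{host}:{port}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_predictions(host: str, count: int = 10, delay: float = 0.1) -> list:
    """Send `count` prediction requests and return the HTTP status codes."""
    statuses = []
    for _ in range(count):
        time.sleep(delay)
        with urllib.request.urlopen(build_predict_request(host), timeout=5) as response:
            statuses.append(response.status)
    return statuses
```

Calling `send_predictions("10.152.183.34")` from a machine that can reach the cluster network would mirror the ten curl requests.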


Metrics are available at pod's IP address and Prometheus port:

microk8s.kubectl -n test describe pod echo-metrics-default-0-classifier-5bf6cf86cd-r7c8l
IP: 10.1.59.82
PREDICTIVE_UNIT_METRICS_SERVICE_PORT: 6000
PREDICTIVE_UNIT_METRICS_ENDPOINT: /prometheus

curl http://10.1.59.82:6000/prometheus
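To check that the endpoint really exposes the expected custom metrics, the response body can be parsed. The following is a rough sketch for the Prometheus text exposition format, assuming samples carry no trailing timestamp. The function name `parse_metrics` and the metric names in `SAMPLE` are made up for illustration and are not taken from the echo-metrics example.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Map each sample (metric name plus labels) to its value.

    Comment lines (# HELP, # TYPE) and blank lines are skipped.
    Assumes there is no timestamp after the value.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space: label values may contain spaces,
        # the numeric value never does.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical response body, for illustration only.
SAMPLE = """\
# HELP example_counter_total Hypothetical counter
# TYPE example_counter_total counter
example_counter_total{model_name="classifier"} 10.0
example_gauge 100.0
"""

for name, value in parse_metrics(SAMPLE).items():
    print(name, value)
```

An empty result from the real endpoint would point at the model container rather than at the Prometheus integration.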



- Navigate to Prometheus dashboard `https://<Prometheus-unit-IP>:9090`, select **Status** -> **Targets**

i-chvets commented 1 year ago

Currently blocked by COS: some changes to Prometheus are required.

i-chvets commented 1 year ago

Dec 9, 2022 debug session notes: I am hitting a juju tools issue: juju-run is not installed in my pod. Logs from /var/log/discovery.log:

FileNotFoundError: [Errno 2] No such file or directory: '/var/lib/juju/tools/unit-seldon-controller-manager-0/juju-run'

I guess, my observer is not even starting because of this. And this is on the pod:

# ls -la /var/lib/juju/tools/unit-seldon-controller-manager-0/juju-run
ls: cannot access '/var/lib/juju/tools/unit-seldon-controller-manager-0/juju-run': No such file or directory

It does live on the pod, though, but in a different place:

# ls -la /usr/bin/juju-run 
lrwxrwxrwx 1 root root 25 Dec 13 19:44 /usr/bin/juju-run -> /charm/bin/containeragent
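One possible way around a hard-coded path is to look in the per-unit tools directory first and fall back to whatever `juju-run` is on `PATH`. This is only a sketch of that idea, not what the COS observer does. The function name `find_juju_run` is hypothetical, and the directory naming (`unit-<app>-<n>`) is inferred from the path in the log above.

```python
import os
import shutil
from typing import Optional


def find_juju_run(unit_name: str, tools_root: str = "/var/lib/juju/tools") -> Optional[str]:
    """Return a usable juju-run path for the given unit, or None.

    `unit_name` is the Juju unit name, e.g. "seldon-controller-manager/0".
    """
    # Juju names the tools directory after the unit, with "/" replaced by "-".
    unit_dir = "unit-" + unit_name.replace("/", "-")
    candidate = os.path.join(tools_root, unit_dir, "juju-run")
    if os.path.exists(candidate):
        return candidate
    # Fall back to PATH, which would find /usr/bin/juju-run on this pod.
    return shutil.which("juju-run")
```

On the pod above, `find_juju_run("seldon-controller-manager/0")` would be expected to fall through to the `PATH` lookup.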

DnPlas commented 1 year ago

@i-chvets I believe this is done, please confirm.

i-chvets commented 1 year ago

Merged https://github.com/canonical/seldon-core-operator/pull/94

i-chvets commented 1 year ago

Testing work is still required and is not yet complete. Following the steps in the description, no metrics could be retrieved, so more debugging is needed. Currently, the above setup fills up 32GB of RAM and swap space, and the whole setup becomes unresponsive.