SeldonIO / seldon-core

An MLOps framework to package, deploy, monitor and manage thousands of production machine learning models
https://www.seldon.io/tech/products/core/

Seldon Pipeline Inspect fails with error `Failed to resolve 'seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092'` #5176

Open stephaniegaspar opened 11 months ago

stephaniegaspar commented 11 months ago

We have installed Seldon Core v2 (version 2.6.0) on a k8s cluster. For Kafka, we followed the recommendation in the installation instructions and used the Strimzi Operator, with all default values from the strimzi-kafka-operator helm chart (version 0.35.1) and all default values from Seldon's seldon-core-v2-kafka helm chart (version 0.1.0). All control plane and data plane operations use the PLAINTEXT security protocol.

The goal of this issue is to find the best way to inspect the output data of each model within a Pipeline.

Describe the bug

We built the seldon cli image on our machine and created a custom configuration at /home/.config/seldon/cli with the following content:

{
  "kafka": {
    "bootstrap": "localhost:9092",
    "namespace": "seldon-mesh",
    "protocol": "PLAINTEXT",
    "sasl": {
      "username": "seldon",
      "password": ""
    }
  },
  "dataplane": {
    "inferHost": "localhost:9000"
  },
  "controlplane": {
    "schedulerHost": "localhost:9004"
  }
}

We're port forwarding the seldon-mesh, seldon-scheduler and seldon-kafka-bootstrap services to the local ports configured above.
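For anyone reproducing this setup, one quick sanity check is to confirm that each forwarded port accepts a TCP connection. The following is only a minimal sketch: the helper names `endpoints` and `is_reachable` are hypothetical, and the embedded config is a trimmed copy of the one above containing just the three address fields.

```python
import json
import socket

# Trimmed copy of the CLI config above; only the host:port fields matter here.
CONFIG = """
{
  "kafka": {"bootstrap": "localhost:9092"},
  "dataplane": {"inferHost": "localhost:9000"},
  "controlplane": {"schedulerHost": "localhost:9004"}
}
"""


def endpoints(config_text):
    """Return {label: (host, port)} for the three endpoints in the config."""
    cfg = json.loads(config_text)
    raw = {
        "kafka.bootstrap": cfg["kafka"]["bootstrap"],
        "dataplane.inferHost": cfg["dataplane"]["inferHost"],
        "controlplane.schedulerHost": cfg["controlplane"]["schedulerHost"],
    }
    result = {}
    for label, address in raw.items():
        # Split on the last colon so the port is always the final component.
        host, _, port = address.rpartition(":")
        result[label] = (host, int(port))
    return result


def is_reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    for label, (host, port) in endpoints(CONFIG).items():
        status = "open" if is_reachable(host, port) else "unreachable"
        print(f"{label}: {host}:{port} {status}")
```

Note that an open port on the Kafka entry only shows that the bootstrap forward is up. It says nothing about whether the individual brokers can be reached from the same machine.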

We followed the example described here.

We can run an inference through the seldon cli, but when we run the command `seldon pipeline inspect tfsimples`, the following error appears on the console:

%3|1703095841.652|FAIL|rdkafka#consumer-1| [thrd:seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092/0]: seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092/0: Failed to resolve 'seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092': nodename nor servname provided, or not known (after 2ms in state CONNECT)
%3|1703095841.652|FAIL|rdkafka#consumer-1| [thrd:GroupCoordinator]: GroupCoordinator: seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092: Failed to resolve 'seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092': nodename nor servname provided, or not known (after 2ms in state CONNECT)
%4|1703095841.652|OFFSET|rdkafka#consumer-1| [thrd:main]: seldon.seldon-mesh.model.tfsimple1.inputs [0]: offset reset (at offset TAIL(1) (leader epoch -1), broker 0) to offset END (leader epoch -1): failed to query logical offset: Local: Host resolution failure
Error: seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092/0: Failed to resolve 'seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092': nodename nor servname provided, or not known (after 2ms in state CONNECT)

We debugged the seldon cli code and found that the error is returned on this line.

Related bug encountered with the same error: https://github.com/SeldonIO/seldon-core/issues/4776.

Expected behaviour

We expected the command to return the pipeline's kafka topics, together with the output data held in those topics' partitions, in JSON format.

Environment

lc525 commented 8 months ago

Just a note here: the kafka bootstrap server, which you are pointing to, returns a list of hostnames pointing to the actual kafka brokers. Because those are not visible/accessible from where you are running the seldon cli, the cli cannot connect to them and the error you've described pops up.

In other words, it's not sufficient to simply expose the kafka bootstrap server via a port forward.
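To illustrate the point, here is a minimal sketch that checks which broker addresses can be resolved from the machine running the cli. The helper names `split_host_port` and `unresolvable` are hypothetical, and the `resolve` parameter exists only so the lookup can be swapped out for testing; by default it uses the standard library's `socket.getaddrinfo`.

```python
import socket


def split_host_port(address):
    """Split 'host:port' on the last colon and return (host, port)."""
    host, _, port = address.rpartition(":")
    return host, int(port)


def unresolvable(addresses, resolve=socket.getaddrinfo):
    """Return the addresses whose hostname cannot be resolved locally."""
    failed = []
    for address in addresses:
        host, port = split_host_port(address)
        try:
            resolve(host, port)
        except socket.gaierror:
            # Same class of failure that librdkafka reports as
            # "Failed to resolve ...".
            failed.append(address)
    return failed
```

Calling `unresolvable(["localhost:9092", "seldon-kafka-0.seldon-kafka-brokers.seldon-mesh.svc:9092"])` from outside the cluster would be expected to report the second address, which is the broker hostname from the error output above. That name is only resolvable through the cluster's own DNS.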

I don't think this is a bug per se, but we'll take it as an improvement request: the `seldon pipeline inspect` command should work better with a kafka install running inside a k8s cluster.