apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Solr Prometheus Export - unable to connect to live Solr cloud #620

Closed brickpattern closed 9 months ago

brickpattern commented 9 months ago

Issue: the /admin/ping endpoint gets a 404.

Setup:
Solr Operator: v0.7.1
ZK: inbuilt
Solr Cloud: v9.2.1
All modules run in the same namespace.
Solr and the pods are all mapped to a specific K8S node via podOptions (Tolerations).

The Solr Prometheus Exporter was deployed after the above setup was already working independently.

Startup Error:


INFO  - 2023-09-12 16:03:01.295; org.apache.solr.common.cloud.ConnectionManager; Client is connected to ZooKeeper
WARN  - 2023-09-12 16:03:01.295; org.apache.solr.common.cloud.SolrZkClient; Using default ZkACLProvider. DefaultZkACLProvider is not secure, it creates 'OPEN_ACL_UNSAFE' ACLs to Zookeeper nodes
INFO  - 2023-09-12 16:03:01.305; org.apache.solr.common.cloud.ZkStateReader; Updated live nodes from ZooKeeper... (0) -> (3)
INFO  - 2023-09-12 16:03:01.325; org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider; Cluster at solr-solrcloud-zookeeper-0.solr-solrcloud-zookeeper-headless.solr.svc.cluster.local:2181,solr-solrcloud-zookeeper-1.solr-solrcloud-zookeeper-headless.solr.svc.cluster.local:2181,solr-solrcloud-zookeeper-2.solr-solrcloud-zookeeper-headless.solr.svc.cluster.local:2181 ready
INFO  - 2023-09-12 16:03:01.337; org.apache.solr.prometheus.exporter.SolrExporter; Starting Solr Prometheus Exporting on port 8080
INFO  - 2023-09-12 16:03:01.338; org.apache.solr.prometheus.collector.SchedulerMetricsCollector; Beginning metrics collection
INFO  - 2023-09-12 16:03:01.354; org.apache.solr.prometheus.exporter.SolrExporter; Solr Prometheus Exporter is running. Collecting metrics for cluster f0bf8c7193: Solr Cloud ZK: solr-solrcloud-zookeeper-0.solr-solrcloud-zookeeper-headless.solr.svc.cluster.local:2181,solr-solrcloud-zookeeper-1.solr-solrcloud-zookeeper-headless.solr.svc.cluster.local:2181,solr-solrcloud-zookeeper-2.solr-solrcloud-zookeeper-headless.solr.svc.cluster.local:2181/
INFO  - 2023-09-12 16:03:01.500; org.apache.solr.client.solrj.impl.CloudSolrClient; request was not communication error it seems
INFO  - 2023-09-12 16:03:01.500; org.apache.solr.client.solrj.impl.CloudSolrClient; Request to collection [my.collection] failed due to (500) org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: null

request: GET, retry=0 maxRetries=5 commError=false errorCode=500 
ERROR - 2023-09-12 16:03:01.500; org.apache.solr.prometheus.scraper.SolrScraper; failed to request: /admin/ping => org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
    at org.apache.solr.client.solrj.impl.LBSolrClient$ServerIterator.nextOrError(LBSolrClient.java:220)
org.apache.solr.client.solrj.SolrServerException: No live SolrServers available to handle this request
    at org.apache.solr.client.solrj.impl.LBSolrClient$ServerIterator.nextOrError(LBSolrClient.java:220) ~[solr-solrj-9.2.1.jar:9.2.1 a4c64ab6a2a270ca69c28c706dabb2927ed8a7c2 - jsweeney - 2023-04-24 11:35:31]
...
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at null: null

Further down in the error list I can see:

org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: Error from server at http://solr-solrcloud-0.solr:80/solr: null
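When reading an exporter log like the one above, it can help to pull out which Solr base URLs the failures actually point at. The following is a minimal sketch, not part of the exporter: the function name `failing_servers` is hypothetical, and it assumes the log lines contain the phrase "Error from server at <url>: " exactly as in the excerpts quoted above. Entries where the server is reported as `null` are skipped because they carry no address.

```python
import re

# Matches the "Error from server at <url>: " fragment seen in the log excerpts.
_SERVER_PATTERN = re.compile(r"Error from server at (\S+?): ")


def failing_servers(log_lines):
    """Return the distinct, non-null server URLs named in RemoteSolrException lines."""
    servers = []
    for line in log_lines:
        match = _SERVER_PATTERN.search(line)
        if match is None:
            continue
        url = match.group(1)
        # "null" means the client did not report which server failed.
        if url != "null" and url not in servers:
            servers.append(url)
    return servers


sample = [
    "Caused by: org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: "
    "Error from server at null: null",
    "org.apache.solr.client.solrj.impl.BaseHttpSolrClient$RemoteSolrException: "
    "Error from server at http://solr-solrcloud-0.solr:80/solr: null",
]
print(failing_servers(sample))
```

With the two lines quoted in this issue, only `http://solr-solrcloud-0.solr:80/solr` is reported, which suggests the exporter did resolve a Solr node address from ZooKeeper and the failure happened on the request to that node.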

Questions:
[ ] Is the /admin/ping request hitting the Solr Cloud instance (and not the ZK instance)? I don't see a domain prefix.
[ ] Is the /admin/ping endpoint request independent of http://solr-solrcloud-0.solr:80/solr ?
[ ] Why is it attempting to hit http://solr-solrcloud-0.solr:80/solr, which happens to be a collections endpoint?

brickpattern commented 9 months ago

Additional information:

From within the Solr Prometheus Exporter pod and from within the Solr-SolrCloud-0 pod, I am able to hit http://solr-solrcloud-0.solr:80/solr/admin/metrics, so connectivity is fine. Only the endpoint seems to be the issue.

For example:

solr@solr-prom-exporter-solr-metrics-5547f945c8-6vx67:/tmp$  curl http://solr-solrcloud-0.solr/solr/admin  
<p>
  Searching for Solr?<br/>
  You must type the correct path.<br/>
  Solr will respond.
</p>
solr@solr-prom-exporter-solr-metrics-5547f945c8-6vx67:/tmp$  curl http://solr-solrcloud-0.solr/solr/admin/metrics
{
  "responseHeader":{
    "status":0,
    "QTime":8},
  "metrics":{
    "solr.jetty":{
      "org.eclipse.jetty.server.handler.DefaultHandler.1xx-responses":{
        "count":0,
        "meanRate":0.0,
        "1minRate":0.0,
        "5minRate":0.0,
        "15minRate":0.0},
      "org.eclipse.jetty.server.handler.DefaultHandler.2xx-responses":{
        "count":183632,
        "meanRate":0.16544911773059542,
...

HoustonPutman commented 9 months ago

The issue is ping, which is a request handler in Solr: /solr/admin/ping. Have you set up your Solr cloud such that it doesn't have a ping handler?

That error could just mean that Solr isn't available (errors, garbage collection, etc).

Also I would recommend trying with Solr 9.3, which is less likely to have a bug.
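To make the relationship between the relative path in the log (`/admin/ping`) and the Solr base URL concrete, here is a minimal sketch of how such a URL could be assembled. It assumes the ping handler is reached under a collection path, which is consistent with the log line "Request to collection [my.collection] failed" and with the ping documentation linked later in this thread. The helper name `ping_url` is hypothetical and is not the exporter's own code.

```python
from urllib.parse import quote


def ping_url(base_url: str, collection: str) -> str:
    """Join a Solr base URL, a collection name, and the /admin/ping handler path."""
    # Assumption: the handler path is relative to the collection, not to the node.
    return f"{base_url.rstrip('/')}/{quote(collection, safe='')}/admin/ping"


# Base URL and collection name are the ones that appear in the log above.
print(ping_url("http://solr-solrcloud-0.solr:80/solr", "my.collection"))
# prints http://solr-solrcloud-0.solr:80/solr/my.collection/admin/ping
```

Under that assumption, the request is not independent of the base URL: the exporter would take the node address it learned from ZooKeeper and append the collection and handler path to it.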

brickpattern commented 9 months ago

I'm on Solr version 9.2.1. How do I enable the Solr admin ping handler? Yes, I'm mostly running default settings, with changes to sizing and capacity only.
The deployment was done through values.yaml, referring to the standard helm repository.

The only reference to "admin" was under zk:

    # Customize the ZK services
    adminServerService: {}
    clientService: {}
    headlessService: {}

HoustonPutman commented 9 months ago

This isn't related to ZK, and it should be enabled by default. So either it was a bug that was fixed in Solr 9.3 or something caused your Solr node to become unavailable. Do you get the error intermittently or every time the prometheus exporter scrapes solr?

brickpattern commented 9 months ago

From within the SolrCloud pod:

/opt/solr-9.2.1$     curl http://localhost:8983/solr/admin/info 
{
  "responseHeader":{
    "status":404,
    "QTime":0},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"No handler by name info available names are [system, threads, logging, health, properties]",
    "code":404}}
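The 404 body above already lists which handlers exist at the node-level `/solr/admin/` path. A small sketch for extracting that list from such a response follows; the function name `available_handlers` is hypothetical, and the parsing assumes the message keeps the "available names are [...]" wording shown above.

```python
import json
import re


def available_handlers(response_text: str) -> list:
    """Extract the handler names listed in a Solr 'No handler by name' error body."""
    body = json.loads(response_text)
    msg = body.get("error", {}).get("msg", "")
    match = re.search(r"available names are \[([^\]]*)\]", msg)
    if match is None:
        return []
    return [name.strip() for name in match.group(1).split(",") if name.strip()]


# Trimmed copy of the response shown above.
sample = json.dumps({
    "responseHeader": {"status": 404, "QTime": 0},
    "error": {
        "msg": "No handler by name info available names are "
               "[system, threads, logging, health, properties]",
        "code": 404,
    },
})
names = available_handlers(sample)
print(names)
print("ping" in names)
```

For the response in this comment, `ping` is not among the listed names, so a 404 on a node-level `/solr/admin/ping` or `/solr/admin/info` request does not by itself show that the collection-level ping handler is missing.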

brickpattern commented 9 months ago

> This isn't related to ZK, and it should be enabled by default. So either it was a bug that was fixed in Solr 9.3 or something caused your Solr node to become unavailable. Do you get the error intermittently or every time the prometheus exporter scrapes solr?

Yes, the error occurs every time: the exporter hits those endpoints continuously and each request fails.

brickpattern commented 9 months ago

I came across this: https://solr.apache.org/guide/8_3/ping.html

Assuming this is the same endpoint the Prometheus Exporter is trying to hit, I tested it and got a 503 from the SolrCloud-0 instance. So the endpoint is available:

{
  "responseHeader":{
    "zkConnected":true,
    "status":500,
    "QTime":0,
    "params":{
      "q":"solrpingquery",
      "distrib":"false",
      "distribute":"true",
      "qt":"search",
      "rows":"10",
      "echoParams":"all",
      "rid":"solr-solrcloud-0.solr-4573"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"Ping query caused exception: no field name specified in query and no default specified via 'df' param",
    "trace":"org.apache.solr.common.SolrException: Ping query caused exception: no field name specified in query and no default specified via 'df' param
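A response like the one above is different from a 404: the handler answered, but the ping query itself failed. A minimal sketch for telling these cases apart is shown here. The function name `classify_ping` and the category labels are hypothetical, and the check relies on the "Ping query caused exception" wording from the response above.

```python
import json


def classify_ping(response_text: str) -> str:
    """Roughly classify a Solr ping response body by its status and error message."""
    body = json.loads(response_text)
    status = body.get("responseHeader", {}).get("status")
    msg = body.get("error", {}).get("msg", "")
    if status == 0:
        return "ok"
    if status == 404:
        return "handler-missing"
    if "Ping query caused exception" in msg:
        # The handler exists and ran, but the underlying query failed.
        return "handler-present-query-failed"
    return "other-error"


# Trimmed copy of the response shown above (the trace is omitted).
sample = json.dumps({
    "responseHeader": {"zkConnected": True, "status": 500, "QTime": 0},
    "error": {
        "msg": "Ping query caused exception: no field name specified in query "
               "and no default specified via 'df' param",
    },
})
print(classify_ping(sample))
```

For the body quoted in this comment the result is `handler-present-query-failed`, which matches the observation that the endpoint is available even though the request does not succeed.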

brickpattern commented 9 months ago

Version 9.3 seems to have fixed it for the Prometheus Exporter module. I am no longer seeing the error.