babenkoivan / elastic-scout-driver

Elasticsearch driver for Laravel Scout
MIT License

Random connection issue #67

Open jasonfish568 opened 1 week ago

jasonfish568 commented 1 week ago
Software Version
PHP 8.1.8
Elasticsearch 8.12.1
Laravel 9.52.16
Laravel Scout 9.8.1

Describe the bug Thanks for providing this package. However, I encountered an error that I don't know how to solve. I hope you can point me in the right direction.

My app keeps getting this error randomly. I have an instance deployed on Elastic.co.

No alive nodes. All the 1 nodes seem to be down.

Initially I thought the instance was too small, so I upgraded, but I'm still seeing this error randomly. Sometimes it happens during a search operation. I also thought it might be caused by syncing 600k records; I have a dedicated queue worker running the Scout sync jobs, but 600k records shouldn't cause such an error, otherwise ES would be a total failure compared to Algolia.

I reached out to Elastic.co's support. They mentioned:

Upon investigating the error message further, I found that it originates from the NodePool component of the client. As described in our official documentation here, the NodePool is responsible for maintaining the list of active nodes. While nodes are generally classified as either “alive” or “dead,” there are often gray areas such as “probably dead but not confirmed” or “timed-out but unclear why.” The NodePool manages these states and ensures the client’s requests are directed to available nodes.

When the NodePool cannot find an active node to connect to, it returns a NoNodeAvailableException, which is what we’re observing here.

The number of retries typically equals the number of nodes in the cluster. For example, if your cluster has 10 nodes, and 9 nodes fail, the request will execute on the 10th node. The first 9 are then marked as “dead” and will not be retried until their “dead” timers reset.

In your case, the error message continually shows No alive nodes. All 1 nodes seem to be down even though additional nodes were added to the cluster. This suggests a potential configuration issue with the NodePool component of the client.

After checking the code, I found out that:

return [
    'default' => env('ELASTIC_CONNECTION', 'default'),
    'connections' => [
        'default' => [
            'hosts' => [
                env('ELASTIC_HOST', 'localhost:9200'),
            ],
            'httpClientOptions' => [
                'timeout' => 2,
                'headers' => [
                    'Authorization' => 'ApiKey ' . env('ELASTIC_API_KEY'),
                ],
            ],
        ],
    ],
];

I only have one URL from Elastic.co, but my deployment runs in 2 zones, so there are more than 2 nodes. This ELASTIC_HOST only provides one node to the config. Could this be the reason? Still, I am very confused about why this error happens at all.
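For reference, the hosts option in this config is an array, so a client can be given several node URLs to fail over between. Elastic Cloud deployments normally expose a single load-balanced endpoint, so this may not apply here, but a hypothetical multi-host config could look like the sketch below (the ELASTIC_HOST_2 variable is illustrative, not part of this package):

```php
// Hypothetical sketch: listing several node URLs so the client's
// NodePool has more than one node to fall back on. The extra
// ELASTIC_HOST_2 env variable is an assumption for illustration.
return [
    'default' => env('ELASTIC_CONNECTION', 'default'),
    'connections' => [
        'default' => [
            'hosts' => [
                env('ELASTIC_HOST', 'localhost:9200'),
                env('ELASTIC_HOST_2', 'localhost:9201'),
            ],
        ],
    ],
];
```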

To Reproduce Unable to reproduce reliably. It usually happens when syncing 600k records.

babenkoivan commented 1 week ago

Hey @jasonfish568, this package uses elasticsearch-php under the hood and the client is initialized from the configuration hash as described here.

I can recommend you try a few things:

  1. Increase the number of retries.
  2. Use the ElasticsearchResurrect option for the node pool (see the documentation).

This means your config file would look similar to:

use Elastic\Transport\NodePool\Resurrect\ElasticsearchResurrect;
use Elastic\Transport\NodePool\Selector\RoundRobin;
use Elastic\Transport\NodePool\SimpleNodePool;

return [
    'default' => env('ELASTIC_CONNECTION', 'default'),
    'connections' => [
        'default' => [
            'hosts' => [
                env('ELASTIC_HOST', 'localhost:9200'),
            ],
            'httpClientOptions' => [
                'timeout' => 2,
                'headers' => [
                    'Authorization' => 'ApiKey ' . env('ELASTIC_API_KEY'),
                ],
            ],
            'retries' => 10,
            'nodePool' => new SimpleNodePool(new RoundRobin(), new ElasticsearchResurrect()),
        ],
    ],
];

Please note that you have complete control over the client creation: you can use the config file, or create your own client builder as described here.
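If you take the custom-builder route, the underlying elasticsearch-php client can also be constructed directly. A minimal sketch, assuming elasticsearch-php 8.x, with the host URL and API key as placeholder values:

```php
<?php

use Elastic\Elasticsearch\ClientBuilder;

// Minimal sketch of building the underlying elasticsearch-php 8.x
// client by hand. Host URL and API key are placeholders.
$client = ClientBuilder::create()
    ->setHosts(['https://my-deployment.es.example.cloud:9243'])
    ->setApiKey('<your-api-key>')
    ->setRetries(10) // retry the request on another node before failing
    ->build();
```

This bypasses the config hash entirely, which can be useful if the options you need aren't exposed through the configuration file.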

I hope this helps 🙂

jasonfish568 commented 1 week ago

Thanks so much for the information. Let me investigate further. I think the resurrect option might be the solution.