Improve configuration of connection to Elasticsearch

smarsching commented 6 months ago

This PR introduces two major improvements:

It is now possible to use multiple URLs instead of a separately configured single hostname and port. This has two benefits: First, multiple hosts can be specified, enabling high-availability when using a multi-host Elasticsearch cluster. Second, the scheme can be specified, thus enabling the user to choose HTTPS instead of HTTP when connecting to the Elasticsearch cluster.
Credentials for connecting to the Elasticsearch cluster can be specified. This is required when security is enabled on the ES cluster. The supported methods are username / password, API key, and token-based authentication.

As a side effect, this PR also introduces three minor improvements:

The unused clusterName variable is removed from ElasticConfig.
The way how the esInitialized flag is used in ElasticConfig is changed, fixing a potential race condition.
Comments in application.properties that apparently were accidentally copied from the Elasticsearch server configuration file and were not relevant to the ChannelFinder service are removed.

smarsching commented 6 months ago

The issues that are reported by SonarCloud seem to be false positives. I cannot see anything that is wrong with the use of properties in the marked code and it works as expected.

jacomago commented 6 months ago

@smarsching You're right. Sonarcloud doesn't seem to be good at evaluating those expressions. I'll try turning them off.

For the integration tests, you might need to update the docker-compose files to get them inline with your changes. And/or the application.properties file in the test resources folder.

smarsching commented 6 months ago

@jacomago I just took a look at the logs from the integration tests, and to me it does not look like the test failure is related to my changes. I specifically designed the changes, so that configuration through the old properties is still possible (so that users do not have to adapt their working configuration when upgrading).

To me, it rather looks like a problem with the Elasticsearch service in the test environment. For example:

Search failed for {testProperty=[*]} Cause [es/search] failed: [search_phase_execution_exception] all shards failed”)

If you like, I can open a dummy PR without any real changes in order to confirm whether the integration tests go through for the unchanged code or not.

jacomago commented 6 months ago

Another question I have is why do you want to do load balancing on Channelfinder, rather than using an inbetween load balancer? I can only think of doing this if you are using elastic for multiple applications, and then the setup of a load balancer would have the benefit of affecting every application. We currently have no performance problems with Channelfinder, so I'm not quite sure the use of multiple elastic nodes.

If you don't want to use a load balancer. I would rather you split up the adding authentication and adding multiple elastic nodes into multiple commits anyway.

In addition I think you should update the docker compose files and test properties to use the authentication, so we know it works in the tests.

smarsching commented 6 months ago

Another question I have is why do you want to do load balancing on Channelfinder, rather than using an inbetween load balancer? I can only think of doing this if you are using elastic for multiple applications, and then the setup of a load balancer would have the benefit of affecting every application. We currently have no performance problems with Channelfinder, so I'm not quite sure the use of multiple elastic nodes.

This is not about performance or load-balancing but about high availability. Yes, we run multiple applications on this ES cluster, but adding a load balancer in front of the cluster would make this load balancer the single point of failure, so there would be nothing gained from having a multi-node ES cluster.

Of course we could add a fail-over setup for the load-balancer, but this would render the whole setup even more complex. In general, we have a policy of implementing failover in the client, where reasonably possible, because it reduces complexity and makes it easier to find the underlying issue when a connection fails.

As the ES client library already supports this kind of configuration, just using it seemed like the easiest approach.

If you don't want to use a load balancer. I would rather you split up the adding authentication and adding multiple elastic nodes into multiple commits anyway.

I already implemented this in multiple commits, it’s just a common branch because I needed all these changes combined for our setup. Authentication is implemented in https://github.com/ChannelFinder/ChannelFinderService/pull/127/commits/0eef1ab136afba1456c6b0da3cc506c73d8b3ba6 and the switch to URLs / supporting multiple ES hosts is implemented in https://github.com/ChannelFinder/ChannelFinderService/pull/127/commits/a75093866e2ca0dec949c1659d44b0dfb2e397b4. As written earlier, the second change preserves the ability to configure the host and port through the existing properties, so it should not affect users who don’t need it in any way.

In addition I think you should update the docker compose files and test properties to use the authentication, so we know it works in the tests.

That’s a good idea. I will look into how authentication for ES can be enabled in the test environment.

smarsching commented 6 months ago

I have added the changes, so that the test setup should use authentication in 42ff81f2fda20b62bb4f4b5ae0eff2e6c8ecd0fe.

smarsching commented 6 months ago

@jacomago Thank you very much for your code review. I addressed the issues that you pointed out as far as I could with a reasonable amount of effort.

Regarding the authentication in the test environment, the only solution that I can see would be to write our own action for starting Elasticsearch instead of using https://github.com/ankane/setup-elasticsearch, but I do not think that just testing one small aspect (that authorization headers are added correctly) does not just the amount of work needed to completely replace that action with our own code, in particular from a maintenance perspective (updating the code for future versions of Elasticsearch).

jacomago commented 6 months ago

@smarsching Figured out why the tests are failing. It is the new code. Seems like HttpHost when built from just a string, uses the string as the hostname rather than as a uri. Solution is to set the variable elasticsearch.host_urls to be a List and use a conversion in getHttpHosts like so:

HttpHost[] localHttpHosts = this.httpHosts.stream().map(HttpHost::create).toArray(HttpHost[]::new);

smarsching commented 6 months ago

@jacomago Thanks for checking the code. You are right, I broke this when I changed the code to directly inject the property into an array of HttpHost objects.

I didn’t notice it because I didn’t update our ChannelFinder instance to use the newest version of the code after this change. Now, I have updated our ChannelFinder to use the latest version of this branch and it seems to be finde, so I am reasonably sure that this problem is resolved.

jacomago commented 6 months ago

@smarsching I think maybe the problem might have been with the version of elasticsearch. I've updated it on master, so if you rebase or merge in the changes then hopefully the CI tests will pass. I think the CI tests should be passable since other actions have passed recently.

smarsching commented 6 months ago

@jacomago I just rebased my changes on top of the current master branch, but the checks still seem to fail.

Interestingly, when I run the checks locally with act, they also fail for both the master and es-connection-improvements branch.

Looking at the log output, the cause of the issues seems to have changed though: It seems like most or all tests fail now, but they consistently fail with NoClassDefFound com/fasterxml/jackson/core/StreamReadConstraints. I will try to find the cause of this exception.

smarsching commented 6 months ago

@jacomago I managed to fix the problem with Jackson: The cause was that different versions of jackson-core (2.14.2) and jackson-databind (2.16.0) were used. This problem got introduced when I rebased my changes on the latest version of the master branch.

After fixing this issue in my branch, we are now back to the original situation, where only a few tests fail. I created a separate PR #130, that only fixes the Jackson issue. If the CI checks succeed for that PR, we know that some of the other changes I made must be the cause. If the CI fails for that PR as well, the problems must also be present in the master branch.

jacomago commented 6 months ago

After fixing this issue in my branch, we are now back to the original situation, where only a few tests fail.

Just to note that the tests that fail are the ones that rely on the elasticsearch setup by the action. The tests that passed create a new elastic docker container for each test (a bit complicated I know, plans are to merge the tests).

sonarcloud[bot] commented 6 months ago

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

smarsching commented 6 months ago

@jacomago Okay, so the cause for the problems were changes that I made as part of this PR, in particular the change that I made to the initialization logic.

I changed:

        esInitialized.set(!Boolean.parseBoolean(createIndices));
        if (esInitialized.compareAndSet(false, true)) {
            config.elasticIndexValidation(client);
        }

to

         if (Boolean.parseBoolean(createIndices) && esInitialized.compareAndSet(false, true)) {
             config.elasticIndexValidation(client);
         }

The difference is that my version only runs config.elasticIndexValidation(client) once, while the original version runs it every time createClient(…) is callled. It seems that for some reason, calling this method mulitple times is needed, at least in the test context.

Under this circumstances, having the esInitialized flag does not make any sense, so now I made changes to remove it completely:

        if (Boolean.parseBoolean(createIndices)) {
            config.elasticIndexValidation(client);
        }

This way, it at least is clear that this method might be called more than once.

With these checks now succeeding, are there any other issues that need to be addressed before this PR can be merged?

jacomago commented 6 months ago

Awesome! I will merge it in.

smarsching commented 6 months ago

@jacomago Thanks for your help in figuring this issue out.

ChannelFinder / ChannelFinderService

Improve configuration of connection to Elasticsearch #127

Quality Gate passed