Hi @Mehdi-Bendriss, I will go over your points one by one.
- Replication factor: We should never set the replication factor to 1-all - this is unnatural for non system indices and may cause the disk of all units to overflow quickly. ... If we want to be safer, we should look for indices whose number_of_replicas < 1 and set it to 1. Not more.
Indeed, I corrected that right after our conversation earlier this week in this commit.
Although I agree with the removal of `-all`, thinking of real-world deployments, we always go with at least 3x nodes. Other operators, like the SQL DB ones, will always deploy 1x main + 2x in-sync replicas in the field. We remove the `-all`, but we could provide a similar experience and set it to 2, as we are in the HA type of deployment, hence at least 3 nodes will be present.
Replication factors are set by the user.
True, but our specification, DA031 - Profile config option, states that for both `staging` and `production` "we will be providing a highly-available and scalable service to be used in production". Now, that is a bit open to interpretation in my view: "highly-available" in terms of OpenSearch's own services (i.e. only the service indices should be set up for HA by our charm) OR in terms of the user's usage of OpenSearch as well (i.e. even indices created by the user).
That is why I am trying to put together a minimum "index template" that at least assures a user that indices will be replicated, unless the user explicitly states otherwise.
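For illustration, a minimal sketch of such a template via the REST API (component templates are introduced below; the template name, endpoint and credentials are hypothetical placeholders):

```python
import requests

# Sketch: a "minimum" component template that only pins the replica count,
# so users can still override it per index or in their own index templates.
resp = requests.put(
    "https://localhost:9200/_component_template/charm-replication-defaults",
    json={"template": {"settings": {"index": {"number_of_replicas": 1}}}},
    auth=("admin", "admin"),  # placeholder credentials
    verify=False,  # assuming a self-signed test deployment
)
resp.raise_for_status()
```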
- Codec used:
TBH, I am not entirely set on each of the values we will discuss below. That is one of the reasons I have added them as component templates, so a user can build their own index templates on top.
I'd rather benchmark these for comparison first. Maybe getting that landed in this PR is too much. TL;DR I think we can play it safer as follows:
1) break the "component template" setup out into a separate branch, for now;
2) run performance tests until we are okay with which values to suggest;
3) if we are okay with the results, report them and update our documentation first (a user can then follow it and create their own indices / index templates);
4) eventually, come back to the branch from step (1) and merge it.
WDYT?
I will start with a later question:
which one to choose?
The codec selection came from these results.
Why did you choose zstd_no_dict instead of zstd
In the case of `zstd_no_dict`, from the results above, we would sacrifice 5% of compression for +7% (net) p90 latency and +7% (net) throughput when compared with `zstd`.
or instead qat_lz4 (in supported systems) or others?
Happy to set QAT if we get the logic to detect and enable it. We are not yet there.
How?
First, we should keep in mind some types of indices cannot simply work with all the codecs (e.g. the vector indices).
As you also noticed, I am not really on board with these results, as there are other parameters to be set. In this case, the "How?" will have to be answered the same way we did the AVX testing: by running performance tests and comparing results.
It may also be needed to set a compression algorithm with each? which one to choose? how?
Yes, I also have this concern. What I think we should do here is run these component templates against benchmarks and document their results.
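To make concrete what would go through those benchmarks, a hedged sketch of a codec component template (name, endpoint and credentials are placeholders; note that `index.codec` is a static setting, so it only affects indices created after the template exists):

```python
import requests

# Hypothetical component template pinning the codec under discussion.
requests.put(
    "https://localhost:9200/_component_template/charm-codec-defaults",
    json={"template": {"settings": {"index": {"codec": "zstd_no_dict"}}}},
    auth=("admin", "admin"),  # placeholder credentials
    verify=False,  # assuming a self-signed test deployment
).raise_for_status()
```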
Thanks, that is a really good point. I will set a hard limit of up to 32G. Indeed, having the GC go over 100s of gigs does not sound good :)
- Units conversion:
So, some thoughts here: (1) indeed, sticking with kB, as long as the JVM can accept "32G" expressed in kB, is a good idea and would simplify a lot; and (2) I was discussing this with the Big Data team, who could benefit from this logic here. The idea is to eventually move it to data_platform_helpers. From what I gathered, they are currently setting the JVM heap to a hard-coded number.
The main complexity of this PR revolves around unit conversions; two things to note:
- /proc/meminfo always returns units in kB.
That is true, as the kernel code shows.
- Any reason not to use psutil? This should save you the parsing of /proc/meminfo.
Yes, two reasons: (1) the parsing goes from L589 to L598... I prefer 10 LoC that process a file whose format is set in stone by the kernel over adding a new dependency (one that is not shipped with stock Ubuntu anyway); and (2) it gives access to hugepages info as well, which we can potentially explore later, and it would be "just there".
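For reference, a sketch of roughly what that parsing amounts to (not the exact code from the PR):

```python
def meminfo_kb() -> dict[str, int]:
    """Parse /proc/meminfo into {field: value}; all sized fields are in kB."""
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            # e.g. "MemTotal:       16314276 kB" or "HugePages_Total: 0"
            key, rest = line.split(":", 1)
            values[key.strip()] = int(rest.split()[0])
    return values
```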
JVM Xms and Xmx properties accept g|G, m|M, or k|K, meaning the smallest unit is k. With this in mind, it makes sense to normalize the values to the smallest unit supported by both, which is the kB, and work with it exclusively (when reading, calculating, or writing to file).
Yes, I was going all the way down to bytes and back, but indeed it makes sense to stop at kB instead and handle all of this in kB.
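Putting the two points together (working exclusively in kB, plus the 32G hard cap mentioned above), a sketch of what the heap computation could look like; it reuses the `meminfo_kb()` helper sketched earlier, and the 50% fraction is a hypothetical profile choice, not necessarily what the PR uses:

```python
MAX_HEAP_KB = 32 * 1024 * 1024  # the 32G hard cap, expressed in kB

def heap_jvm_flags(fraction: float = 0.5) -> list[str]:
    """Compute -Xms/-Xmx as min(fraction of total RAM, 32G), in kB."""
    heap_kb = min(int(meminfo_kb()["MemTotal"] * fraction), MAX_HEAP_KB)
    # Xms/Xmx accept k|K as their smallest unit, so 32G becomes "33554432k".
    return [f"-Xms{heap_kb}k", f"-Xmx{heap_kb}k"]
```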
Hey @phvalguima, I've tested the PR for large deployments. It works when creating the deployment and also when updating the profile for the main-orchestrator. As commented in the code, updating the profile for non-main-orchestrators does not currently work.
Is this restriction supposed to stay? If so, I think we should surface it to the user (e.g. with an error message in the log) when applying the config change on non-main-orchestrators; otherwise we would have a config mismatch and no feedback as to why.
This PR extends the current charm to support performance profiles following spec DA031 and adds support for the following profile settings:
- heap size: max(1G, 10% of RAM)
- indices.memory.index_buffer_size: extends to 25%
- heap size: max(1G, 50% of RAM)
The options above are set based on the following documents:
- https://opensearch.org/docs/latest/tuning-your-cluster/performance/
- https://opensearch.org/docs/latest/search-plugins/knn/performance-tuning/
The user can switch between the three options above, and depending on the selected value, the templates are created or destroyed.
UPDATE
One important question about this PR is whether index templates with '*' will apply to system indices. The first part of the answer: index templates are only applied at index creation, as shown here. We can even delete an index template after indices have been created from it.
Manual configuration (e.g. setting `0-all`) will always take precedence. There is an exception for hidden (not necessarily system) indices: "catch-all" templates are not applied to hidden indices.
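To make the precedence concrete, a hypothetical catch-all template followed by a user override (names, endpoint and credentials are placeholders):

```python
import requests

BASE = "https://localhost:9200"
AUTH = ("admin", "admin")  # placeholder credentials

# Catch-all index template: applied only at index creation, never
# retroactively, and (per the exception above) skipped for hidden indices.
requests.put(
    f"{BASE}/_index_template/charm-defaults",
    json={
        "index_patterns": ["*"],
        "priority": 0,  # lowest priority, easy for user templates to win
        "template": {"settings": {"index": {"number_of_replicas": 1}}},
    },
    auth=AUTH,
    verify=False,  # assuming a self-signed test deployment
).raise_for_status()

# Manual configuration at creation time still takes precedence, e.g.:
requests.put(
    f"{BASE}/my-index",
    json={"settings": {"index": {"auto_expand_replicas": "0-all"}}},
    auth=AUTH,
    verify=False,
).raise_for_status()
```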