apache / solr-operator

Official Kubernetes operator for Apache Solr
https://solr.apache.org/operator
Apache License 2.0

Role Assignment for SOLR Replicas #618

Open ozlerhakan opened 9 months ago

ozlerhakan commented 9 months ago

Hi Team!

I was wondering if there is a possibility to assign different roles to replicas, each of which is essentially a Solr node, using Solr's Node Roles feature?

HoustonPutman commented 9 months ago

This is a very good question, but ultimately it would be a fairly big change to the way that the operator works.

Currently the Solr Operator uses a single StatefulSet to manage the pods for each SolrCloud instance. The statefulSet has a single pod spec, so pods are meant to be identical. That means we can't give different envVars (to set the node role) to different pods.

One option would be to have the Solr Operator manage a SolrCloud across multiple StatefulSets, so that each StatefulSet could have a different podSpec. This is definitely something that could be done, but it would be a fairly big undertaking. If you describe your use case, and how you imagine you would want to configure the node roles, then it's something we could think about for v0.9.0 or v1.0.0.

janhoy commented 9 months ago

In SIP-14 we discuss adding a zookeeper node-role. It would then make sense to start a separate StatefulSet for the zk pods: they may need a different role, different RAM, disk, and CPU limits, etc. The same could be true for other node-roles such as "data", "coordinator", and future roles. So perhaps the operator could have a way to configure optional groups of pods, each with a different combination of node-roles, and each getting its own StatefulSet. You could then have a group of pods optimized for querying (pull replicas), etc.
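
One way such node-role groups could look in the CRD, purely as a sketch (the `nodeGroups` field and its children below are illustrative and do not exist in the SolrCloud CRD today):

```yaml
# Hypothetical sketch only: nodeGroups is NOT a real SolrCloud CRD field.
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: example
spec:
  nodeGroups:                 # each group would map to its own StatefulSet
    - name: data
      replicas: 3
      roles: "data:on,coordinator:off"
      podOptions:
        resources:
          requests:
            memory: 8Gi       # data nodes sized for heavy storage/RAM
    - name: coordinator
      replicas: 2
      roles: "data:off,coordinator:on"
      podOptions:
        resources:
          requests:
            memory: 2Gi       # coordinator nodes can be lighter
```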

ozlerhakan commented 9 months ago

Thank you @janhoy and @HoustonPutman for your interest in this question.

The primary idea is to create a SolrCloud environment where we maintain a large dataset across 3+ distributed Solr instances with the backup-restore feature at hand. This mechanism is experimental and we are conducting extensive tests on it. Additionally, I want to experiment with another scenario where shard leaders and replicas reside in separate, isolated instances within the cluster. In this setup, shard leaders are responsible for updates and backup-restore tasks, while replicas handle heavy user requests. However, I'm still unsure whether this is a best practice and feasible within the SolrCloud concept.

Returning to my initial question: although my understanding of how node role assignment works is not solid at the moment, I was expecting a Solr API that would let us assign new roles to running instances in the cluster at runtime, since there is already a Roles API that allows us to list some node-level details at the cluster level. It seems that adding a different startup flag to each Solr instance using the Solr Operator is not a trivial task, as Houston mentioned.
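
For reference, the startup flag being discussed is the `solr.node.roles` system property, set per node at startup (the role names here follow the examples used later in this thread), e.g. via `SOLR_OPTS`:

```
SOLR_OPTS="$SOLR_OPTS -Dsolr.node.roles=data:on,coordinator:off"
```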

janhoy commented 9 months ago

I believe the role assignment is a startup property on purpose, designed to be a more static feature of a node that you don't want messed with at runtime. It would also be much more complex to, say, remove the "data" role from a node in-flight; would it then instantly evict all its cores, etc.?

ozlerhakan commented 9 months ago

designed to be a more static feature of a node that you don't want to be messed with at runtime.

Why would someone mess things up if they have the right API calls to arrange their cluster's node roles before populating any data and releasing it to production? Apart from this, one could also mess things up with some of the existing Solr APIs. For instance, the Restore API lets us restore a backup even if a collection with the same name already exists in the cluster. My point is, depending on the approach, you might need to do some pre-checks before messing up the whole cluster.

would it then instantly evict all its cores, etc?

So what happens if I restart a data:on Solr instance that holds data, but is now data:off?

janhoy commented 9 months ago

There was a fairly thorough discussion on the design here: https://lists.apache.org/thread/39q48fbtcxcl9btg24twj9bwowlro16d; perhaps you'll find the answer to that particular decision there.

All in all, would be great to have support in the operator!

ozlerhakan commented 9 months ago

Indeed, it looks like a long, back-and-forth discussion, and I also noticed that some of you shared my opinion. Well, at this point it seems you have already reached a conclusion about this SIP; what is done is done. However, it would be excellent if you considered another option for defining an instance's role besides envVars in the future.

Regarding the main question, we'll likely have to rely on the operator to support specific envVar declarations in the solrOpts field within the SolrCloud CRD spec. Perhaps the operator could introduce a way to target parameters at individual pods using a key built from `<metadata.name>` + `<solrcloud-kind-name>` + `<instance-number>`, so that we can assign specific flags to individual instances. For instance, if `metadata.name: test` and the cluster has 2 replicas/instances, they could be defined as:

- `test-solrcloud-0: -Dsolr.node.roles=coordinator:on,data:off`
- `test-solrcloud-1: -Dsolr.node.roles=coordinator:off,data:on`

This is just my first impression, and it could also be done another way.
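
If such per-pod overrides existed, the CRD usage might look like the following sketch (the `perPodSolrOpts` field is purely illustrative; no such field exists in the SolrCloud CRD):

```yaml
# Purely illustrative: perPodSolrOpts is NOT a real SolrCloud CRD field.
apiVersion: solr.apache.org/v1beta1
kind: SolrCloud
metadata:
  name: test
spec:
  replicas: 2
  perPodSolrOpts:
    test-solrcloud-0: "-Dsolr.node.roles=coordinator:on,data:off"
    test-solrcloud-1: "-Dsolr.node.roles=coordinator:off,data:on"
```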

ozlerhakan commented 9 months ago

When do you think you can consider this enhancement to be part of one of the next releases @HoustonPutman ?

HoustonPutman commented 8 months ago

Maybe for v0.9.0. Using the roles API would make it more feasible. We already do some defaulting when Pods are created (though without API calls). So it is feasible to add another defaulting method that works with a readinessCondition.

Basically when a pod is started, the nodeRoleInitialized readinessCondition will be empty. The operator would have code to say, while the Solr container is ready and the nodeRoleInitialized readinessCondition is empty, call the roles API. When the API call is successful, set the nodeRoleInitialized readinessCondition to true. That way we have a stateful and idempotent way of handling the API calls.
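
The flow described above could be sketched roughly as follows. This is not actual solr-operator code; the types and function names are invented for illustration, and the roles API call is simulated:

```go
package main

import "fmt"

// Pod is a stand-in for the operator's view of a Solr pod; the
// NodeRoleInitialized field plays the role of the readinessCondition.
type Pod struct {
	Name                string
	SolrContainerReady  bool
	NodeRoleInitialized bool
}

// callRolesAPI simulates the roles API call; in the real operator this
// would be an HTTP request to the Solr node. Here it succeeds whenever
// the container is ready.
func callRolesAPI(pod *Pod) bool {
	return pod.SolrContainerReady
}

// reconcileNodeRole is the idempotent piece: it only calls the roles API
// while the Solr container is ready AND the condition is still unset,
// then records success so repeat reconciles are no-ops.
func reconcileNodeRole(pod *Pod) {
	if !pod.SolrContainerReady || pod.NodeRoleInitialized {
		return // already initialized, or container not yet ready
	}
	if callRolesAPI(pod) {
		pod.NodeRoleInitialized = true
	}
}

func main() {
	pod := &Pod{Name: "test-solrcloud-0"}

	reconcileNodeRole(pod) // container not ready yet: no-op
	fmt.Println(pod.NodeRoleInitialized)

	pod.SolrContainerReady = true
	reconcileNodeRole(pod) // first successful call flips the condition
	reconcileNodeRole(pod) // further reconciles do nothing
	fmt.Println(pod.NodeRoleInitialized)
}
```

The point of tracking the condition explicitly is that reconcile loops run repeatedly; gating the API call on an unset condition makes the whole sequence stateful and idempotent, as described above.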

How do you imagine this would be integrated into the SolrCloud CRD?

ozlerhakan commented 8 months ago

Sorry that I totally missed your message. I'm unsure whether this is applicable from an implementation perspective. Would it be possible to introduce another handy CRD for custom Solr settings that interacts with the SolrCloud CRD at pod startup?