VictoriaMetrics / operator

Kubernetes operator for Victoria Metrics

Multiple statefulsets for vmstorage nodes on vmcluster #741

Open mohammadkhavari opened 1 year ago

mohammadkhavari commented 1 year ago

Currently the operator creates one StatefulSet for all vmstorage nodes and scales it accordingly. This approach works well until something bad happens to one of the storage nodes. For example:

- It is not possible to turn off one of the vmstorage nodes (for restore purposes).
- We cannot keep the corrupted vmstorage for diagnosis while adding another one to keep the cluster running.
- We cannot change resource limits separately. This is especially needed during incidents for temporary adjustments.

Moreover, we have seen that some storage nodes behave differently in our cluster; this can happen because of different computing infrastructure or maybe a flaw in load balancing of shard ingestion or search loads.

We propose that the operator deploy each storage node as a separate StatefulSet. In this way maintenance would be simpler, and there would also be a way to set different resource configurations on the VMCluster CRD object. You might have concerns about the simplicity; we can think of a more flexible representation of vmcluster. If the operator can support different nodepools in defining a vmcluster, it can stay simple for new users while allowing more advanced designs. For example:

vmstorage:
  nodepools:
    - name: pool1
      replicaCount: 2
      resources: {}
      tolerations: []
      nodeSelector: {}
    - name: pool2
      replicaCount: 1
      resources: {}
      tolerations: []
      nodeSelector: {}
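
For comparison, today the VMCluster spec only allows one shared configuration for all vmstorage replicas, roughly like this (values are illustrative; field names as in the current vmstorage section of the CRD, to the best of my knowledge):

vmstorage:
  replicaCount: 3            # every replica gets exactly the same spec
  resources:
    limits:
      cpu: "2"
      memory: 4Gi
  storage:
    volumeClaimTemplate:
      spec:
        resources:
          requests:
            storage: 100Gi   # disk size is also shared by all replicas

With the nodepool form, each pool could carry its own resources, tolerations and nodeSelector, while the current behavior stays available by defining a single pool.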
Haleygo commented 1 year ago

We propose that the operator deploy each storage node as a separate StatefulSet. In this way maintenance would be simpler, and there would also be a way to set different resource configurations on the VMCluster CRD object.

To be clear, the goal you want to achieve here: let vminsert/vmselect under the same vmcluster use all the vmstorage nodepools at once, right?

You might have concerns about the simplicity

Yeah, it does sound like it complicates things. To me, it sounds like merging multiple vmcluster definitions into one, and I'm not sure it's worth it.

It is not possible to turn off one of the vmstorage nodes (for restore purposes)

If it's for a temporary restore job, you can stop the operator pod and make vminsert stop writing to that storage pod by removing it from -storageNode.
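
Roughly, with the operator paused you can edit the generated vminsert Deployment by hand and drop that node from the args; something like this excerpt (addresses, port and flag layout are only placeholders here, the real values come from your cluster):

containers:
  - name: vminsert
    args:
      # keep every vmstorage address except the one under restore
      - --storageNode=vmstorage-example-0.vmstorage-example.default.svc:8400
      - --storageNode=vmstorage-example-1.vmstorage-example.default.svc:8400
      # vmstorage-example-2 removed temporarily while the restore runs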

We cannot keep the corrupted vmstorage for diagnosis while adding another one to keep the cluster running.

A StatefulSet should do the trick if you just increase vmstorage.replicaCount.

We cannot change resource limits separately. This is especially needed during incidents for temporary adjustments.

That's true, but I think that's the point of replicas: to have multiple pods with the same configuration. You never know in advance which one vminsert or vmselect is going to hit when they sit behind one service, so we need to make sure each of them has the same ability to handle requests.

Sin4wd commented 1 year ago

I support this issue too. StatefulSet does not support changing one pod's configuration despite its stateful nature, and this has been a long-running thread in k8s. It really does happen in production that you are forced to treat a stateful pod as a pet, and there is no workaround. Some other operators also support this kind of flexibility in their design, e.g. the Elasticsearch operator. It does not complicate things in my opinion, because the previous behavior could be achieved by defining just one nodepool, and the benefit is huge.

I think using a separate sts for each storage node should be the default behavior, but I'm not sure how one could tell the operator to turn one off, for example. We could use something like maintenanceInsertNodeIDs, but maybe this nodepool design is better and more declarative than a comma-separated string. Or should we have stopReconcileNodeIDs? What if we saw vmselect/vminsert/vmstorage nodes all as an abstract nodePool concept with different types? Would that be easier?

The first and most important use case for us is that we need to change one pod at a time for maintenance or during incidents. For example, it happens often that we need to turn off one instance and migrate it to another node in the cluster.

Incidents are less frequent than that, but I can give examples from my past experience maintaining vmclusters. There was a case where we had a corrupted disk and one storage node went down. We scaled up the cluster to have time for investigation; after the investigation we had no good way to scale down again, and we ended up having one storage node listed in both maintenanceInsertNodeIDs and maintenanceSelectNodeIDs. It did indeed complicate things, in that we had to shut down the cluster, migrate the last storage node to the missing one, and turn everything on again. This could have been a lot easier if we had the ability to treat the nodes differently.

In another case we had problems with rerouting, and we could have added cpu+memory in the short term to stabilize the vmcluster; however, we didn't have as much disk as the other replicas, and even though the extra disk was unnecessary at the moment, the design unfortunately wouldn't let us add capacity.

mohammadkhavari commented 1 year ago

@Haleygo

To be clear, the goal you want to achieve here: let vminsert/vmselect under the same vmcluster use all the vmstorage nodepools at once, right?

Yes, and we would activate and deactivate nodepools in the configuration, instead of using maintenanceInsertNodeIDs or changing vmselect's -storageNode.
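
For example, it could be a hypothetical per-pool flag in the proposed format (the disabled field is only an illustration of the idea, it doesn't exist anywhere today):

vmstorage:
  nodepools:
    - name: pool1
      replicaCount: 2
    - name: pool2
      replicaCount: 1
      disabled: true   # hypothetical: drop this pool from the -storageNode lists of vminsert/vmselect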

Yeah, it does sound like it complicates things. To me, it sounds like merging multiple vmcluster definitions into one, and I'm not sure it's worth it.

Actually, the simplicity here creates a lot of complexity during maintenance and incidents. In the case of scaling, we have to scale the attached disks of all nodes at the same time (vertically), or add a whole new storage node. Even turning off the operator and manipulating objects like the vminsert deployment in order to restore data is another case that complicates maintenance.

That's true, but I think that's the point of replicas: to have multiple pods with the same configuration. You never know in advance which one vminsert or vmselect is going to hit when they sit behind one service, so we need to make sure each of them has the same ability to handle requests.

As I mentioned, in addition to the prior issues, another problem is that some storage nodes behave differently. This can happen because of different computing infrastructure or maybe a flaw in load balancing of shard ingestion or search loads, and it would be much easier to handle this diversity by adjusting resources as we monitor each storage node's performance.

I think this configuration format would also simplify multi-level vmcluster setups in Kubernetes. Right now, in order to chain vminserts together, we have to manually override -storageNode via extraArgs, which replaces what the operator would set; with the proposed configuration we could add new vminserts as a nodepool (with another format). Behind the nodepool idea there's a need to easily attach and detach storage nodes with different configurations. Is there any better way to implement this?
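
For reference, the manual override today goes through extraArgs on vminsert, roughly like this (addresses and port are placeholders; extraArgs simply replaces the -storageNode value the operator would otherwise generate):

vminsert:
  extraArgs:
    # placeholder addresses of the upper-level targets; port depends on the setup
    storageNode: "upper-vminsert-0.example:8400,upper-vminsert-1.example:8400"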

Amper commented 1 year ago

Hey @mohammadkhavari.

Let's discuss your key points; I'm trying to understand more precisely which specific problems you want to solve.

It is not possible to turn off one of the vmstorage nodes (for restore purposes)

It's possible with maintenanceInsertNodeIDs and maintenanceSelectNodeIDs. Why is this method not appropriate?

We cannot keep the corrupted vmstorage for diagnosis while adding another one to keep the cluster running.

Why not? You can add the corrupted node to maintenanceInsertNodeIDs and maintenanceSelectNodeIDs and increase replicas.
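
Roughly like this (the node ID and replica count are just an example):

vmstorage:
  replicaCount: 4                # scaled up by one to add a replacement node
  maintenanceInsertNodeIDs: [1]  # vminsert stops writing to vmstorage-1
  maintenanceSelectNodeIDs: [1]  # vmselect stops querying vmstorage-1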

We cannot change resource limits separately. This is especially needed during incidents for temporary adjustments. Moreover, we have seen that some storage nodes behave differently in our cluster; this can happen because of different computing infrastructure or maybe a flaw in load balancing of shard ingestion or search loads.

Yes, you can't, and it's worth understanding why: the operator deliberately relies on plain StatefulSet replicas with an identical spec.

Yes, it probably reduces flexibility a bit, but that's the price for using understandable and predictable built-in k8s features.

We propose that the operator deploy each storage node as a separate StatefulSet. In this way maintenance would be simpler, and there would also be a way to set different resource configurations on the VMCluster CRD object. You might have concerns about the simplicity; we can think of a more flexible representation of vmcluster. If the operator can support different nodepools in defining a vmcluster, it can stay simple for new users while allowing more advanced designs.

I disagree; maintenance would be more complicated. It is more difficult to work with separate, unrelated StatefulSets than with a single one. Regarding different configurations for instances, I already answered above.

Amper commented 1 year ago

Hey @Sin4wd.

I support this issue too. StatefulSet does not support changing one pod's configuration despite its stateful nature, and this has been a long-running thread in k8s. It really does happen in production that you are forced to treat a stateful pod as a pet, and there is no workaround.

I think that it's a feature, not a disadvantage 🙂 I believe that such a feature, through misuse, would make vmstorages more fragile and prone to errors due to misconfiguration.

Some other operators also support this kind of flexibility in their design, e.g. the Elasticsearch operator.

I'm not familiar enough with it, but it's probably appropriate there because the nodes are unequal and can have different roles.

I think using a separate sts for each storage node should be the default behavior, but I'm not sure how one could tell the operator to turn one off, for example. We could use something like maintenanceInsertNodeIDs, but maybe this nodepool design is better and more declarative than a comma-separated string.

Yes, such a feature already exists and is used. Did I understand correctly that the problem is only in its format?

Or should we have stopReconcileNodeIDs? What if we saw vmselect/vminsert/vmstorage nodes all as an abstract nodePool concept with different types? Would that be easier?

Sorry, I didn't understand the idea behind stopReconcileNodeIDs, but the ability to disable reconciliation of one particular resource sounds like a nice feature to me for performing debugging and diagnostics.

The first and most important use case for us is that we need to change one pod at a time for maintenance or during incidents. For example, it happens often that we need to turn off one instance and migrate it to another node in the cluster.

Could you please provide more details about your actions to achieve it?

Incidents are less frequent than that, but I can give examples from my past experience maintaining vmclusters. There was a case where we had a corrupted disk and one storage node went down. We scaled up the cluster to have time for investigation; after the investigation we had no good way to scale down again, and we ended up having one storage node listed in both maintenanceInsertNodeIDs and maintenanceSelectNodeIDs. It did indeed complicate things, in that we had to shut down the cluster, migrate the last storage node to the missing one, and turn everything on again. This could have been a lot easier if we had the ability to treat the nodes differently.

Very cool that you give real-life scenarios, thanks! In this scenario all the actions sound correct, but I didn't understand what exactly you shut down the entire cluster for. Why couldn't all the actions be performed on running nodes added to maintenanceInsertNodeIDs and maintenanceSelectNodeIDs?

In another case we had problems with rerouting, and we could have added cpu+memory in the short term to stabilize the vmcluster; however, we didn't have as much disk as the other replicas, and even though the extra disk was unnecessary at the moment, the design unfortunately wouldn't let us add capacity.

Yeah, that does sound like a valid case. We're not sure yet whether it's worth building such a feature for this alone, but we'll discuss it again.

Sin4wd commented 1 year ago

Hey @Amper, thank you for your meticulously crafted response.

I think that it's a feature, not a disadvantage 🙂

I understand your point of view, but the need still exists. You can look at the Advanced StatefulSet of OpenKruise as an example of people trying to solve real-world problems that couldn't be addressed with a plain sts.

I'm not familiar enough with it, but it's probably appropriate there because the nodes are unequal and can have different roles.

Yes, that's the main idea. But VictoriaMetrics components are also different. Are you considering allowing vmselect/vmstorage/vminsert to be different CRDs that select each other with selectors? I can open an issue and explain why I think this really helps in practice. For example, one could attach two different sets of vmselect with different configurations to a set of existing vmstorages without the burden of configuring -storageNode manually.

I believe that such a feature, through misuse, would make vmstorages more fragile and prone to errors due to misconfiguration.

I understand that you used a StatefulSet on purpose, and I'm describing cases in which this approach limits our operations. If uniformity is your concern, I'm almost okay with that, as long as we can support stopReconcileNodeIDs alongside having a separate sts per node (I will explain my migration case below). In this way you can prevent people from specifying a different spec for each node in the operator, yet still allow operational flexibility.

About the migration case, the steps are roughly as follows. Think of a case where we have vmstorage-0-0, vmstorage-1-0, vmstorage-2-0 and we are trying to migrate vmstorage-0-0 to another host.

(Notice that some of these steps cannot be done when a pod is using the pvc/pv)

what exactly you shut down the entire cluster for? Why couldn't all the actions be performed on running nodes added to maintenanceInsertNodeIDs and maintenanceSelectNodeIDs?

We couldn't turn off only one. It would be much easier to have something like skipOrdinal, or stopReconcileNodeIDs plus a separate sts per node, to ignore and turn off the corrupted node.
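
Just to illustrate the idea, it could look something like this (both fields are hypothetical, neither exists in the CRD today):

vmstorage:
  replicaCount: 3
  stopReconcileNodeIDs: [0]  # hypothetical: operator stops reconciling the sts of vmstorage-0
  skipOrdinal: [0]           # hypothetical: vminsert/vmselect skip vmstorage-0 entirely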

the ability to disable reconciliation of one particular resource sounds like a nice feature to me for performing debugging and diagnostics

Yes, that's a great feature to have. My workaround is turning off the operator and removing its finalizers every now and then during operations!


So far so good; only one issue remains in this case. I have seen actual cases of identical vmstorage replicas, on identical infrastructure and in the same cluster, consuming different CPU percentages, and we couldn't figure out why. I couldn't reproduce it in our test environment, and I also couldn't separate the misbehaving node from the cluster for investigation. That keeps me from opening an issue about it, because it is really hard to test in production when all nodes are generated from the same sts and can't be isolated. I agree that treating vmstorages as replicas is a good decision, but their supposed similarity turns out to be wrong in practice, and we are stuck in a no-win situation in those cases. At least we could have a workaround option on the operator in the meantime, to help figure out the underlying vmstorage issue.

Sin4wd commented 5 months ago

We are still facing whole-cluster shutdowns when migrating a single node, simply because we cannot shut down just one storage at a time. Can we forget about nodepools and just create a separate sts for each storage node? I would be happy to contribute to this end.

Can you think of any other solution for this purpose?