apache / helix

Mirror of Apache Helix
Apache License 2.0

Ephemeral Instance Support #1888

Closed subprotocol closed 2 months ago

subprotocol commented 3 years ago

Feature request

I have a use case that involves ephemeral instances that come and go.

The issue is that on cleanup (say an instance dies), the instance information is retained in ZooKeeper under /CLUSTER/INSTANCES/instanceID.

Describe the solution you'd like

Would it be feasible to optionally allow instance znodes to be ephemeral? Instead of using helixAdmin.addInstance to create an instance, perhaps something like getZKHelixManager could also take an InstanceConfig for InstanceType.PARTICIPANT, and the manager would create the ephemeral znode. When the manager dies, the znode would be automatically cleaned up.
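
To make the idea concrete, here is a purely hypothetical sketch of what participant startup could look like under this proposal. The InstanceConfig-accepting getZKHelixManager overload shown below does not exist in Helix today, and the cluster name, host, and ZooKeeper address are placeholders.

// Hypothetical sketch only: this getZKHelixManager overload does not exist today.
InstanceConfig instanceConfig = new InstanceConfig("ephemeral-worker-1");
instanceConfig.setHostName("10.0.0.12");
instanceConfig.setPort("12000");

// Proposed behavior: the manager creates an ephemeral znode under /CLUSTER/INSTANCES/
// instead of requiring a prior helixAdmin.addInstance(...) call.
HelixManager manager = HelixManagerFactory.getZKHelixManager(
    "MyCluster", instanceConfig, InstanceType.PARTICIPANT, "zk-host:2181");
manager.connect();
// If the manager's session ends (e.g. the process dies), the ephemeral znode is removed automatically.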

I was curious whether anyone else thinks this makes sense. As I see things, this would be very useful for supporting clusters that have elastic instances.

Thanks

junkaixue commented 3 years ago

@subprotocol The znodes under the /CLUSTER/INSTANCES/ folder hold heavyweight metadata that is meant to persist across connections. Creating all of them for every connection would generate a huge amount of traffic, and the historical data for that instance would be lost across application-related operations such as a deployment or taking a host offline for repair. Examples of that historical data include the connection history and the last recorded current state.

Can you help us understand why you need this? We can think about other solutions to support your requirement.

subprotocol commented 3 years ago

Can you help us understand why you need this? We can think about other solutions to support your requirement.

Absolutely! I would like to run a portion of my Helix participants on spot instances. Spot instances allow one to cheaply scale processing when there is spare capacity. The downside is that spot instances come and go, and when exactly is beyond my control. Storage is disaggregated, so rebalancing helix resources is not expensive.

I am not sure how best to configure Helix for this model. If the host machines are fixed, each host can register itself as a Helix instance when it first starts up and use that registration as the participant.

When the hosts are not fixed, spot instances come and go, and I would end up with an ever-growing set of Helix instances in ZooKeeper.

I understand and share your concern about high traffic on that ZooKeeper path. Would a better strategy be to reuse offline Helix instances when a spot instance first starts up? Or is there a better approach that I'm not seeing?

Thank you @junkaixue!

jiajunwang commented 3 years ago

@subprotocol Thanks for the idea. If my memory is accurate, I discussed a similar design with my team before. There were some concerns regarding race conditions that made the implementation more complicated than we expected, so we ended up implementing HelixAdmin.purgeOfflineInstances for the immediate requests, which cleans up offline and expired instances. Unfortunately, it is not an automatic mechanism, so it won't fit your needs 100%. Moreover, "Please note that the purge function should only be called when there is no new instance joining happening in the cluster."

subprotocol commented 3 years ago

@jiajunwang thank you for the suggestion. I added a periodic check to the leading controller and this works great!
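
For reference, a rough sketch of what such a periodic check could look like, assuming a controller-side HelixManager and the existing purgeOfflineInstances(clusterName, offlineDuration) signature; the ZooKeeper address, intervals, and offline threshold are illustrative placeholders:

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.helix.HelixAdmin;
import org.apache.helix.HelixManager;
import org.apache.helix.manager.zk.ZKHelixAdmin;

public class OfflineInstancePurger {
  // Sketch: purge instances that have been offline for more than an hour, checked every 5 minutes.
  public static void schedule(HelixManager controllerManager, String clusterName) {
    HelixAdmin admin = new ZKHelixAdmin("zk-host:2181");
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    scheduler.scheduleAtFixedRate(() -> {
      // Only the leading controller runs the purge, to avoid concurrent purges.
      if (controllerManager.isLeader()) {
        admin.purgeOfflineInstances(clusterName, TimeUnit.HOURS.toMillis(1));
      }
    }, 5, 5, TimeUnit.MINUTES);
  }
}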

I see one possible improvement to purgeOfflineInstances. I am using helix instance tags to tag my spot instances. It would be nice if purgeOfflineInstances could take a set of tags, and only purge those tagged instances. This would allow me to purge the spot instances only, while also mitigating the race condition for other instances.
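
For context, tagging the spot participants is already possible with the existing HelixAdmin API; in a sketch like the following, the cluster, instance, and tag names are just placeholders:

// Tag a spot participant so a tag-aware purge could later target it.
HelixAdmin admin = new ZKHelixAdmin("zk-host:2181");
admin.addInstanceTag("MyCluster", "spot-worker-10-0-0-12_12000", "spot-instance");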

I would like to propose adding a second purgeOfflineInstances method as follows:

// proposed addition
void purgeOfflineInstances(String clusterName, long offlineDuration, Set<String> instanceTags);

Does this make sense—and would this functionality be welcome as a small contribution?

Thank you!

jiajunwang commented 3 years ago

I vote up this idea.

But it might be concerning if we keep introducing more parameters to the purge method, and there will be a backward compatibility concern as well. It might help to define a "criteria" class with the builder pattern so that the interface itself, and the user code that needs this API, are more future proof. Just my 2 cents, FYI.

subprotocol commented 3 years ago

@jiajunwang I agree, a purge instance criteria builder is a great idea here. Below is a rough sketch of the required change as I see things.

// Add criteria builder (API sketch; method bodies omitted)
public class HelixPurgeInstanceCriteriaBuilder {
  public HelixPurgeInstanceCriteriaBuilder withSecondsOffline(int seconds);
  public HelixPurgeInstanceCriteriaBuilder withTags(Set<String> tags);
  public HelixPurgeInstanceCriteria build();
}

// Add purge instance criteria
public class HelixPurgeInstanceCriteria {
  public int getSecondsOffline();
  public Set<String> getTags();
}

HelixAdmin:

// Add to HelixAdmin interface
void purgeOfflineInstances(String clusterName, HelixPurgeInstanceCriteria purgeCriteria);

ZkHelixAdmin:

// Override and implement in ZkHelixAdmin
void purgeOfflineInstances(String clusterName, HelixPurgeInstanceCriteria purgeCriteria);

// For backwards compatibility, keep the old purgeOfflineInstances(clusterName, offlineDuration) signature as-is and
// reimplement it to build a criteria object and delegate to purgeOfflineInstances(clusterName, purgeCriteria)
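
To illustrate the backwards-compatibility note, here is one possible shape for the delegating implementation; this is only a sketch, and the millisecond-to-second conversion follows the clarifying point below:

// Sketch: the existing long-based method builds a criteria object and delegates.
@Override
public void purgeOfflineInstances(String clusterName, long offlineDuration) {
  HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
      .withSecondsOffline((int) TimeUnit.MILLISECONDS.toSeconds(offlineDuration))
      .build();
  purgeOfflineInstances(clusterName, purgeCriteria);
}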

Example Usages:

// Purge instances that have been offline for at least 1 hour and tagged
// with both: "worker" and "spot-instance"
HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
  .withSecondsOffline(3600)
  .withTags(Set.of("worker", "spot-instance"))
  .build();

admin.purgeOfflineInstances("MyCluster", purgeCriteria);

// Purge instances that have been offline for 1 day
HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
  .withSecondsOffline(86400)
  .build();

admin.purgeOfflineInstances("MyCluster", purgeCriteria);

// Helix does not allow dropping instances that are online
HelixPurgeInstanceCriteria purgeCriteria = new HelixPurgeInstanceCriteriaBuilder()
  .build(); // Throws exception with message stating that withSecondsOffline is required

Clarifying Point: long offlineDuration is currently used by purgeOfflineInstances, and it is not clear from the current method signature or docs that it is in milliseconds. I thought withSecondsOffline would be clearer and more intuitive.

Drop vs. Purge: HelixAdmin currently has separate methods for dropping and for purging instances.

Though dropping a live instance is not supported now, in the future it could be. If so, dropInstances(clusterName, dropCriteria) may make more sense than purgeOfflineInstances(clusterName, purgeCriteria). In that world, if/when Helix supports dropping live instances, the builder would no longer throw an exception when withSecondsOffline isn't specified.

In my mind this feels more future proof. I wanted to float this idea out there. If this makes more sense I can refine the example code above.

I welcome any suggestions. Once aligned, would a PR be a sufficient starting point? Or should a JIRA be created to track this?

junkaixue commented 3 months ago

Since we support auto-registration of instances, we are planning to support auto-deregistration as well.