Netflix / hollow

Hollow is a java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high performance read-only access.
Apache License 2.0
1.2k stars 216 forks source link

HollowIncrementalProducer - Execution Stats [Proposal] #148

Open rpalcolea opened 6 years ago

rpalcolea commented 6 years ago

Hi @toolbear

While working with HollowIncrementalProducer, I found that it would be nice if the result of runCycle had more information than the version. I believe users could use some stats like recordsAddedOrModified and recordsRemoved

What are your thoughts on introducing a HollowIncrementalProducerExecution object that would look like this (or HollowIncrementalProducerCycleResult?):

class HollowIncrementalProducerExecution {
   long version
   long recordsAddedOrModified
   long recordsRemoved
   long timestamp
   Status status // SUCCESS, FAIL -> could be a boolean too
}

This would help users to set some metrics or alerting based on the behavior. While Validators are a potential candidate for this there are two blockers for this:

  1. The validators are triggered by Validator.validate which is a void, so in case you need metrics, you need to bake your metrics or alerting into the validator (not a problem of course).

  2. Validator only has access to the ReadState so the detail on added/modified gets lost, while you can determine how many were added or deleted based on cardinality, the modifications detail is not available or probably yes?, only the incremental producer could know the number of adds/modified and deletes on a easy fashion because it's responsible for modifying the mutations object.

  3. I don't think a Validator should be implemented just for getting metrics on the dataset.

Don't know if this is something that you could see as useful for Hollow users. We do expose a lot of metrics, we currently track how many objects we add/modify/delete on each cycle. While the validators are helpful to fail a cycle if we want to drop for example... 3% of our data, the metrics are helping us to visualize how our dataset evolves over time.

thoughts?

ghost commented 6 years ago

My first thought is that instead of changing the return value of runCycle is to use the listener pattern like we have in HollowProducer. Because we're not on Java 8 and don't have default methods on interfaces, we're kind of hamstrung on evolving HollowProducerListener until Hollow 3 (even though HollowProducer has the "Beta API" disclaimer, it's received wide enough usage at Netflix that we probably should remove that disclaimer and consider it released and public).

However, instead of using HollowProducerListener on the composed HollowProducer, there could be a separate HollowIncrementalProducerListener. We've also been discussing having targeted listeners, e.g. ValidatorsListener or some such, instead of having to shoehorn everything into HollowProducerListener. We anticipate that HollowProducerListener will survive into Hollow 3 for the highest level or the coarsest visibility, but for more detailed information we'd have fit-for-purpose listener interfaces.

With that context, maybe it isn't called HollowIncrementalProducerListener but something like IncrementListener or IncrementalCycleListener or InsertBetterNameHere. My only hard requirement is that we avoid the "delta" nomenclature as that gets confusing with also having the notion of state change deltas and delta blobs (eventually the "Delta-based Producer Input" will get retitled in the docs once we're ready to release the incremental producer).


access to the ReadState so the detail on added/modified gets lost, while you can determine how many were added or deleted based on cardinality, the modifications detail is not available or probably yes?

For each type in your data model, it's possible to reconstruct added, modified, and deleted from a ReadStateEngine (i.e. readState.getStateEngine()) using the methods getPreviousOrdinals() and getPopulatedOrdinals() in HollowTypeReadState. Ordinals in previous that aren't in populated are either deletes or updates; ordinals in populated that aren't in previous are either adds or updates. You rely on ghost records to disambiguate. An ordinal in previous that isn't in populated that shares a primary key with an ordinal in populated that isn't in previous is an update otherwise it's a delete; the remaining ordinals in populated that aren't shared in previous and aren't modifications are adds.

only the incremental producer could know the number of adds/modified and deletes on a easy fashion because it's responsible for modifying the mutations object.

Sort of. The mutations should be thought of as requests to change, but any one of those requests could be noops. An addOrModify of an object that is no different than what is already in the current state won't actually be an add nor an update. A delete of a PK that isn't in the data won't actually be a delete.

However, having visibility into both what was requested and what actually happened is useful and could be used to spot bugs or inefficiencies when requested != actual.

I don't think a Validator should be implemented just for getting metrics on the dataset.

That's a useful workaround, but I agree it would be better if we had a designed solution.

rpalcolea commented 6 years ago

Hi @toolbear ,

Thanks for the feedback. I like the idea of IncrementalCycleListener.

What's the main motivation of making the HollowIncrementalProducer something that contains a HollowProducer. As of today, if you want to use the HollowIncrementalProducer, you need to have an existing state. Have you ever discuss the pros/cons of making HollowIncrementalProducer just another type of producer which just has another way of populate data? or let HollowIncrementalProducer handle the first cycle?

I guess for HollowIncrementalProducer would be ok to introduce the builder pattern now? and trigger the listeners at the end of runCycle?

Also, do you envision HollowIncrementalCycleListener as part of ListenerSupport? I believe it should be ok, it could be something like listeners.fireIncrementalCycleComplete(IncrementalCycleStatus status) and listeners.fireIncrementalCycleFail(IncrementalCycleStatus status) where IncrementalCycleStatus:

    public class IncrementalCycleStatus {
        private final long version;
        private final Status status;
        long recordsAddedOrModified
       long recordsRemoved
   }

And this status could be passed to the listener for logging/metrics/etc.

At the end, there should be a HollowIncrementalCycleListener interface that extends EventListener. This interface would include fireIncrementalCycleComplete and fireIncrementalCycleFail methods.

IncrementalCycleStatus object could be introduced with proper success/fail builders.

HollowIncrementalProducer's runCycle method would trigger the listeners based on the cycle result... either fireIncrementalCycleFail or fireIncrementalCycleComplete passing a new IncrementalCycleStatus.

HollowIncrementalProducer would have a Builder pattern now to do something like:

HollowIncrementalProducer.withHollowProducer(myProducer)
.withThreadsPerCpu(2.0d) //for `SimultaneousExecutor`
.withIncrementalCycleListeners(mylisteners) // 

Any thoughts/feedback would be appreciate it

rpalcolea commented 6 years ago

@toolbear , Guess code would help more -> https://github.com/Netflix/hollow/pull/153