feast-dev / feast

The Open Source Feature Store for Machine Learning
https://feast.dev
Apache License 2.0

Coalesce FeatureRows for improved "latest" value consistency in serving stores #88

Closed tims closed 5 years ago

tims commented 5 years ago

Is your feature request related to a problem? Please describe. Now that we plan to store features in the serving store as latest-only instead of the full history, we need to add logic to ensure that writes of feature values from older rows don't clobber newer ones. This is especially a problem for batch, but we'd like to solve it generally for streaming as well.
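To make the problem concrete, here is a minimal sketch (illustrative names only, not Feast's actual API) of a serving store that applies writes in arrival order. A late-arriving older row overwrites the newer value:

```python
# Naive serving store: entity key -> (event_ts, value).
store = {}

def naive_write(entity_key, event_ts, value):
    # Last write wins by *arrival* order, ignoring event time.
    store[entity_key] = (event_ts, value)

naive_write("driver:42", 200, "new")
naive_write("driver:42", 100, "old")  # older row arrives late

print(store["driver:42"])  # → (100, 'old') -- the stale value clobbered the new one
```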

Describe the solution you'd like For batch, we will combine feature rows by key, where the key is entity name + entity key.
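The batch-side combine could look like the following sketch. The row shape and field names here are assumptions for illustration, not Feast's actual FeatureRow proto; the point is just grouping on (entity name, entity key) and keeping the row with the latest event timestamp:

```python
def coalesce_latest(rows):
    """Keep only the row with the latest event timestamp per
    (entity_name, entity_key) key."""
    latest = {}
    for row in rows:
        key = (row["entity_name"], row["entity_key"])
        if key not in latest or row["event_ts"] > latest[key]["event_ts"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"entity_name": "driver", "entity_key": "42", "event_ts": 100, "value": "old"},
    {"entity_name": "driver", "entity_key": "42", "event_ts": 200, "value": "new"},
    {"entity_name": "rider",  "entity_key": "7",  "event_ts": 50,  "value": "x"},
]
print(coalesce_latest(rows))  # one row per key; driver:42 keeps the event_ts=200 row
```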

For streaming, we will do the same, but we emit rows on a timer, keeping the previous value in state for future combines. We will evict state with a separate timeout timer, which makes very large, infrequently updated key spaces less demanding.
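A rough sketch of that streaming behaviour, with timer mechanics simulated via explicit processing timestamps (the real job would use the framework's state and timer APIs; all names here are hypothetical):

```python
class CoalesceState:
    """Hold the latest row per key in state; emit on a flush timer;
    evict keys that have been idle longer than evict_after."""

    def __init__(self, evict_after):
        self.evict_after = evict_after
        # key -> (event_ts, row, last_seen_processing_ts)
        self.state = {}

    def on_row(self, key, event_ts, row, now):
        held = self.state.get(key)
        if held is None or event_ts >= held[0]:
            self.state[key] = (event_ts, row, now)
        else:
            # Older than what we hold: keep the held row, refresh last-seen.
            self.state[key] = (held[0], held[1], now)

    def on_flush_timer(self, now):
        # Emit the current latest value per key, then evict idle keys.
        emitted = {k: v[1] for k, v in self.state.items()}
        self.state = {k: v for k, v in self.state.items()
                      if now - v[2] < self.evict_after}
        return emitted

s = CoalesceState(evict_after=10)
s.on_row("driver:42", 200, "new", now=0)
s.on_row("driver:42", 100, "old", now=1)   # late row does not clobber state
print(s.on_flush_timer(now=2))             # → {'driver:42': 'new'}
```

Eviction keeps long-tail keys cheap: a key untouched for longer than the timeout is dropped from state, at the cost of re-combining from scratch if it reappears.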

The feature rows should be stored in the warehouse store in the combined form, not as their raw separate rows, because we want it to reflect the rows as they are written to the serving stores. This ensures that the equivalent of a SELECT * AS OF query would reflect what was available in the serving stores.

We should also make sure that this feature can be turned off as an option in each import spec, as it could introduce performance hits in very high-throughput streams with large backlogs, and it is also not required for batch jobs that are known to have only one row per entity key.

Describe alternatives you've considered The alternative is to keep each feature row separately in the serving stores and combine them in the serving API (which is what we started with). This is not as performant as many use cases require, and we'd prefer to optimize for serving speed. We are willing to accept that we cannot guarantee true consistency with the above approach, since data already in the store may conflict with data being ingested by a new job. But we will assume that incoming feature data is eventually consistent and will overwrite any issues.

Additional context The PR I'm working on for this issue is WIP and will require #87 to be finished first, as we should be grouping on {entityName}+{entityKey}, not on event timestamp + granularity. It would be more work to handle the latest values in the stores separately, so it's best we wait.

tims commented 5 years ago

/assign tims

tims commented 5 years ago

Should we have this enabled or disabled by default?

zhilingc commented 5 years ago

I don't think you should coalesce rows for the warehouse store. The warehouse store should have the most complete dataset possible, for all timestamps, regardless of whether the feature rows were late or not.

pradithya commented 5 years ago

Should we have this enabled or disabled by default?

I think it should be ON by default, as it impacts the correctness of the ingestion result.

tims commented 5 years ago

Pushed change to PR so it defaults to true.

Also, @zhilingc the PR no longer coalesces rows written to the warehouse; they get written as-is.

tims commented 5 years ago

Closed with merge of Coalesce rows PR #89