airbnb / chronon

Chronon is a data platform for serving for AI/ML applications.
Apache License 2.0
673 stars 36 forks source link

Add disableErrorThrows to fetch group by - Ignore streaming rows or tiled IRs for decode failures #772

Closed pengyu-hou closed 2 weeks ago

pengyu-hou commented 1 month ago

Summary

Currently, the group bys are designed to be immutable. Therefore, when there is a group by schema change in place, the streaming rows or tiled IRs will have decode issues when the group by serving info is updated by the group by batch upload job.

This PR will add a param disableErrorThrows to handle the decoding errors. If disableErrorThrows=True, ignore the failed streaming rows or tiled IRs. Therefore, it won't fail loudly but with proper error log statement. For the streaming rows or tiled IRs that can be decoded successfully, it will be still included in the fetcher to construct the group by response.

If disableErrorThrows=false, and the default value is also false, it will work as is to fail loudly.

The long term goal is to enable group by schema evolution.

Why / Goal

For a smooth online serving path.

Test Plan

Checklist

Reviewers

@airbnb/zipline-maintainers

hzding621 commented 1 month ago

Discussed offline, the ignored failures on streaming write decoding might cause incorrect fetched feature data if we don't restart the streaming job when there is a schema update on GB.

For now we can rely on instrumentation to surface these issues and restart the streaming jobs if necessary for the users.

jbrooks-stripe commented 1 month ago

👋 Would it be possible to change this so that the behavior can be controlled by a feature flag or configuration? These decoding issues would cause us to surface partial data, which most of our models aren't prepared to handle, and have some use cases where users would prefer to fail hard.

pengyu-hou commented 1 month ago

👋 Would it be possible to change this so that the behavior can be controlled by a feature flag or configuration? These decoding issues would cause us to surface partial data, which most of our models aren't prepared to handle, and have some use cases where users would prefer to fail hard.

I noticed that you guys are using a FlagStore to control it here. I am curious to learn more about it and apply it here.

piyushn-stripe commented 1 month ago

@better365 - yeah it would be great to leverage the flag store for this. Here's the original flag store scaffolding pr - https://github.com/airbnb/chronon/pull/686/files And use in a PR (I see you've already started looking at that one) - https://github.com/airbnb/chronon/pull/682/files#diff-9ea959844a5006c401298a8b18ac1b693a917a310345d563a10b76558127a297R255

We could chat sync about this topic in our bi-weekly next week as well.

pengyu-hou commented 4 weeks ago

@better365 - yeah it would be great to leverage the flag store for this. Here's the original flag store scaffolding pr - https://github.com/airbnb/chronon/pull/686/files And use in a PR (I see you've already started looking at that one) - https://github.com/airbnb/chronon/pull/682/files#diff-9ea959844a5006c401298a8b18ac1b693a917a310345d563a10b76558127a297R255

We could chat sync about this topic in our bi-weekly next week as well.

Thanks @piyushn-stripe. Somehow I missed the initial PR for it. Let's sync up tomorrow and I'd like to implement one at our side as well!

pengyu-hou commented 2 weeks ago

@piyushn-stripe I made it configurable with a new param disableErrorThrows. In the long run, we can make it configurable from flag store, custom json in the config, and lastly the fetcher client.

The current approach will not change the behavior. It will first take the flag from flag store. If not present, then use disableErrorThrows.