feathr-ai / feathr

Feathr – A scalable, unified data and AI engineering platform for enterprise
https://join.slack.com/t/feathrai/shared_invite/zt-1ffva5u6v-voq0Us7bbKAw873cEzHOSg
Apache License 2.0

Add guardrails for materialize different keys #529

Open hangfei opened 2 years ago

hangfei commented 2 years ago

In theory, we can't materialize features with different keys into the same online table, because every row in a table is keyed by the same key.

In the documentation, we do warn users not to materialize features with different keys to the same table.

Right now, if users materialize features with different keys to the same table, one randomly chosen group of features that share a key is materialized to the table, and the groups with other keys are discarded. The documentation alone is not foolproof, so we should throw an exception when users do this.

The code to modify is: https://github.com/linkedin/feathr/blob/main/feathr_project/feathr/client.py#L558

glunkad commented 2 years ago

Hey @hangfei, I'm interested in contributing to this issue. Before I start working on it, would you mind explaining what the issue is about and pointing me to some resources to get started?

hangfei commented 2 years ago

Hi @9gl , thanks for your interest in this project!

Let me explain a bit here first. You can also join our Slack channel (there is a link at the bottom of the homepage) to find other developers there. If you prefer Zoom meetings etc., we can do that as well.

This is the user guide for this API: https://github.com/linkedin/feathr/blob/main/docs/concepts/feature-generation.md.

We compute and write (materialize) the features into online databases, like Redis.

When we materialize features, we can materialize a list of them:

settings = MaterializationSettings("nycTaxiMaterializationJob",
                                   sinks=[redisSink],
                                   feature_names=["f_location_avg_fare", "f_location_max_fare"])

Here we have two features: ["f_location_avg_fare", "f_location_max_fare"].

They will look roughly like this in the Redis database/cache:

table: nycTaxiMaterializationJob

entity key | f_location_avg_fare, f_location_max_fare
123        | "13,30"
237        | "12,99"
983        | "33,11"

(Here the entity key is f_location_id.)

So you can see that all features in the table must share the same entity key. If you have a feature keyed by another ID, such as trip_id, it's not possible to put it into the same table.

So if we find that users are doing this, we should throw an error when they call the MaterializationSettings API.
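To make the idea concrete, here is a minimal sketch of such a check. It is not Feathr's actual code: `check_single_key` is an illustrative helper name, and `feature_keys` is a hypothetical lookup from feature name to entity key name that would have to be built from the feature definitions.

```python
from typing import Dict, List


def check_single_key(feature_names: List[str], feature_keys: Dict[str, str]) -> str:
    """Return the entity key shared by all features, or raise if they differ.

    `feature_keys` is a hypothetical mapping (feature name -> entity key name);
    how to obtain it from the real feature definitions is left open here.
    """
    if not feature_names:
        raise ValueError("feature_names must not be empty")

    # Collect the distinct entity keys used by the requested features.
    distinct_keys = {feature_keys[name] for name in feature_names}

    if len(distinct_keys) > 1:
        raise ValueError(
            "Cannot materialize features with different keys into the same "
            f"table; found keys: {sorted(distinct_keys)}"
        )
    return distinct_keys.pop()
```

A check like this could run before the job is submitted, so the user sees the error immediately instead of ending up with a partially populated table.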

There is one more complex case as well: compound keys, i.e. an entity that has two keys, like [userId, productId]. We need to validate these as well.
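For the compound-key case, one possible approach (a sketch, not how Feathr does it) is to group the features by their full list of key columns and reject the request when more than one group appears. The function names and the `feature_keys` mapping are hypothetical, and comparing key columns as ordered tuples is an assumption: it treats [userId, productId] and [productId, userId] as different keys.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def group_features_by_key(
    feature_keys: Dict[str, Sequence[str]],
) -> Dict[Tuple[str, ...], List[str]]:
    """Group feature names by their (possibly compound) key columns."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for feature, key_columns in feature_keys.items():
        # Assumption: key column order matters, so the tuple is not sorted.
        groups[tuple(key_columns)].append(feature)
    return dict(groups)


def validate_compound_keys(feature_keys: Dict[str, Sequence[str]]) -> None:
    """Raise ValueError if the features do not all share one key."""
    groups = group_features_by_key(feature_keys)
    if len(groups) > 1:
        details = "; ".join(
            f"{list(key)} -> {features}" for key, features in groups.items()
        )
        raise ValueError(
            "Features with different keys cannot share one table: " + details
        )
```

Listing each key group in the error message tells the user which features would need to go into separate materialization jobs.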

Let me know if my explanation is not clear enough. Feel free to ping me here or on Slack.

westofwest commented 2 years ago

Hi @hangfei, I am interested in contributing to this project. Could you point me to where the change needs to be made?