
How to disable Feast's serialization format #4471

Open · fcas opened this issue 2 months ago

fcas commented 2 months ago

Is your feature request related to a problem? Please describe. Would it be possible to disable the serialization process?

Describe the solution you'd like An option to disable the serialization process, for example: -1.

Describe alternatives you've considered One additional idea to consider might be the development of an extension for a customized serialization process.

Additional context Currently, writing to the DynamoDB online store involves a serialization step. While DynamoDB offers a convenient CSV export feature, the exported data retains Feast's serialization format. This creates a tight coupling between our data export and Feast's internal serialization process, which could lead to inflexibility and potential issues if Feast's serialization changes in the future. Ref: https://github.com/feast-dev/feast/blob/master/docs/specs/online_store_format.md
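
For illustration, here is a minimal sketch of the step in question, assuming the Value message from feast.protos.feast.types.Value_pb2 (the import path may differ across Feast versions); the exported CSV contains the serialized bytes rather than the original number:

```python
# Rough sketch, not Feast internals verbatim: feature values are wrapped in a
# protobuf Value message and serialized to bytes before being written to the
# online store, so a DynamoDB CSV export contains these bytes, not "42.5".
from feast.protos.feast.types.Value_pb2 import Value  # import path assumed

raw_feature = 42.5
proto_value = Value(double_val=raw_feature)
serialized = proto_value.SerializeToString()     # what actually lands in DynamoDB

print(serialized)                                # opaque protobuf bytes
print(Value.FromString(serialized).double_val)   # 42.5 again after deserialization
```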

paulochf commented 2 months ago

While DynamoDB offers a convenient CSV export feature, the exported data retains Feast's serialization format.

Also, DynamoDB offers a convenient CSV import through its Import from S3 feature.

However, the serialization step still impairs any straightforward data movement to/from DynamoDB, and it could be made optional or worked around for such processes.

franciscojavierarceo commented 2 months ago

We could add an additional field with the unserialized data and just not use it for retrieval. Could be configured at feature view level.

Would you be open to contributing this?

FYI @tokoko

tokoko commented 2 months ago

Hey, thanks for the issue. Unfortunately, this is a bit more complicated than disabling a serialization step. The way Feast's pluggable OnlineStore interface is set up right now (DynamoDB is one of its implementations), the online store already expects the values it writes to be in a serialized protobuf format (the online_write_batch method), and the data it returns is also in that proto format (the online_read method). So, in order to accomplish this, we would need to add an additional ser/de step to and from plaintext, or change the interface, which is not that easy either.
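
For readers unfamiliar with that interface, here is an abridged sketch of the boundary being described (simplified from Feast's OnlineStore base class; exact signatures vary by version): values already arrive at, and leave, a store implementation as protobufs.

```python
# Abridged sketch (not the exact Feast signatures): by the time a store
# implementation sees the data, the feature values are already ValueProto
# objects, and reads must hand them back in the same form.
from datetime import datetime
from typing import Dict, List, Optional, Tuple

from feast.protos.feast.types.EntityKey_pb2 import EntityKey as EntityKeyProto
from feast.protos.feast.types.Value_pb2 import Value as ValueProto


class OnlineStoreSketch:
    def online_write_batch(
        self,
        data: List[Tuple[EntityKeyProto, Dict[str, ValueProto], datetime, Optional[datetime]]],
    ) -> None:
        """Values to write are handed in already wrapped in ValueProto."""
        ...

    def online_read(
        self,
        entity_keys: List[EntityKeyProto],
        requested_features: Optional[List[str]] = None,
    ) -> List[Tuple[Optional[datetime], Optional[Dict[str, ValueProto]]]]:
        """Values read back must be returned as ValueProto as well."""
        ...
```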

Additionally, imho CSV import/export functionality, however convenient it may be, is really not the reason to consider a change like this. The whole point of an online store is to enable low-latency data retrieval, and the data layout used needs to be optimized for performance. To be fair, in the case of DynamoDB specifically I'm not sure how impactful storing values as protos is, but the point is that this sort of change should ideally target retrieval/ingestion performance, or at least show that they won't be impacted.

We could add an additional field with the unserialized data and just not use it for retrieval. Could be configured at feature view level.

idk, sounds like a waste of both compute and storage. I don't think we should do that in the main repo even if it could be configurable. It could be done as an external online store implementation, though.
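
As a sketch of that last suggestion (all names below are hypothetical, and the DynamoDBOnlineStore import path may differ by Feast version), an external implementation could let the built-in DynamoDB store write the protos as usual and then add a human-readable attribute per feature that Feast itself never reads back:

```python
# Hypothetical sketch of an external online store implementation: reuse the
# built-in DynamoDB store for the normal proto write, then (in a second pass,
# omitted here because it depends on the table/key layout) write an extra
# "<feature>__plain" attribute per item so CSV exports stay human-readable.
from feast.infra.online_stores.dynamodb import DynamoDBOnlineStore  # path assumed
from feast.protos.feast.types.Value_pb2 import Value as ValueProto


def to_plain(value: ValueProto):
    """Naively unwrap whichever oneof field is set; good enough for exports."""
    field = value.WhichOneof("val")
    return getattr(value, field) if field is not None else None


class PlaintextDynamoDBOnlineStore(DynamoDBOnlineStore):
    def online_write_batch(self, config, table, data, progress):
        # Let Feast write the serialized protos exactly as it does today.
        super().online_write_batch(config, table, data, progress)
        # Sketch only: a second boto3 pass over the same items would add the
        # plaintext copies, e.g. {f"{name}__plain": to_plain(v) for name, v in ...}.
```

Such a class could then be referenced from feature_store.yaml by setting online_store.type to its fully qualified class path, the same way other custom online stores are plugged in.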

paulochf commented 2 months ago

The whole point of an online store is to enable low-latency data retrieval, and the data layout used needs to be optimized for performance.

To be fair, in the case of DynamoDB specifically I'm not sure how impactful storing values as protos is, but the point is that this sort of change should ideally target retrieval/ingestion performance, or at least show that they won't be impacted.

Exactly. If I understood correctly what Feast does and how it does it, it's a platform that enables you to manage a feature store and keep it in sync between offline and online layers, with DynamoDB being one of the online stores.

From reading the docs (please correct me if I'm mistaken), one should set up a Spark engine to execute that sync process. We opened this thread because such a sync process seems complicated and slow when the S3 Import is a functional DynamoDB feature. While I do agree that it creates a cloud lock-in, I also see it as a chance for improvement on the framework's end, since it already abstracts between AWS, GCP, and local.

Also, we want data retrieval to be as performant as possible, but I still couldn't find in the documentation why serialization would improve it. We would love to understand it better, if you wouldn't mind explaining it here.

franciscojavierarceo commented 2 months ago

Here's the ChatGPT response which I agree with:

Feast stores data in the online store using Protobuf (Protocol Buffers) rather than raw values for several key reasons:

1. Efficient Serialization/Deserialization
2. Versioning and Backward Compatibility
3. Cross-Language Support
4. Data Validation and Schema Enforcement
5. Compression and Performance

In short, Protobuf in Feast helps ensure efficiency, consistency, flexibility, and performance in online feature retrieval and storage, which is key for serving machine learning features at scale.
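
A tiny illustration of the cross-language and type-safety points (using the Value message from feast.protos.feast.types.Value_pb2; the path may vary by version): the stored bytes carry the value's type with them, so any client with the same .proto decodes them identically.

```python
# The serialized value is self-describing with respect to its type: decoding
# it in Python, Java, Go, etc. via the shared Value.proto yields "int64_val
# 10000" rather than a string each consumer would have to re-parse and re-type.
from feast.protos.feast.types.Value_pb2 import Value  # import path assumed

stored = Value(int64_val=10000).SerializeToString()  # bytes held by the online store

decoded = Value.FromString(stored)
print(decoded.WhichOneof("val"), decoded.int64_val)  # -> int64_val 10000
```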

franciscojavierarceo commented 2 months ago

We could probably create an offline deserializer, but that has backfilling requirements too.

HaoXuAI commented 2 months ago

Serialization and deserialization are essential when storing data of any type. You likely haven’t encountered this issue because you're using a NoSQL online store. However, for SQL databases, you must define the column and its type before reading or writing any data.

For example, if I have a money value like $10.00 and want to store it in the database, I need to create the table with a specific schema, such as CREATE TABLE (money FLOAT), before I can insert the value. But if I also need to store a value like a username, I would need a new schema, for instance CREATE TABLE (user_name STRING). This essentially means creating separate tables or fields for each value type.

On the other hand, I could use a generic type like BINARY to store different data types in the same table. In this scenario, serialization/deserialization is needed to convert values between their original types and the stored binary format.
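
A minimal sketch of that generic-binary-column approach, using sqlite3 and Feast's Value proto as the serializer (import path assumed), just to show where the ser/de step becomes unavoidable:

```python
# Heterogeneous values stored in one BLOB column: serialization happens on the
# way in, and reads must deserialize to recover the original types.
import sqlite3

from feast.protos.feast.types.Value_pb2 import Value  # import path assumed

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE features (name TEXT, value BLOB)")  # one column fits every type

rows = [
    ("money", Value(double_val=10.00).SerializeToString()),
    ("user_name", Value(string_val="alice").SerializeToString()),
]
conn.executemany("INSERT INTO features VALUES (?, ?)", rows)

for name, blob in conn.execute("SELECT name, value FROM features"):
    value = Value.FromString(blob)
    kind = value.WhichOneof("val")
    print(name, kind, getattr(value, kind))  # e.g. money double_val 10.0
```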

This challenge is common in database development. A widely used approach to handle this is storing values in formats like Protobuf or Thrift. Another option is using a common data format like Apache Arrow + ADBC, though it's not universally compatible with all databases, especially NoSQL systems.

I hope this helps clarify your question.