flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Docs] document clarifying notes about the data lifecycle #4683

Open samhita-alla opened 8 months ago

samhita-alla commented 8 months ago

Description

document to update: https://docs.flyte.org/en/latest/concepts/data_management.html

what to add: https://discuss.flyte.org/t/15995893/hi-community-i-am-trying-to-understand-the-flyte-data-lifecy#b374eea0-0c32-44fd-875a-5b79d4a8180b

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

dosubot[bot] commented 8 months ago

🤖

Hello @samhita-alla! I'm Dosu, a friendly bot here to assist you while we wait for a human maintainer. I can help with bugs, answer questions, and even guide you to become a contributor. Just know I'm here to make things easier for you. Let's tackle this together!

Sure, I can provide some clarifying notes about the data lifecycle in Flyte.

Flyte handles data in two parts: Metadata and Raw data.

Metadata consists of data about task inputs and other artifacts. Its location is configured globally for FlytePropeller, FlyteAdmin, etc., and the running pods/jobs need access to that metadata bucket to retrieve the data.

Raw data is the actual data (such as a Pandas DataFrame, a Spark DataFrame, etc.). Raw data paths are unique for every execution, and the prefixes can be modified per execution. No Flyte control plane component ever accesses the raw data, which gives a clean separation of data between the control plane and the data plane.

For every task that receives input, Flyte sends an Inputs Metadata object. All primitive or simple scalar values are inlined in it, whereas complex, large objects are offloaded and the metadata stores only a reference to them.
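
To make this concrete, here is a minimal flytekit sketch (the task and workflow names are hypothetical, not from the Flyte docs): the integer output of row_count is inlined directly in the output metadata, while the DataFrame produced by make_report is offloaded to the raw data store and only its URI travels through the metadata.

```python
import pandas as pd
from flytekit import task, workflow


@task
def make_report() -> pd.DataFrame:
    # A DataFrame is a large/complex object: flytekit offloads it to the
    # raw data store and records only a reference (URI) in the output
    # metadata that the Flyte backend sees.
    return pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})


@task
def row_count(df: pd.DataFrame) -> int:
    # An int is a primitive scalar: it is inlined directly in the task's
    # output metadata.
    return len(df)


@workflow
def report_wf() -> int:
    df = make_report()
    # Only the reference passes through the control plane; the downstream
    # task downloads the actual bytes when it runs.
    return row_count(df=df)
```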

Flytekit TypeTransformers make it possible to use complex objects as if they were available locally, much like persistent file handles, while the Flyte backend deals only with references. Thus, primitive data types and references to large objects constitute the Metadata (meta inputs and meta outputs), and the actual large object is the Raw data.
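
As a rough sketch of what such a transformer looks like (MyMatrix and MyMatrixTransformer are hypothetical, and the flytekit imports and method signatures below reflect recent flytekit releases, so double-check them against your version): to_literal uploads the object to the raw data store and returns a literal carrying only a blob URI, while to_python_value downloads the bytes and rebuilds the object.

```python
import os
import pickle
import tempfile
from typing import Type

from flytekit import FlyteContext
from flytekit.extend import TypeEngine, TypeTransformer
from flytekit.models.core.types import BlobType
from flytekit.models.literals import Blob, BlobMetadata, Literal, Scalar
from flytekit.models.types import LiteralType


class MyMatrix:  # hypothetical user-defined type
    def __init__(self, rows):
        self.rows = rows


class MyMatrixTransformer(TypeTransformer[MyMatrix]):
    def __init__(self):
        super().__init__(name="my-matrix", t=MyMatrix)

    def get_literal_type(self, t: Type[MyMatrix]) -> LiteralType:
        # The literal type is a single blob: the metadata will only ever
        # hold a reference, never the matrix itself.
        blob_type = BlobType(
            format="my-matrix",
            dimensionality=BlobType.BlobDimensionality.SINGLE,
        )
        return LiteralType(blob=blob_type)

    def to_literal(self, ctx: FlyteContext, python_val: MyMatrix, python_type, expected) -> Literal:
        # Serialize locally, then upload the bytes to the raw data store.
        local = os.path.join(tempfile.mkdtemp(), "matrix.pkl")
        with open(local, "wb") as f:
            pickle.dump(python_val.rows, f)
        remote = ctx.file_access.get_random_remote_path(local)
        ctx.file_access.put_data(local, remote, is_multipart=False)
        # Only the URI goes into the literal, i.e. into the metadata.
        meta = BlobMetadata(
            type=BlobType(
                format="my-matrix",
                dimensionality=BlobType.BlobDimensionality.SINGLE,
            )
        )
        return Literal(scalar=Scalar(blob=Blob(metadata=meta, uri=remote)))

    def to_python_value(self, ctx: FlyteContext, lv: Literal, expected_python_type) -> MyMatrix:
        # Download the raw bytes referenced by the literal and rebuild the object.
        local = os.path.join(tempfile.mkdtemp(), "matrix.pkl")
        ctx.file_access.get_data(lv.scalar.blob.uri, local, is_multipart=False)
        with open(local, "rb") as f:
            return MyMatrix(pickle.load(f))


# Register the transformer so tasks can use MyMatrix in their signatures.
TypeEngine.register(MyMatrixTransformer())
```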

Metadata in Flyte is critical for passing data between tasks. It allows Flyte to perform in-memory computations for branches, send partial outputs from one task to another, and compose outputs from multiple tasks into a single input for a downstream task.
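
For instance, in the hypothetical workflow below, a and b are promises backed purely by output metadata; the engine composes them into the inputs of combine without ever touching the payloads itself.

```python
from flytekit import task, workflow


@task
def preprocess(n: int) -> int:
    return n * 2


@task
def load_constant() -> int:
    return 10


@task
def combine(x: int, y: int) -> int:
    return x + y


@workflow
def pipeline(n: int = 3) -> int:
    # a and b are promises resolved from the upstream tasks' output
    # metadata; the workflow engine wires them into combine's input.
    a = preprocess(n=n)
    b = load_constant()
    return combine(x=a, y=b)
```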

Flyte is primarily a dataflow engine: it moves data between tasks and provides an abstraction for passing data between different languages. The workflow engine is responsible for moving data from a previous task to the next one, but, as explained above, it deals only with Metadata and never with the actual Raw data.

Flytekit has a pluggable data persistence layer, driven by the protocol in the URI. For example, it is theoretically possible to use S3 (s3://) for metadata and GCS (gcs://) for raw data, and you can even implement your own protocol (my_fs://) to change how data is stored and accessed. The only constraint is that Metadata must be accessible to the Flyte control plane. By default, the persistence layer supports all major blob stores and uses an interface defined in Flytestdlib.
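
As an illustration (the task names and paths below are placeholders), a task that returns a FlyteFile uploads the file's bytes to whatever raw data prefix and protocol the execution is configured with (s3://, gcs://, or a custom scheme), while the literal passed between tasks carries only the resulting URI:

```python
from flytekit import task
from flytekit.types.file import FlyteFile


@task
def export_csv() -> FlyteFile:
    # Write the data locally...
    path = "/tmp/report.csv"
    with open(path, "w") as f:
        f.write("id,value\n1,42\n")
    # ...and return it as a FlyteFile. flytekit uploads the bytes to the
    # configured raw data prefix (e.g. s3://my-raw-bucket/<execution>/...,
    # a placeholder here) and stores only the remote URI in the output
    # metadata that the control plane sees.
    return FlyteFile(path=path)


@task
def consume(f: FlyteFile) -> int:
    # The download happens lazily inside the consuming task, never in the
    # control plane.
    with open(f, "r") as fh:
        return len(fh.readlines())
```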

In the current version of the Flyte codebase, Metadata and Raw data are handled as follows:

Metadata is serialized and deserialized using the marshalMetadata and unmarshalMetadata functions, respectively, which convert the metadata to and from a form that can be stored in the database. The serialized metadata is stored in the SerializedMetadata field of the models.Artifact struct.

Raw data is represented as models.ArtifactData and is stored in the ArtifactData field of the models.Artifact struct. The CreateArtifactModel function takes a slice of models.ArtifactData as an argument and assigns it to the ArtifactData field of the models.Artifact struct that it returns.

The FromArtifactModel function is used to convert a models.Artifact back into a datacatalog.Artifact. This involves unmarshalling the serialized metadata and converting the models.ArtifactData back into the format used by the datacatalog.Artifact.

The ToArtifactKey function is used to create an ArtifactKey from a datacatalog.DatasetID and an artifact ID. The ArtifactKey is used to uniquely identify an artifact.

Flyte ensures the separation of Metadata and Raw data in the current version of the codebase by using different data structures and serialization methods for each type of data.

In the datacatalog code, the CreateArtifactModel function is used to create an artifact model. It takes a CreateArtifactRequest and a slice of ArtifactData as parameters: the CreateArtifactRequest carries the metadata for the artifact, while the ArtifactData carries the raw data.

The metadata is serialized using the marshalMetadata function, which converts the metadata into a byte array. This serialized metadata is then stored in the SerializedMetadata field of the Artifact model.

On the other hand, the raw data is stored directly in the ArtifactData field of the Artifact model without any serialization.

This separation of metadata and raw data allows for more efficient storage and retrieval of data. The metadata can be quickly accessed and searched without having to load the potentially large raw data. Meanwhile, the raw data can be loaded only when necessary, reducing memory usage and improving performance.

You can find more details in the Flyte documentation and the Flyte repository.

I hope this helps! If you have any more questions, feel free to ask.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.