Open samhita-alla opened 8 months ago
🤖
Hello @samhita-alla! I'm Dosu, a friendly bot here to assist you while we wait for a human maintainer. I can help with bugs, answer questions, and even guide you to become a contributor. Just know I'm here to make things easier for you. Let's tackle this together!
Sure, I can provide some clarifying notes about the data lifecycle in Flyte.
Flyte handles data in two parts: Metadata and Raw data.
Metadata consists of data about inputs to a task, and other artifacts. It is stored in a metadata bucket that is configured globally for FlytePropeller, FlyteAdmin, etc., and the running pods/jobs need access to this bucket to get the data.
Raw data is the actual data (such as a Pandas DataFrame or a Spark DataFrame). Raw data paths are unique for every execution, and the prefixes can be modified per execution. None of the Flyte control plane components access the raw data. This cleanly separates data between the control plane and the data plane.
For every task that receives input, Flyte sends an Inputs Metadata object. It contains all primitive or simple scalar values inlined; complex, large objects are offloaded, and the Metadata simply stores a reference to each of them.
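The inline-versus-offload split described above can be illustrated with a minimal sketch. It uses only the Python standard library and is not Flyte's or Flytekit's actual implementation: the names build_inputs_metadata, resolve and INLINE_TYPES are hypothetical, a temporary directory stands in for the raw data prefix, and JSON stands in for whatever format a real type transformer would write.

```python
import json
import tempfile
from pathlib import Path
from typing import Any
from urllib.parse import urlparse
from urllib.request import url2pathname

# Hypothetical rule: these scalar types are small enough to inline.
INLINE_TYPES = (bool, int, float, str)


def build_inputs_metadata(inputs: dict, raw_prefix: Path) -> dict:
    """Inline scalars; offload everything else and keep only a reference."""
    metadata = {}
    for name, value in inputs.items():
        if isinstance(value, INLINE_TYPES):
            metadata[name] = {"kind": "inline", "value": value}
        else:
            # The large object goes to the raw data location ...
            target = raw_prefix / f"{name}.json"
            target.write_text(json.dumps(value))
            # ... and the metadata records only where to find it.
            metadata[name] = {"kind": "reference", "uri": target.as_uri()}
    return metadata


def resolve(entry: dict) -> Any:
    """Return an inlined value directly, or load the object a reference points to."""
    if entry["kind"] == "inline":
        return entry["value"]
    path = Path(url2pathname(urlparse(entry["uri"]).path))
    return json.loads(path.read_text())


# A temporary directory plays the role of the per-execution raw data prefix.
raw_prefix = Path(tempfile.mkdtemp())
meta = build_inputs_metadata({"epochs": 3, "rows": [[1, 2], [3, 4]]}, raw_prefix)
```

In this sketch, anything that only routes data needs nothing beyond meta, while the task that actually consumes rows calls resolve to read the offloaded object.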
Flytekit TypeTransformers make it possible to use complex objects as if they were available locally, just like persistent filehandles. But the Flyte backend only deals with the references. Thus, primitive data types and references to large objects fall under Metadata (Meta input or Meta output), and the actual large object is known as Raw data.
Metadata in Flyte is critical to enable the passing of data between tasks. It makes it possible to perform in-memory computations for branches, send partial outputs from one task to another, or compose outputs from multiple tasks into one input to be sent to a task.
Flyte is primarily a DataFlow Engine. It enables movement of data and provides an abstraction to enable movement of data between different languages. The workflow engine is responsible for moving data from a previous task to the next task. As explained previously, Flyte only deals with Metadata and not the actual Raw data.
Flytekit has a pluggable data persistence layer. This is driven by the protocol, that is, the URI scheme. For example, it is theoretically possible to use S3 (s3://) for metadata and GCS (gcs://) for raw data. It is also possible to create your own protocol (my_fs://) to change how data is stored and accessed. But for Metadata, the data should be accessible to the Flyte control plane. Data persistence is also pluggable. By default, it supports all major blob stores and uses an interface defined in Flytestdlib.
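To make protocol-driven persistence concrete, here is a minimal sketch of a registry that picks a storage backend from the protocol prefix of a URI. It is an illustration under stated assumptions, not Flytekit's actual persistence API: PersistenceRegistry and InMemoryStore are hypothetical names, and the in-memory dictionaries stand in for real S3, GCS or custom backends.

```python
class InMemoryStore:
    """Stand-in backend; a real one would talk to a blob store."""

    def __init__(self):
        self._blobs = {}

    def put(self, uri: str, data: bytes) -> None:
        self._blobs[uri] = data

    def get(self, uri: str) -> bytes:
        return self._blobs[uri]


class PersistenceRegistry:
    """Routes each read or write to the backend registered for the URI's protocol."""

    def __init__(self):
        self._backends = {}

    def register(self, protocol: str, backend) -> None:
        self._backends[protocol] = backend

    def backend_for(self, uri: str):
        # Split on "://" by hand so that custom names such as my_fs are accepted.
        protocol, sep, _ = uri.partition("://")
        if not sep or protocol not in self._backends:
            raise ValueError(f"no backend registered for {uri!r}")
        return self._backends[protocol]

    def put(self, uri: str, data: bytes) -> None:
        self.backend_for(uri).put(uri, data)

    def get(self, uri: str) -> bytes:
        return self.backend_for(uri).get(uri)


# Hypothetical layout: metadata on s3://, raw data on gcs://, plus a custom protocol.
metadata_store, raw_store, custom_store = InMemoryStore(), InMemoryStore(), InMemoryStore()
registry = PersistenceRegistry()
registry.register("s3", metadata_store)
registry.register("gcs", raw_store)
registry.register("my_fs", custom_store)

registry.put("s3://meta-bucket/inputs.pb", b"metadata")
registry.put("gcs://raw-bucket/exec-1/df.parquet", b"raw bytes")
registry.put("my_fs://anything/blob", b"custom")
```

The calling code never names a backend; changing where metadata or raw data lives is only a matter of changing the URI prefix and registering a handler for it.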
In the current version of the Flyte codebase, Metadata and Raw data are handled as follows:
Metadata is serialized and deserialized using the marshalMetadata and unmarshalMetadata functions respectively. These functions are not shown in the provided context, but they are likely responsible for converting the metadata to and from a format that can be stored in the database. The serialized metadata is stored in the SerializedMetadata field of the models.Artifact struct.
Raw data is represented as models.ArtifactData and is stored in the ArtifactData field of the models.Artifact struct. The CreateArtifactModel function takes a slice of models.ArtifactData as an argument, which is then assigned to the ArtifactData field of the models.Artifact struct that is returned.
The FromArtifactModel function is used to convert a models.Artifact back into a datacatalog.Artifact. This involves unmarshalling the serialized metadata and converting the models.ArtifactData back into the format used by the datacatalog.Artifact.
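The round trip between the request-side artifact and the stored model can be sketched as follows. This is a Python illustration of the idea only, not the Go code in the Flyte repository: the function names mirror the ones mentioned above, but their signatures, the dictionary-shaped request and the use of JSON for serialization are all assumptions made for this example.

```python
import json
from dataclasses import dataclass


@dataclass
class ArtifactData:
    """Stands in for models.ArtifactData (fields are assumed)."""
    name: str
    value: str


@dataclass
class ArtifactModel:
    """Stands in for models.Artifact (fields are assumed)."""
    dataset_id: str
    artifact_id: str
    serialized_metadata: bytes
    artifact_data: list


def marshal_metadata(metadata: dict) -> bytes:
    # Assumption: JSON is used here purely for illustration.
    return json.dumps(metadata, sort_keys=True).encode("utf-8")


def unmarshal_metadata(blob: bytes) -> dict:
    return json.loads(blob.decode("utf-8"))


def create_artifact_model(request: dict, artifact_data: list) -> ArtifactModel:
    """Serialize the metadata; attach the data entries unchanged."""
    return ArtifactModel(
        dataset_id=request["dataset_id"],
        artifact_id=request["artifact_id"],
        serialized_metadata=marshal_metadata(request["metadata"]),
        artifact_data=list(artifact_data),
    )


def from_artifact_model(model: ArtifactModel) -> dict:
    """Rebuild the request-side view from the stored model."""
    return {
        "dataset_id": model.dataset_id,
        "artifact_id": model.artifact_id,
        "metadata": unmarshal_metadata(model.serialized_metadata),
        "data": [(d.name, d.value) for d in model.artifact_data],
    }


request = {
    "dataset_id": "dataset-1",
    "artifact_id": "artifact-1",
    "metadata": {"owner": "team-a", "version": "v1"},
}
data = [ArtifactData(name="o0", value="s3://raw-bucket/exec-1/o0")]
model = create_artifact_model(request, data)
restored = from_artifact_model(model)
```

The point of the sketch is that the metadata travels as one opaque byte string while the data entries remain a separate list, so either side can be read without touching the other.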
The ToArtifactKey function is used to create an ArtifactKey from a datacatalog.DatasetID and an artifact ID. The ArtifactKey is used to uniquely identify an artifact.
Flyte ensures the separation of Metadata and Raw data in the current version of the codebase by using different data structures and serialization methods for each type of data.
In the provided code, the CreateArtifactModel function is used to create an artifact model. This function takes in a CreateArtifactRequest and an array of ArtifactData as parameters. The CreateArtifactRequest contains the metadata for the artifact, while the ArtifactData contains the raw data.
The metadata is serialized using the marshalMetadata function, which converts the metadata into a byte array. This serialized metadata is then stored in the SerializedMetadata field of the Artifact model.
On the other hand, the raw data is stored directly in the ArtifactData field of the Artifact model without any serialization.
This separation of metadata and raw data allows for more efficient storage and retrieval of data. The metadata can be quickly accessed and searched without having to load the potentially large raw data. Meanwhile, the raw data can be loaded only when necessary, reducing memory usage and improving performance.
You can find more details in the Flyte documentation and the Flyte repository.
I hope this helps! If you have any more questions, feel free to ask.
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
Description
document to update: https://docs.flyte.org/en/latest/concepts/data_management.html
what to add: https://discuss.flyte.org/t/15995893/hi-community-i-am-trying-to-understand-the-flyte-data-lifecy#b374eea0-0c32-44fd-875a-5b79d4a8180b