flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Core Feature] Logical types: static type checking for higher level user defined types. #1363

Open kumare3 opened 3 years ago

kumare3 commented 3 years ago

Motivation: Why do you think this is important? Flytekit, and in the future other SDKs, support progressive typing and allow users to define their own types. The TypeTransformers in flytekit today effectively result in type erasure at runtime: the higher-level types are converted to underlying Flyte types, and on retrieval the information about the source type is lost. This works in theory, as the receiving SDK has the right types defined. It also makes it easy to cast a value into any of its derivative types. This technique has been successfully deployed in languages like Java on the JVM.

Example of type derivatives: convert from a Spark DataFrame -> Flyte.Schema -> pandas DataFrame.

But it is desirable to keep the source type available so that we can recover it, even without explicitly requesting this type.

Example: remote.get().outputs.x -> can be correctly cast if the source type is available

Moreover, one problem with type erasure is loss of static type checking across languages or different tasks.

To overcome this problem, this issue proposes introducing a new type called LogicalType, which keeps information about both the source type and the associated transport type.
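The difference between erasure and a logical type can be illustrated with a minimal Python sketch. Everything here is hypothetical and for illustration only: `LiteralType`, `LogicalType`, `erase`, and `describe` are stand-in names, not flytekit or flyteidl APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LiteralType:
    """Stand-in for a Flyte transport type such as SCHEMA or INTEGER."""
    name: str


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical pairing of a source type name with its transport type."""
    urn: str
    representation: LiteralType


SCHEMA = LiteralType("SCHEMA")


def erase(source_type_name: str) -> LiteralType:
    # Erasure: every data frame flavour collapses to the same transport type,
    # so the source type name is dropped.
    return SCHEMA


def describe(source_type_name: str) -> LogicalType:
    # Proposed idea: the transport type travels together with the source name.
    return LogicalType(urn=source_type_name, representation=SCHEMA)


erased_pandas = erase("pandas.DataFrame")
erased_spark = erase("pyspark.DataFrame")
logical_pandas = describe("pandas.DataFrame")
logical_spark = describe("pyspark.DataFrame")
```

After erasure the two values are indistinguishable, while the logical types differ by `urn` and still share one representation.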

Goal: What should the final outcome look like, ideally? Users can specify new types, and we can reverse engineer those types from the stored definition. This helps with debugging, static type assertions, optimizations, and extensibility.

Describe alternatives you've considered: What exists today - type erasure!

[Optional] Propose: Link/Inline OR Additional context -- from @kanterov: A logical type is a type alias for an existing LiteralType, and values of logical types are represented with existing Literals. Logical types can correspond to built-in or user-defined types in the SDK. A logical type is defined as follows (this approach is inspired by the Apache Beam proto):

message LogicalType {
  // Required. Unique resource name for LogicalType.
  // There is a list of well-known logical types supported by SDKs, 
  // and users can add their own
  string urn = 1; 

  // Required. Existing LiteralType used to represent values of LogicalType
  LiteralType representation = 2;

  // Optional. Additional argument for logical type. May be used to serialize additional information
  Literal argument = 3;

  // Optional. Type of argument.
  LiteralType argument_type = 4;
}

Examples of urn values

pandas.DataFrame, pyspark.DataFrame

Semantics: Type t1 is a supertype of logical type t2 iff any of the following holds:

  1. t1 is strictly equal to t2
  2. t1 is a supertype of t3, and t3 is a supertype of t2
  3. t1 is a supertype of t2.representation

This allows us to read unknown logical types using their representation. E.g. if task_1 produces output: LogicalType(representation=INTEGER) and task_2 has input of INTEGER, it’s possible to bind task_2.input to task_1.output. However, it isn’t possible to do the opposite: use any INTEGER as LogicalType(representation=INTEGER).
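The supertype rules and the binding example above can be sketched in Python as follows. This is only an illustration of the stated semantics, not propeller's type checker; the classes and the `example.int32` URN are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class LiteralType:
    """Stand-in for an existing Flyte literal type."""
    name: str


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical logical type: a URN plus the type that represents it."""
    urn: str
    representation: "FlyteType"


FlyteType = Union[LiteralType, LogicalType]


def is_supertype(t1: FlyteType, t2: FlyteType) -> bool:
    """Return True if an output of type t2 may be bound to an input of type t1."""
    if t1 == t2:
        # Strict equality.
        return True
    if isinstance(t2, LogicalType):
        # Representation rule, applied recursively; the recursion also covers
        # transitivity through chains of logical types.
        return is_supertype(t1, t2.representation)
    return False


INTEGER = LiteralType("INTEGER")
INT32 = LogicalType(urn="example.int32", representation=INTEGER)
```

With these definitions, `is_supertype(INTEGER, INT32)` holds, so a task taking INTEGER can consume an INT32 output, while `is_supertype(INT32, INTEGER)` does not hold.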

SDKs have a list of well-known logical types that are mapped to built-in or custom types. flyteconsole or flytectl can have a special behaviour for well-known logical types.

flytepropeller shouldn’t introduce special behaviour for well-known logical types when doing type-checking. This limitation allows new logical types to be introduced without every component of Flyte being aware of them. When a component encounters an unknown logical type, it should be safe for the implementation to fall back to its representation.
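A minimal sketch of that fallback on the consumer side, assuming an SDK keeps a registry keyed by URN. The registry, the `example.Percent` URN, and the `decode` helper are all hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class LogicalType:
    urn: str
    representation: str  # name of the transport type, e.g. "INTEGER"


# Hypothetical per-SDK registry of well-known logical types.
DECODERS: Dict[str, Callable[[Any], Any]] = {
    "example.Percent": lambda raw: raw / 100,
}


def decode(logical_type: LogicalType, raw_value: Any) -> Any:
    """Decode a value, falling back to its representation for unknown URNs."""
    decoder = DECODERS.get(logical_type.urn)
    if decoder is None:
        # Unknown logical type: return the value as stored in its representation.
        return raw_value
    return decoder(raw_value)


known = decode(LogicalType("example.Percent", "INTEGER"), 50)
unknown = decode(LogicalType("example.SomethingNew", "INTEGER"), 50)
```

A component that has never heard of `example.SomethingNew` still gets a usable INTEGER value, which is what lets new logical types be added without coordinated upgrades.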

Examples of well-known logical types

Example: introducing INT32. flyteidl has an INTEGER type that is a 64-bit integer. It’s natural for SDK users to use 32-bit integers unless they need 64 bits. In Java, there are two separate types, Integer and Long, representing 32-bit and 64-bit integers. However, this creates a problem because a 64-bit value can overflow when read into a 32-bit integer. Introducing a logical type for INT32 allows tasks to read INT32 only if the input is bound to a literal that is known to be INT32.
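The overflow concern can be made concrete with a short Python sketch. The `example.int32` URN and the `read_as_int32` helper are hypothetical; the point is only that a reader accepts a value as INT32 when its producer declared it as such, and rejects a plain INTEGER.

```python
from typing import Optional

INT32_URN = "example.int32"  # hypothetical URN for illustration
INT32_MIN = -(2 ** 31)
INT32_MAX = 2 ** 31 - 1


def fits_int32(value: int) -> bool:
    """Check whether a 64-bit INTEGER value is within the 32-bit signed range."""
    return INT32_MIN <= value <= INT32_MAX


def read_as_int32(value: int, source_urn: Optional[str]) -> int:
    """Accept the value only if its producer declared it as INT32."""
    if source_urn != INT32_URN:
        raise TypeError("input is a plain INTEGER, not a known INT32")
    if not fits_int32(value):
        # Defensive check: a well-behaved INT32 producer should never hit this.
        raise OverflowError(f"{value} does not fit in 32 bits")
    return value
```

Without the logical type, the reader would have to either range-check every INTEGER at runtime or risk silent truncation.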

kumare3 commented 3 years ago

One more requirement: LogicalTypes should be able to support meta-outputs that are associated with the core type. For example, when you run a Great Expectations assertion, the result is a Markdown or HTML file containing the results. Flyte does support multiple outputs for a task, but some outputs can automatically be associated with metadata, such as the test suite results in this case. IMO, LogicalTypes should carry this information with them, and FlyteConsole or other clients can show it separately if needed.

Another example could be a DataSet that has an index with it. This is like a multipart Blob, but it also has an additional file (e.g. a JSON or a CSV) that lists all elements in the multipart. This can help in providing transactional semantics on multipart directories.

EngHabu commented 3 years ago

Yee, Ketan, Eduardo and I met to discuss Logical Types further and get on the same page... These are the type changes we are proposing:

The goal of Logical Types is to enable different SDKs to reason about higher-level types the same way. For example, if users define a BigInt type in flytekit, the LiteralType should have enough information for flytekit Remote to map the literal type back to a Python BigInt class... This also allows flyteconsole to have special visualizations for BigInt.

Additional changes:

  1. Add a metadata field to the Literal message. This allows us to attach struct/JSON metadata to Literals (e.g. a Great Expectations validation on an input)
  2. Add a field to the LiteralType to indicate whether it's an optional type (e.g. this allows Optional[int] in Python)
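As a rough illustration of the second item, an SDK could derive the proposed optional flag from a Python annotation with the standard `typing` helpers. `split_optional` is a hypothetical helper, not a flytekit function, and this sketch only handles `typing.Optional` / `typing.Union` spellings, not the `int | None` syntax.

```python
from typing import Optional, Tuple, Union, get_args, get_origin


def split_optional(annotation: object) -> Tuple[object, bool]:
    """Return (inner_type, is_optional) for a Python type annotation."""
    if get_origin(annotation) is Union:
        args = get_args(annotation)
        non_none = [a for a in args if a is not type(None)]
        # Optional[X] is Union[X, None]: exactly one non-None member remains.
        if len(non_none) == 1 and len(non_none) != len(args):
            return non_none[0], True
    return annotation, False


optional_int = split_optional(Optional[int])
plain_int = split_optional(int)
```

The inner type would then map to the usual LiteralType, with the flag carried in the proposed optional field.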

message LogicalTypeInfo {
  // Required. Unique resource name for LogicalType.
  // There is a list of well-known logical types supported by SDKs, 
  // and users can add their own
  string urn = 1; 

  string friendly_name = 2;

  string origin_type = 3;

  map<string, string> labels = 4;

  // do we need to add argument? argument_type?
}

// Defines a strong type to allow type checking between interfaces.
message LiteralType {
    oneof type {
        // A simple type that can be compared one-to-one with another.
        SimpleType simple = 1;

        // A complex type that requires matching of inner fields.
        SchemaType schema = 2;

        // Defines the type of the value of a collection. Only homogeneous collections are allowed.
        LiteralType collection_type = 3;

        // Defines the type of the value of a map type. The type of the key is always a string.
        LiteralType map_value_type = 4;

        // A blob might have specialized implementation details depending on associated metadata.
        BlobType blob = 5;

        // Defines an enum with pre-defined string values.
        EnumType enum_type = 7;
    }

    // This field contains type metadata that is descriptive of the type, but is NOT considered in type-checking.  This might be used by
    // consumers to identify special behavior or display extended information for the type.
    google.protobuf.Struct metadata = 6;

    // Proposed. Field number 8 is used here because 7 is already taken by enum_type.
    LogicalTypeInfo logical_type = 8;
}

github-actions[bot] commented 1 year ago

Hello 👋, This issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will close the issue if we detect no activity in the next 7 days. Thank you for your contribution and understanding! 🙏

kumare3 commented 1 year ago

Maybe we should keep this issue in a deprioritized state

github-actions[bot] commented 6 months ago

Hello 👋, this issue has been inactive for over 9 months. To help maintain a clean and focused backlog, we'll be marking this issue as stale and will engage on it to decide if it is still applicable. Thank you for your contribution and understanding! 🙏