VizierDB / vizier-scala

The Vizier kernel-free notebook programming environment

Proposal: Extensible Artifact Model #318

Open okennedy opened 3 months ago

okennedy commented 3 months ago

Challenge

Vizier's current data model is:

  1. Tightly coupled to Apache Spark: This brings in a 600MB dependency (technically 1.2GB, since pip ends up installing it a second time for Python compatibility).
  2. Very ad-hoc: Type translations are developed organically, on an as-needed basis.
  3. Reliant on 'canonical' types: Every data value has a canonical type. This often necessitates redundant or unnecessarily proactive translations, most commonly with the Dataset type. For example, instead of simply letting Pandas interpret a LoadDataset('csv') with pd.read_csv, we have to go through Spark.
  4. No notion of multi-role objects: For example, a CSV file is a file, but it could also represent a dataframe defined over the file. Presently, it's possible to have both, but you need a separate artifact for each.
  5. No support for transient artifacts --- artifacts created temporarily as cache.

Proposal Summary

  1. Provide Interfaces, Implementations, and Rust-style Into[]/From[] adaptors, mainly with an eye towards decoupling how Vizier and language servers interact with artifacts (Interfaces/Mixins) from the underlying representation of the artifact (see the sketch after this list).
  2. Introduce the notion of 'cache' artifacts
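
As a rough illustration of what the adaptor layer could look like, here is a minimal Scala sketch of Rust-style From/Into; all names here (From, IntoOps, CsvFile, DataFrameHandle) are hypothetical placeholders, not existing Vizier classes:

```scala
// A minimal sketch of Rust-style From/Into adaptors in Scala.
// All names here are hypothetical, not existing Vizier classes.
trait From[-A, +B] { def from(a: A): B }

object From {
  // `a.into[B]` resolves whichever From[A, B] instance is in scope.
  implicit class IntoOps[A](val a: A) extends AnyVal {
    def into[B](implicit ev: From[A, B]): B = ev.from(a)
  }
}

// Two hypothetical artifact representations...
case class CsvFile(url: String)
case class DataFrameHandle(source: String)

object Adaptors {
  // ...and one adaptor instance per supported translation.
  implicit val csvToDataFrame: From[CsvFile, DataFrameHandle] =
    (csv: CsvFile) => DataFrameHandle(s"dataframe over ${csv.url}")
}

object IntoDemo extends App {
  import From._
  import Adaptors._
  // Callers only name the target type; implicit resolution finds the adaptor.
  val df: DataFrameHandle = CsvFile("data/example.csv").into[DataFrameHandle]
  println(df)
}
```

The appeal of this style is that new translations can be registered without touching either endpoint's class.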

Concrete Proposal

The core idea is to decouple the physical representation of an artifact from the ways in which user code interacts with it. This breaks down into four concepts:

Encoding

At present, Vizier's representation of an artifact consists of a small, opaque blob of text data (typically JSON). These blobs are interpreted based on the specific type of artifact, but the interpretation is entirely unstructured and performed on read. There is no common structure across artifacts. This, in particular, makes things like reachability checks hard, since inter-artifact dependencies (e.g., a SQL query over existing tables) always need to be handled ad-hoc.

The first major goal is to define a schema definition language for Artifacts. The schema definition needs to capture:

Then, we define encodings for all of the existing artifact types, perhaps strengthening them somewhat (e.g., explicitly typed primitives, instead of generic parameters).

To emphasize the point, an encoding simply gives a name to the physical manifestation of the artifact and dictates how it is stored in the database. This should be the minimum required to reproduce the artifact (see Artifact Caching below), and should disregard any data that is only needed for efficiency (e.g., keep the URL of a file, but not its contents).
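
To make this concrete, here is one possible shape for such a schema, as a minimal Scala sketch; all names (Encoding, EncodedField, ArtifactRef, etc.) are hypothetical. The point is only that references to other artifacts become first-class fields, so dependency extraction is generic rather than per-type:

```scala
// A sketch of a structured encoding schema (all names hypothetical).
// The key change from today's opaque JSON blob: references to other
// artifacts are first-class fields, so reachability checks become
// generic instead of ad-hoc, per-type code.
sealed trait EncodedField
case class PrimitiveField(name: String, dataType: String, value: String) extends EncodedField
case class ArtifactRef(name: String, artifactId: Long)                   extends EncodedField

case class Encoding(name: String, fields: Seq[EncodedField]) {
  // Every ArtifactRef is an inter-artifact dependency edge.
  def dependencies: Seq[Long] =
    fields.collect { case ArtifactRef(_, id) => id }
}

object EncodingDemo extends App {
  // e.g., a SQL view over an existing table: the dependency on
  // artifact 42 is visible without parsing the query.
  val sqlView = Encoding("sql_view", Seq(
    PrimitiveField("query", "string", "SELECT * FROM r"),
    ArtifactRef("input_table", artifactId = 42L)
  ))
  println(sqlView.dependencies)   // List(42)
}
```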

Some TODOs:

Interface

At present, Vizier uses ArtifactType and MIME types to differentiate the different roles that an artifact can play. The Interface plays a similar role by dictating a specific API to which an artifact can conform (i.e., governing how Vizier, its subsystems, and the user interact with it). Some examples include:
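
As a rough sketch (hypothetical names, not actual Vizier APIs), interfaces might be expressed as plain Scala traits; note that neither trait says anything about how the artifact is physically stored:

```scala
// Hypothetical interface traits: an API contract only, no storage details.
trait FileLike {
  def url: String
  def contents: java.io.InputStream
}

trait DataFrameLike {
  def schema: Seq[(String, String)]   // (column name, type name)
  def rows: Iterator[Seq[Any]]
}
```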

Some TODOs:

Implementation

(An Encoding -> Interface, or Interface -> Interface edge)

In order to decouple Encodings and Interfaces, we need a binding between the two. Somewhere in the code, we need to be able to define code that implements a specific interface for a specific encoding (e.g., how do I get the Spark dataframe for a CSV file? How do I get the Arrow dataframe? etc.).
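
One way this binding could look, as a minimal sketch (hypothetical names; DataFrameLike refers to the trait sketched under Interface above):

```scala
// A sketch of an Implementation: code that realizes one Interface
// for one Encoding (hypothetical names).
trait Implementation[E, I] {
  def implement(encoding: E): I
}

case class CsvFileEncoding(url: String)

// "How do I get a dataframe view of a CSV file?"
object CsvAsDataFrame extends Implementation[CsvFileEncoding, DataFrameLike] {
  def implement(e: CsvFileEncoding): DataFrameLike =
    new DataFrameLike {
      // Stubs standing in for actual CSV parsing logic.
      def schema: Seq[(String, String)] = Seq("column_1" -> "string")
      def rows: Iterator[Seq[Any]] = Iterator.empty
    }
}
```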

Some TODOs:

Conversion

(An Encoding -> Encoding edge)

This is more or less the same as an implementation, save that it generates a new encoding (and, consequently, additional data).
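
A minimal sketch, continuing the hypothetical names from the Implementation sketch above:

```scala
// A sketch of a Conversion: shaped like an Implementation, but its
// output is another Encoding, along with any newly materialized data.
trait Conversion[E1, E2] {
  def convert(source: E1): E2
}

case class ArrowFileEncoding(url: String)

object CsvToArrow extends Conversion[CsvFileEncoding, ArrowFileEncoding] {
  def convert(csv: CsvFileEncoding): ArrowFileEncoding =
    // Stub: a real conversion would write the Arrow file here.
    ArrowFileEncoding(csv.url.stripSuffix(".csv") + ".arrow")
}
```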

Platform Interactions

Generic artifacts necessitate decoupling Vizier from its target platforms, including Spark (but also Scala and Python). This means that we need a code component to translate the Encoding of an artifact into the platform-native equivalent. The natural approach here is to define a set of tiered fallbacks (a sketch follows the list):

  1. Platform-provided logic for directly translating an encoding into a platform-native representation (e.g., CSV File -> Spark Dataframe)
  2. Fall back to platform-provided logic for translating any encoding that implements a specific interface into a platform-native representation (e.g., Function)
  3. Fall back through conversions to an encoding supported by case 1 or 2 (e.g., convert the dataframe to Arrow, then Arrow -> Spark)
  4. Fall back to just providing the encoding directly (e.g., as the JSON-serialized artifact)
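
A minimal sketch of this resolution loop, reusing the hypothetical Encoding type from the Encoding sketch above; Platform abstracts one target runtime (Spark, Scala, Python, ...):

```scala
// A sketch of the tiered fallback resolution (hypothetical names).
trait Platform {
  // Tier 1: direct, encoding-specific translation, if registered.
  def direct(encoding: Encoding): Option[Any]
  // Tier 2: translation via an interface the encoding implements.
  def viaInterface(encoding: Encoding): Option[Any]
}

object Resolver {
  def toNative(
    encoding: Encoding,
    platform: Platform,
    conversions: Seq[Encoding => Option[Encoding]]
  ): Any = {
    def tier1or2(e: Encoding): Option[Any] =
      platform.direct(e) orElse platform.viaInterface(e)

    tier1or2(encoding)                               // tiers 1 and 2
      .orElse {                                      // tier 3: convert, then retry
        conversions.view
          .flatMap { convert => convert(encoding) }
          .flatMap { converted => tier1or2(converted) }
          .headOption
      }
      .getOrElse(encoding)                           // tier 4: the raw encoding
  }
}
```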

Artifact Caching

[more to come]