airbnb / chronon

Chronon is a data platform for serving for AI/ML applications.
Apache License 2.0

Build metadata endpoint and directory walker #760

Closed yuli-han closed 1 month ago

yuli-han commented 2 months ago

Summary

We currently support uploading metadata to the k-v store only for the key -> conf pair. We want to add a general metadata endpoint class to support more potential use cases.

This PR adds two general classes, MetadataEndPoint and MetadataDirWalker.

MetadataEndPoint:

case class MetadataEndPoint[Conf <: TBase[_, _]: Manifest: ClassTag](
    extractFn: (String, Conf) => (String, String),
    name: String
)

Defined by an extract function and an endpoint name. The extract function takes the file path (a string) and a Conf (which could be a Join, GroupBy, or StagingQuery) and returns a key-value pair. The name is the dataset name used when the data is sent to the k-v store.
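A minimal sketch of how an endpoint might be declared. To stay self-contained it drops the Thrift bound (`TBase`) and uses a stand-in conf type; `SimpleConf`, `SimpleEndPoint`, and the dataset name `EXAMPLE_TEAM_BY_NAME` are hypothetical and not part of this PR.

```scala
// Hypothetical stand-in for a Thrift-generated conf such as Join or GroupBy.
final case class SimpleConf(name: String, team: String)

// Simplified mirror of MetadataEndPoint without the TBase/Manifest/ClassTag bounds.
final case class SimpleEndPoint[Conf](
    extractFn: (String, Conf) => (String, String), // (file path, conf) => (key, value)
    name: String                                   // dataset name in the k-v store
)

object EndPointSketch {
  // Example use case: map each conf name to its owning team; the path is ignored here.
  val teamByName: SimpleEndPoint[SimpleConf] =
    SimpleEndPoint((_, conf) => (conf.name, conf.team), "EXAMPLE_TEAM_BY_NAME")
}
```

Because the extract function receives both the path and the parsed conf, a key can be derived from either one, which is what lets a single class cover several datasets.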

MetadataDirWalker:

class MetadataDirWalker(dirPath: String, metadataEndPointNames: List[String])

Walks the directory, iterates over all the config files, and generates k-v pair metadata for each of the metadata endpoints provided.

The PR adds two metadata endpoints, CHRONON_METADATA and CHRONON_METADATA_BY_TEAM:
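The walk can be sketched roughly as follows. This is not the PR's implementation: it assumes Scala 2.13+, passes raw file content instead of a parsed Thrift conf, and the names `DirWalkerSketch`, `EndPoint`, and `walk` are hypothetical.

```scala
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

object DirWalkerSketch {
  // Hypothetical endpoint: dataset name plus (relative path, file content) => (key, value).
  final case class EndPoint(name: String, extractFn: (String, String) => (String, String))

  // Walks `root` recursively and returns endpoint name -> (key -> values).
  def walk(root: Path, endPoints: List[EndPoint]): Map[String, Map[String, List[String]]] = {
    val stream = Files.walk(root)
    val files =
      try stream.iterator().asScala.filter(p => Files.isRegularFile(p)).toList.sortBy(_.toString)
      finally stream.close() // Files.walk holds directory handles until closed
    endPoints.map { ep =>
      val pairs = files.map { file =>
        // Use "/" separators so keys look the same on every platform.
        val relPath = root.relativize(file).iterator().asScala.mkString("/")
        ep.extractFn(relPath, new String(Files.readAllBytes(file), "UTF-8"))
      }
      // Several files may map to one key, so the values are collected into a list.
      ep.name -> pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    }.toMap
  }
}
```

Collecting values into a list per key is a design choice made for this sketch; it lets one pass serve both one-to-one datasets and one-to-many datasets.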

CHRONON_METADATA: key -> conf JSON in string format, e.g. joins/team/team.example_join.v1 -> {...}

CHRONON_METADATA_BY_TEAM: type/team -> list of keys in string format, e.g. joins/team -> a, b, c
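To illustrate how the two key formats relate, here is a small sketch that derives the type/team key from a conf key and groups conf keys under it. It assumes keys follow a type/team/name layout as in the examples above; `EndPointKeys`, `teamKey`, and `byTeam` are hypothetical helper names, not code from this PR.

```scala
object EndPointKeys {
  // "joins/team/team.example_join.v1" -> "joins/team" (assumes a type/team/name layout).
  def teamKey(confKey: String): String =
    confKey.split("/").take(2).mkString("/")

  // Groups conf keys under their type/team key, in the spirit of CHRONON_METADATA_BY_TEAM.
  def byTeam(confKeys: Seq[String]): Map[String, Seq[String]] =
    confKeys.groupBy(teamKey)
}
```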

Why / Goal

Test Plan

Tested by running the metadata-upload and fetch commands for a join. https://docs.google.com/document/d/1X7n_jskS7JyiiqVB3pStgPilg6luy23ho58twCHeip8/edit?usp=sharing

Checklist

Reviewers

yuli-han commented 2 months ago

@better365 I have updated the dataset name from zipline to chronon. For the Airbnb use case we also need to change the code in mussel; otherwise the job will still fail. I will raise another PR in treehouse once this PR gets stamped.