MarquezProject / marquez

Collect, aggregate, and visualize a data ecosystem's metadata
https://marquezproject.ai
Apache License 2.0

Add the Datasource API #352

Closed ashulmanWeWork closed 5 years ago

ashulmanWeWork commented 5 years ago

Introduction

To understand where a particular dataset lives, we need to introduce the concept of a Datasource. Example Datasources include Redshift, S3, a MySQL instance, etc.

The Datasource object goes beyond classifying the type of the data store: it also provides connection information about where the data lives. Properties should include a Datasource's name and a connection URL.

Using Datasources with Datasets

Every dataset lives in a datasource, and this relationship is expressed on datasets.datasourceUUID.

Access patterns

We have identified a few access patterns for a user to get more information about a datasource:

API Endpoints

GET /api/v1/datasources -- list all datasources
GET /api/v1/datasources?urn=
POST /api/v1/datasources -- create a datasource

URN format

urn:<type>:<name>

Ex: urn:redshift:staging-dw, where name="staging-dw" and type="redshift"
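As a rough illustration of the urn:<type>:<name> format, the sketch below builds and parses such a URN. The names DatasourceUrn, build_urn, and parse_urn are hypothetical and not part of Marquez; it also assumes that neither the type nor the name contains a colon, which the issue does not state.

```python
from typing import NamedTuple


class DatasourceUrn(NamedTuple):
    """Hypothetical container for the two URN components."""
    ds_type: str
    name: str


def build_urn(ds_type: str, name: str) -> str:
    """Join a type and a name into the urn:<type>:<name> form."""
    return f"urn:{ds_type}:{name}"


def parse_urn(urn: str) -> DatasourceUrn:
    """Split a urn:<type>:<name> string; raise ValueError if it is malformed."""
    parts = urn.split(":")
    # Assumption: exactly three colon-separated, non-empty parts.
    if len(parts) != 3 or parts[0] != "urn" or not parts[1] or not parts[2]:
        raise ValueError(f"not a valid datasource URN: {urn!r}")
    return DatasourceUrn(ds_type=parts[1], name=parts[2])
```

For example, build_urn("redshift", "staging-dw") gives "urn:redshift:staging-dw", and parse_urn recovers the type and name from that string.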

Valid Types

A whitelist will be implemented at the application layer. A potential first set of supported types is: redshift, mysql, postgresql, snowflake
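An application-layer whitelist could look roughly like the sketch below. SUPPORTED_TYPES and validate_type are hypothetical names, and the set simply mirrors the potential first list given above; the real implementation may differ.

```python
# Assumption: the whitelist is a fixed set mirroring the proposed first types.
SUPPORTED_TYPES = frozenset({"redshift", "mysql", "postgresql", "snowflake"})


def validate_type(ds_type: str) -> str:
    """Return ds_type unchanged if it is whitelisted, else raise ValueError."""
    if ds_type not in SUPPORTED_TYPES:
        allowed = ", ".join(sorted(SUPPORTED_TYPES))
        raise ValueError(f"unsupported datasource type {ds_type!r}; expected one of: {allowed}")
    return ds_type
```

Keeping the whitelist in one place means a new type can be supported by extending the set, without touching the callers.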

Examples

[POST] Request Payload for Creation:

{
  "name" : "building_team_mysql_staging",
  "connectionUrl" : "jdbc:mysql://mysql_1.staging.wework.com:3306/"
}
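A client might assemble this creation request as sketched below, using only the Python standard library. The base URL http://localhost:5000 and the function name build_create_request are assumptions for illustration; the sketch only constructs the request and does not send it.

```python
import json
import urllib.request

# Assumption: the API is reachable at this base URL; adjust for your deployment.
BASE_URL = "http://localhost:5000"


def build_create_request(name: str, connection_url: str,
                         base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/datasources request."""
    body = json.dumps({"name": name, "connectionUrl": connection_url}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/api/v1/datasources",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_request(
    "building_team_mysql_staging",
    "jdbc:mysql://mysql_1.staging.wework.com:3306/",
)
```

Passing the resulting object to urllib.request.urlopen would send it to a running server.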

[POST, GET] Response Payload:

{
  "createdAt" : "2019-01-14T11:03:12.016Z",
  "name" : "building_team_mysql_staging",
  "connectionUrl" : "jdbc:mysql://mysql_1.staging.wework.com:3306/"
}

Field Details

"name": string: A human-generated name for the datasource. It must be unique in the table.
"connectionUrl": string: The string should have the format "protocol://host:port/database".
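One way to check the protocol://host:port/database shape is a regular expression, as in the sketch below. parse_connection_url is a hypothetical helper. Because the example payload uses a "jdbc:mysql" prefix and leaves the database part empty, the pattern is assumed to allow colons in the protocol and an empty database segment; the issue itself does not spell out these rules.

```python
import re

# Assumptions: the protocol may contain ':' (e.g. "jdbc:mysql"), the port is
# numeric, and the database segment may be empty, as in the example payload.
_CONNECTION_URL = re.compile(
    r"(?P<protocol>[A-Za-z][A-Za-z0-9+.:-]*)://"
    r"(?P<host>[^/:]+):(?P<port>\d+)/(?P<database>[^/]*)"
)


def parse_connection_url(url: str) -> dict:
    """Split a connection URL into its parts; raise ValueError if malformed."""
    match = _CONNECTION_URL.fullmatch(url)
    if match is None:
        raise ValueError(f"expected protocol://host:port/database, got {url!r}")
    parts = match.groupdict()
    parts["port"] = int(parts["port"])  # expose the port as an integer
    return parts
```

Called on the connectionUrl from the example payload, this yields the protocol "jdbc:mysql", the host "mysql_1.staging.wework.com", the port 3306, and an empty database name.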

Constraints

Case-Sensitivity

Datasource URNs must be specified in lower-case.
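The lower-case constraint could be enforced with a check like the one sketched here. require_lowercase_urn is a hypothetical name, and rejecting (rather than silently lower-casing) mixed-case input is an assumption; the issue does not say which behavior is intended.

```python
def require_lowercase_urn(urn: str) -> str:
    """Return the URN unchanged if it has no upper-case characters, else raise."""
    # Assumption: mixed-case input is rejected instead of being normalized.
    if urn != urn.lower():
        raise ValueError(f"datasource URN must be lower-case: {urn!r}")
    return urn
```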

wslulciuc commented 5 years ago

Fixed #384