datopian / metastore-lib

🗄️ Library for storing dataset metadata, with versioning support and pluggable backends including GitHub.
https://tech.datopian.com/versioning/
MIT License

[epic] Git-based MetaStore Service #13

Closed rufuspollock closed 4 years ago

rufuspollock commented 4 years ago

Epic: A stand-alone MetaStore microservice

STATUS: IMPLEMENTED 👍

MetaStore: storage for dataset metadata; it does not store the data itself or the raw data blobs (files). Think CKAN Classic's DB tables, or the datapackage.json format.
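For concreteness, here is a minimal datapackage.json-style descriptor of the kind a MetaStore would hold (field values are made up for illustration):

```python
# A minimal, illustrative datapackage.json-style descriptor.
# The MetaStore holds metadata like this; the actual file contents
# (mydata.csv) live elsewhere, e.g. in Git LFS or blob storage.
minimal_descriptor = {
    "name": "example-dataset",      # made-up example name
    "resources": [
        {
            "name": "mydata",
            "path": "mydata.csv",   # a pointer to the data, not the data
        }
    ],
}
```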

The service should provide the following main capabilities:

Questions:

Acceptance

A library for git(hub) based metastore (no auth required):

Wrap library into a microservice
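One possible shape for that wrapping, sketched as a pure-stdlib WSGI app (the /dataset/<id> route, the dict-backed store and the dataset ids here are made-up stand-ins, not the real service):

```python
import json

# Stand-in for the git-backed metastore library: a dict of
# dataset-id -> descriptor. Purely illustrative.
DATASETS = {"xxx/yyy": {"owner": "xxx", "name": "yyy"}}

def app(environ, start_response):
    """Minimal WSGI wrapper: GET /dataset/<id> returns the descriptor."""
    path = environ.get("PATH_INFO", "")
    if path.startswith("/dataset/"):
        dataset_id = path[len("/dataset/"):]
        if dataset_id in DATASETS:
            body = json.dumps(DATASETS[dataset_id]).encode("utf-8")
            start_response("200 OK", [("Content-Type", "application/json")])
            return [body]
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]
```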

Tasks

8d + 2d for mocks

Analysis

Beginning of README-driven development

# creates the project implicitly
$ curl -X POST https://metastore/dataset/create \
    -H 'Content-Type: application/json' \
    -d '{"owner": "xxx", "name": "yyy"}'
201 CREATED - { "id": "..." }

# check stuff
curl https://metastore/dataset/:id
curl https://metastore/project/:id
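From Python, the same create call could be assembled roughly like this (a sketch: the https://metastore host and the JSON body shape are taken from the curl example above, and the request is only constructed here, not sent):

```python
import json
import urllib.request

# Build (but do not send) the dataset-create request from the example.
# The host and payload shape mirror the curl sketch above and are
# assumptions, not a finalized API.
body = json.dumps({"owner": "xxx", "name": "yyy"}).encode("utf-8")
req = urllib.request.Request(
    url="https://metastore/dataset/create",
    data=body,
    headers={"Content-Type": "application/json"},
    method="POST",
)
# A successful call would answer 201 CREATED with {"id": ...} in the body.
```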

Design Public API

The following actions are exposed via Web API:

# TODO: what would be the difference between this and the dataset ...
def project_read(project_id: str):
    return {
      'owner_org_or_user': ...,
      'dataset': {
        # data package object ...
      },
      'issues': ...,
      'flows': ...,  # future
    }

def dataset_read(dataset_id: str, revision_ref: Optional[str] = None) -> Dataset:
    """Get dataset metadata given a dataset ID and an optional revision
    reference; it would be nice if ``revision_ref`` could be a tag name,
    branch name, commit sha, etc., as with Git.

    The return value is essentially the datapackage.json file from the
    right revision; it includes metadata for all resources.

    dataset_id: tuple (xxx, yyy) or unique identifier
    """
    return {
      # datapackage.json ...
    }

def dataset_create(dataset):
    """
    dataset: is a valid data package object.

    {
      resources: [
        {
          'name': ...,
          'path': 'mydata.csv', # we assume this is in git lfs ...
          'sha256': '...',  # need ...
          'bytes': '...'
        }
      ]
    }
    """
    # Code here will extract ckanext-gitdatahub code

def dataset_update(dataset_id, dataset):
    """
    dataset: a full data package object
    """
    # Code here will extract ckanext-gitdatahub code

def dataset_delete():
    """
    TODO: semantics, at least for GitHub. I think rather than archiving we
    simply mark this in datapackage.json, or do nothing at all; state is
    something managed at the HubStore level (?)
    """
    # Code here will extract ckanext-gitdatahub code

def dataset_move():
    """Move a dataset between organizations (do we need this?)
    """

def dataset_purge(dataset_id: str):
    """Purge a deleted dataset

    This should delete the git repo
    """

def revision_list(dataset_id: str) -> List[Revision]:
    """Get the list of revisions for a dataset

    TODO: is this all changes to the repo, or only changes to
    datapackage.json? ANS: for now, all commits in the repo, because
    e.g. a file might change but not datapackage.json
    """
    return [
      {
        "id": ...,
        "timestamp": ...,
      }
    ]

def tag_list(dataset_id: str) -> List[Tag]:
    """Get list of tags for a dataset
    """

def tag_create(dataset_id: str, tag_name: str, **kwargs) -> Tag:
    """Create a tag (named revision, or "version" in the old 
    ckanext-versions terminology)
    """

def tag_update(dataset_id: str, tag: str, **kwargs) -> Tag:
    """Allows actions like changing the name, the description, etc.
    (tag metadata)
    """

def tag_read(dataset_id: str, tag: str) -> Tag:
    """Get tag metadata
    """

def tag_delete(dataset_id: str, tag: str) -> None:
    """Delete a tag
    """

Porcelain API:

def dataset_revert(dataset_id, to_revision_ref: str) -> Dataset:
    """Revert a dataset to an older revision / tag

    Under the hood this is a `git revert`-like operation,
    and is somewhat equivalent to ckanext-versions'
    `dataset_version_promote` action.
    """

def revision_diff(dataset_id, revision_ref_a: str, revision_ref_b: str) -> DatasetDiff:
    """Compare two revisions of a dataset and return a 'diff' object.

    Maybe this is best handled as a client-side operation and doesn't
    need an API
    """

For gates this is a requirement:

Stuff we only need if we're doing CKAN actions (vs. an independent microservice):

def get_resource(dataset_id: str, resource_id: str, revision_ref: Optional[str] = None) -> Resource:
    """Get resource metadata in a revision, similar to ``dataset_read``
    """
    return filter(..., dataset_read(dataset_id, revision_ref))
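The filter(...) step above could be spelled out like this, given a plain descriptor dict (a sketch; lookup by resource name is an assumption):

```python
def pick_resource(descriptor: dict, resource_name: str) -> dict:
    """Pick one resource's metadata out of a dataset descriptor,
    mirroring the filter(...) step sketched above."""
    for resource in descriptor.get("resources", []):
        if resource.get("name") == resource_name:
            return resource
    raise KeyError("no such resource: %s" % resource_name)
```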

Internal API

API for extensions to hook into

Github

Repo

https://developer.github.com/v3/repos/#get-a-repository

GET /repos/:owner/:repo

DELETE /repos/:owner/:repo

https://developer.github.com/v3/repos/#delete-a-repository

Contents

Parameters:

https://developer.github.com/v3/repos/contents/

GET /repos/:owner/:repo/readme
GET /repos/:owner/:repo/contents/:path

Tags

https://developer.github.com/v3/git/tags/

GET /repos/:owner/:repo/git/tags/:tag_sha

Commits

https://developer.github.com/v3/repos/commits/

GET /repos/:owner/:repo/commits
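The :owner/:repo placeholders in these endpoint templates expand mechanically; a tiny sketch (the owner/repo values below are made up):

```python
def expand(template: str, **params) -> str:
    """Expand :name placeholders in a GitHub-style endpoint template."""
    path = template
    for name, value in params.items():
        path = path.replace(":" + name, value)
    return path
```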

Gitlab

Note: I actually think GitLab may have the cleaner API, e.g. having projects as first-class and repos as distinct.

https://docs.gitlab.com/ee/api/README.html

https://docs.gitlab.com/ee/api/projects.html

shevron commented 4 years ago

This is all done, excluding wrapping with a service, which will be done separately as needed.