YACS-RCOS / hamilton

Streaming Data Pipeline for YACS, and hopefully other things too!

Create Specification for Hamilton Source Format Version 1.0 #3

Open Bad-Science opened 6 years ago

Bad-Science commented 6 years ago

The specification for how sources should expose data needs to be formally defined. The purpose of this issue is to document the source format unambiguously, with some examples included.

This specification will be used by every source for the foreseeable future, so it must be well polished. We will aim to support this Version 1.0 indefinitely, although that is not guaranteed at this time.

Some R&D has been done in creating the prototype, codenamed malg, but further refinement is needed. We have decided to go with a normalized form, instead of the nested form we used in malg. We may decide to add support for nesting or JSON pointers in a future minor version 1.X, however.

Bad-Science commented 6 years ago

Working on this in #5

This is what the draft looks like as of now:

Sources

A Hamilton instance can consume any number of sources. A source is meant to be an adapter between a stateful data source and the Hamilton pipeline. These sources may contain entirely disjoint data sets, or may contain overlapping or incomplete collections and records. The Hamilton pipeline will automatically resolve and amalgamate all of the collections and records from each of the sources.

To allow for this, and to simplify development, all sources must comply with the given specification (below). As long as the source complies with the specification, it can be written in any language or framework, and may interface with any external services or dependencies.
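As a rough illustration of what "amalgamate" could mean for overlapping records, here is a minimal Python sketch. It assumes records are matched on their `type` and `id` (as in the examples further down) and that later sources overwrite earlier attribute values. Neither rule is defined in this draft, and the function name `amalgamate` is hypothetical.

```python
def amalgamate(*record_lists):
    """Merge records that share (type, id) across several sources.

    Assumption: attributes from later sources overwrite earlier ones.
    The draft does not specify a conflict rule.
    """
    merged = {}
    for records in record_lists:
        for record in records:
            key = (record["type"], record["id"])
            entry = merged.setdefault(
                key,
                {"type": record["type"], "id": record["id"], "attributes": {}},
            )
            entry["attributes"].update(record.get("attributes", {}))
    # Dicts keep insertion order, so records come back in first-seen order.
    return list(merged.values())


# Two hypothetical sources, each holding part of the same course record.
catalog = [{"type": "courses", "id": "courses/CSCI/1200",
            "attributes": {"longname": "Data Structures"}}]
sis = [{"type": "courses", "id": "courses/CSCI/1200",
        "attributes": {"number": "1200"}}]

combined = amalgamate(catalog, sis)
```

With the two inputs above, `combined` holds a single course record carrying both `longname` and `number`.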

Specification

The source format borrows most of its structure from json:api, with a few modifications. The most significant difference is in the "relationship object". The Hamilton "relationship object" allows for relationships defined by arbitrary fields.

A source SHOULD represent a logical or physical source of data. A source MUST contain one or more HTTP endpoints. Each endpoint MUST respond with JSON. The JSON response MUST have the top-level fields meta and data.
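To make the endpoint requirement concrete, here is a minimal sketch of a source using only the Python standard library. The `/sessions` path, the port, and the names `PAYLOAD`, `render_body`, `SourceHandler` and `serve` are assumptions for illustration; the draft does not fix endpoint paths or any implementation detail.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical payload, modelled on the "sessions" example in this draft.
PAYLOAD = {
    "meta": {"source": {"type": "sessions", "name": "sis"}},
    "data": [
        {
            "type": "sessions",
            "id": "sessions/201809",
            "attributes": {"shortname": "201809", "longname": "Fall 2018"},
        }
    ],
}


def render_body(payload=PAYLOAD):
    """Serialize the response document to UTF-8 encoded JSON bytes."""
    return json.dumps(payload).encode("utf-8")


class SourceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The "/sessions" path is an assumption, not part of the draft.
        if self.path != "/sessions":
            self.send_error(404)
            return
        body = render_body()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Run the source until interrupted. The port number is arbitrary."""
    HTTPServer(("127.0.0.1", port), SourceHandler).serve_forever()
```

Calling `serve()` starts the source locally; a GET request to `/sessions` then returns the JSON document with `meta` and `data` at the top level.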

The meta field MUST contain the field source. The source field contains meta-information on the source itself. It MUST be an object and MUST have the following fields:

The data field contains the records themselves. It MUST be an array containing zero or more objects. We will call these objects Records.
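The top-level rules stated so far can be checked mechanically. Below is a sketch of such a check in Python; `validate_response` is a hypothetical helper, not part of Hamilton. It only covers what the draft states explicitly (the presence of meta and data, meta.source being an object, data being an array of objects) and does not check the required fields of source or of each record, since the draft does not list them yet.

```python
def validate_response(doc):
    """Return a list of rule violations; an empty list means none were found."""
    if not isinstance(doc, dict):
        return ["response must be a JSON object"]

    errors = []
    for key in ("meta", "data"):
        if key not in doc:
            errors.append(f"missing top-level field: {key}")

    if "meta" in doc:
        meta = doc["meta"]
        if not isinstance(meta, dict):
            errors.append("meta must be an object")
        elif not isinstance(meta.get("source"), dict):
            errors.append("meta.source must be an object")

    if "data" in doc:
        data = doc["data"]
        if not isinstance(data, list):
            errors.append("data must be an array")
        elif not all(isinstance(record, dict) for record in data):
            errors.append("every entry in data must be an object")

    return errors
```

An empty `data` array passes, matching the "zero or more objects" wording.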

Each record in data MUST have the following fields:

Examples

{
    "meta": {
        "source": {
            "type": "courses",
            "name": "catalog"
        }
    },
    "data": [
        {
            "type": "courses",
            "id": "courses/CSCI/1200",
            "attributes": {
                "shortname": "CSCI 1200",
                "number": "1200",
                "longname": "Data Structures",
                ...
            },
            "relationships": {
                "subject": {
                    "data": {
                        "type": "subjects",
                        "attributes": { "shortname": "CSCI" }
                    }
                }
            }
        },
        {
            "type": "courses",
            "id": "courses/MATH/2400",
            "attributes": {
                "shortname": "MATH 2400",
                "number": "2400",
                "longname": "Differential Equations",
                ...
            },
            "relationships": {
                "subject": {
                    "data": {
                        "type": "subjects",
                        "attributes": { "shortname": "MATH" }
                    }
                }
            }
        }
    ]
}
{
    "meta": {
        "source": {
            "type": "sessions",
            "name": "sis"
        }
    },
    "data": [
        {
            "type": "sessions",
            "id": "sessions/201809",
            "attributes": {
                "shortname": "201809",
                "longname": "Fall 2018"
            }
        }
    ]
}
{
    "meta": {
        "source": {
            "type": "sections",
            "name": "sis"
        }
    },
    "data": [
        {
            "type": "sections",
            "id": "sections/CSCI/1200/201809/01",
            "attributes": {
                "shortname": "01",
                "seats": 10,
                "seats_taken": 8,
                "status": "open"
            },
            "relationships": {
                "session": {
                    "data": {
                        "type": "sessions",
                        "attributes": { "shortname": "201809" }
                    }
                },
                "course": {
                    "data": {
                        "type": "courses",
                        "attributes": { "shortname": "CSCI 1200" }
                    }
                }
            }
        }
    ]
}
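
The relationship objects in these examples identify their target by type plus arbitrary attribute values rather than by id. One way a consumer might resolve such a relationship is sketched below in Python. The matching rule (same type, and every listed attribute equal) is an assumption drawn from the examples, and `resolve_relationship` is a hypothetical name.

```python
def resolve_relationship(relationship, records):
    """Return the records whose type and attributes match the relationship's data.

    Assumption: a record matches when its type is equal and every attribute
    listed in the relationship has the same value on the record.
    """
    target = relationship["data"]
    wanted = target.get("attributes", {})
    return [
        record
        for record in records
        if record.get("type") == target.get("type")
        and all(
            record.get("attributes", {}).get(name) == value
            for name, value in wanted.items()
        )
    ]


# Records modelled on the "courses" example above.
courses = [
    {"type": "courses", "id": "courses/CSCI/1200",
     "attributes": {"shortname": "CSCI 1200", "number": "1200"}},
    {"type": "courses", "id": "courses/MATH/2400",
     "attributes": {"shortname": "MATH 2400", "number": "2400"}},
]

# The "course" relationship from the "sections" example above.
section_course = {
    "data": {"type": "courses", "attributes": {"shortname": "CSCI 1200"}}
}

matches = resolve_relationship(section_course, courses)
```

Here `matches` contains only the `courses/CSCI/1200` record. A relationship that matches several records, or none, returns a longer or empty list; how Hamilton should treat those cases is not covered by this draft.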