YACS-RCOS / hamilton

Streaming Data Pipeline for YACS, and hopefully other things too!
Hamilton


Introduction

Hamilton is, put most simply, a batteries-included streaming data pipeline. It consumes data from a number of sources and produces a combined result.

What makes Hamilton special is its ability to identify, combine, and update records across different sources in an automatic and configurable way. Hamilton also handles relationships between records, and can automatically resolve those relationships based on arbitrary attributes. Hamilton is strictly schema-less, and makes no assumptions about the contents of the records or the structure of the relationships.

Hamilton is built on Kafka, and makes heavy use of the Kafka Streams and Kafka Connect APIs. Kafka gives Hamilton the ability to consume both streaming and purely stateful data sources, and Hamilton provides an API specification for stateful data sources.

Any arbitrary data source can therefore be used with Hamilton by creating an adapter microservice in the programmer's language of choice.


Pipeline


Sources

A Hamilton instance can consume any number of sources. A source is meant to be an adapter between a stateful data source and the Hamilton pipeline. These sources may contain entirely disjoint data sets, or may contain overlapping or incomplete collections and records. The Hamilton pipeline will automatically resolve and amalgamate all of the collections and records from each of the sources.
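As a rough illustration of what amalgamating overlapping sources could look like, the sketch below merges records from two sources in plain Python. It assumes records are shaped like the examples in the Specification section, that two records describe the same entity when their (type, id) pairs match, and that later sources overwrite earlier attribute values. These matching rules, the function name `amalgamate`, the `registrar` source, and the `credits` attribute are all hypothetical; Hamilton's actual matching is configurable and is not described here.

```python
from typing import Any

Record = dict[str, Any]


def amalgamate(*sources: list[Record]) -> list[Record]:
    """Merge records from several sources into one list.

    Assumption for this sketch: records match on (type, id), and
    attributes from later sources overwrite those from earlier ones.
    """
    merged: dict[tuple[str, str], Record] = {}
    for records in sources:
        for record in records:
            key = (record["type"], record["id"])
            target = merged.setdefault(
                key,
                {"type": record["type"], "id": record["id"], "attributes": {}},
            )
            # Overlay this source's attributes on what is already known.
            target["attributes"].update(record.get("attributes", {}))
    return list(merged.values())


# Two hypothetical sources holding partial views of the same course.
catalog = [
    {
        "type": "courses",
        "id": "courses/CSCI/1200",
        "attributes": {"shortname": "CSCI 1200", "longname": "Data Structures"},
    }
]
registrar = [
    {
        "type": "courses",
        "id": "courses/CSCI/1200",
        "attributes": {"credits": 4},  # made-up attribute for illustration
    }
]

combined = amalgamate(catalog, registrar)
```

Here `combined` holds a single course record whose attributes are the union of what the two sources reported.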

To allow for this, and to simplify development, all sources must comply with the given specification (below). As long as the source complies with the specification, it can be written in any language or framework, and may interface with any external services or dependencies.

Specification

The source format borrows most of its structure from json:api, with a few modifications. The most significant difference is in the "relationship object". The Hamilton "relationship object" allows for relationships defined by arbitrary fields.

A source SHOULD represent a logical or physical source of data. A source MUST contain one or more HTTP endpoints. Each endpoint MUST respond with JSON. The JSON response MUST have the top-level objects meta and data.
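A minimal sketch of a source, using only the Python standard library, might look like the following. The `/sessions` path, the handler and function names, and the choice of `http.server` are assumptions made for illustration; the specification only requires one or more HTTP endpoints that respond with JSON containing `meta` and `data`. The payload is copied from the sessions example in this document.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Payload copied from the sessions example in this document.
PAYLOAD = {
    "meta": {"source": {"type": "sessions", "name": "sis"}},
    "data": [
        {
            "type": "sessions",
            "id": "sessions/201809",
            "attributes": {"shortname": "201809", "longname": "Fall 2018"},
        }
    ],
}


def render_body(payload: dict) -> bytes:
    """Serialize a response document to UTF-8 encoded JSON."""
    return json.dumps(payload).encode("utf-8")


class SourceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The endpoint path is an assumption of this sketch.
        if self.path != "/sessions":
            self.send_error(404)
            return
        body = render_body(PAYLOAD)
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port: int) -> HTTPServer:
    """Build (but do not start) a server bound to localhost."""
    return HTTPServer(("127.0.0.1", port), SourceHandler)
```

Calling `make_server(8080).serve_forever()` would then serve the document at `http://127.0.0.1:8080/sessions`. A real adapter would build the payload from its backing data store on each request instead of returning a constant.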

The meta field MUST contain the field source. The source field contains meta-information about the source itself. It MUST be an object; in the examples below it carries the fields type and name.

The data field contains the records themselves. It MUST be an array containing zero or more objects. We will call these objects Records.
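The structural rules stated so far can be checked mechanically. The sketch below is one possible checker, not part of Hamilton; the function name `check_source_response` and the wording of its messages are made up for this example. It only covers the rules spelled out in the text (top-level `meta` and `data`, `meta.source` as an object, `data` as an array of objects) and does not check the fields inside `source` or inside each record.

```python
from typing import Any


def check_source_response(doc: Any) -> list[str]:
    """Return a list of violations of the structural MUST rules.

    An empty list means the document passed every check in this sketch.
    """
    if not isinstance(doc, dict):
        return ["response must be a JSON object"]

    problems: list[str] = []
    for key in ("meta", "data"):
        if key not in doc:
            problems.append(f"missing top-level field: {key}")

    if "meta" in doc:
        meta = doc["meta"]
        if not isinstance(meta, dict):
            problems.append("meta must be an object")
        elif not isinstance(meta.get("source"), dict):
            problems.append("meta.source must be present and be an object")

    if "data" in doc:
        data = doc["data"]
        if not isinstance(data, list):
            problems.append("data must be an array")
        else:
            for index, record in enumerate(data):
                if not isinstance(record, dict):
                    problems.append(f"data[{index}] must be an object")

    return problems


valid = {"meta": {"source": {"type": "sessions", "name": "sis"}}, "data": []}
invalid = {"meta": {}, "data": {}}

valid_problems = check_source_response(valid)
invalid_problems = check_source_response(invalid)
```

An empty `data` array is accepted because the text allows zero or more records.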

In the examples below, each record has the fields type, id, and attributes; records that refer to other records also have a relationships field.

Examples

{
    "meta": {
        "source": {
            "type": "courses",
            "name": "catalog"
        }
    },
    "data": [
        {
            "type": "courses",
            "id": "courses/CSCI/1200",
            "attributes": {
                "shortname": "CSCI 1200",
                "number": "1200",
                "longname": "Data Structures",
                ...
            },
            "relationships": {
                "subject": {
                    "data": {
                        "type": "subjects",
                        "attributes": { "shortname": "CSCI" }
                    }
                }
            }
        },
        {
            "type": "courses",
            "id": "courses/MATH/2400",
            "attributes": {
                "shortname": "MATH 2400",
                "number": "2400",
                "longname": "Differential Equations",
                ...
            },
            "relationships": {
                "subject": {
                    "data": {
                        "type": "subjects",
                        "attributes": { "shortname": "MATH" }
                    }
                }
            }
        }
    ]
}

{
    "meta": {
        "source": {
            "type": "sessions",
            "name": "sis"
        }
    },
    "data": [
        {
            "type": "sessions",
            "id": "sessions/201809",
            "attributes": {
                "shortname": "201809",
                "longname": "Fall 2018"
            }
        }
    ]
}

{
    "meta": {
        "source": {
            "type": "sections",
            "name": "sis"
        }
    },
    "data": [
        {
            "type": "sections",
            "id": "sections/CSCI/1200/201809/01",
            "attributes": {
                "shortname": "01",
                "seats": 10,
                "seats_taken": 8,
                "status": "open"
            },
            "relationships": {
                "session": {
                    "data": {
                        "type": "sessions",
                        "attributes": { "shortname": "201809" }
                    }
                },
                "course": {
                    "data": {
                        "type": "courses",
                        "attributes": { "shortname": "CSCI 1200" }
                    }
                }
            }
        }
    ]
}
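To make the attribute-based "relationship object" concrete, the sketch below shows one way such a relationship could be matched against a set of records. It is an illustration only: the function name `resolve` and the matching rule (same type, and every attribute listed in the relationship equal on the record) are assumptions, not Hamilton's documented behavior. The records are trimmed copies of the course examples above.

```python
from typing import Any

Record = dict[str, Any]


def resolve(relationship: dict[str, Any], records: list[Record]) -> list[Record]:
    """Find the records that a relationship object points at.

    Assumption for this sketch: a record matches when its type equals
    the relationship's type and every attribute named in the
    relationship has the same value on the record.
    """
    wanted = relationship["data"]
    wanted_attrs = wanted.get("attributes", {})
    return [
        record
        for record in records
        if record["type"] == wanted["type"]
        and all(
            record.get("attributes", {}).get(name) == value
            for name, value in wanted_attrs.items()
        )
    ]


# Trimmed copies of the course records from the examples above.
courses = [
    {
        "type": "courses",
        "id": "courses/CSCI/1200",
        "attributes": {"shortname": "CSCI 1200", "number": "1200"},
    },
    {
        "type": "courses",
        "id": "courses/MATH/2400",
        "attributes": {"number": "2400", "longname": "Differential Equations"},
    },
]

# The "course" relationship from the sections example.
course_relationship = {
    "data": {"type": "courses", "attributes": {"shortname": "CSCI 1200"}}
}

matches = resolve(course_relationship, courses)
print([record["id"] for record in matches])  # ['courses/CSCI/1200']
```

Because the relationship names attributes instead of an id, the section record does not need to know how the catalog source identifies its courses.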

Contributing

We encourage you to create issues and contribute to Hamilton! To contribute, fork the repo, comment on an issue, and submit a pull request to the master branch. Build checks and code reviews are required before merging. Once checks have passed, a project owner will merge the changes.


Code of Conduct

We strive to create and maintain an inclusive, welcoming and safe community for all. Please review our Code of Conduct in CODE_OF_CONDUCT.md before participating in the project.


License

Hamilton is currently released without a license! Oh no! The project will soon be released under an OSI-approved license, once it has reached a higher level of maturity (very soon!).