YACS-RCOS / hamilton

Streaming Data Pipeline for YACS, and hopefully other things too!

Create Source Connector for University Sources #2

Open · Bad-Science opened this issue 6 years ago

To ingest data from university sources, we should create a custom Kafka Source Connector that polls the university sources and reads the data into one or more topics. For more info on developing Connectors, see https://docs.confluent.io/current/connect/devguide.html.

Each Connector should handle one endpoint from one source and should have exactly one task to do so.
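A minimal sketch of what such a Connector might look like (the class names `UniversitySourceConnector` and `UniversitySourceTask` are hypothetical; the Kafka Connect API calls are real):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.source.SourceConnector;

public class UniversitySourceConnector extends SourceConnector {
    private Map<String, String> props;

    @Override
    public void start(Map<String, String> props) {
        // University name, endpoint, source name, and polling interval
        // arrive here as connector properties.
        this.props = props;
    }

    @Override
    public Class<? extends Task> taskClass() {
        return UniversitySourceTask.class;
    }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // Exactly one task per Connector, regardless of maxTasks.
        return Collections.singletonList(props);
    }

    @Override
    public void stop() {}

    @Override
    public ConfigDef config() {
        // Property definitions (names, types, defaults) are elided in this sketch.
        return new ConfigDef();
    }

    @Override
    public String version() {
        return "0.1.0";
    }
}
```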

A Connector Task should make an HTTP request to its source endpoint at a regular interval. The university name, endpoint, source name, and polling interval will be provided as properties to each Connector instance.
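For example, a standalone-mode configuration for one Connector instance could look like the following. `name`, `connector.class`, and `tasks.max` are standard Kafka Connect keys; the remaining property names are placeholders, since this issue does not fix them:

```properties
name=rpi-catalog-source
connector.class=UniversitySourceConnector
tasks.max=1

# Placeholder property names for the values described above
university.name=rpi
source.name=catalog
source.endpoint=https://example.edu/api/v1/courses
poll.interval.ms=60000
```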

The HTTP request should include an updated_since parameter set to the time at which the last successful poll was performed. The Connector Task should use Kafka's built-in offset storage to store this timestamp whenever a poll completes and, when it is initialized, resume from the stored timestamp if one exists.
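A sketch of how the Task might use offset storage, continuing the hypothetical names above (the offset key `updated_since` and the property keys are illustrative; `offsetStorageReader()` and `SourceRecord` are the real Connect APIs):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class UniversitySourceTask extends SourceTask {
    private Map<String, String> sourcePartition;
    private long updatedSince;
    private long pollIntervalMs;
    private String endpoint;

    @Override
    public void start(Map<String, String> props) {
        endpoint = props.get("source.endpoint");
        pollIntervalMs = Long.parseLong(props.getOrDefault("poll.interval.ms", "60000"));
        sourcePartition = Collections.singletonMap("endpoint", endpoint);

        // Resume from Kafka's built-in offset storage if a previous
        // poll recorded a timestamp for this endpoint.
        Map<String, Object> stored = context.offsetStorageReader().offset(sourcePartition);
        updatedSince = (stored != null && stored.get("updated_since") instanceof Long)
                ? (Long) stored.get("updated_since")
                : 0L; // no prior offset: fetch everything
    }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        Thread.sleep(pollIntervalMs);
        long pollTime = System.currentTimeMillis();
        // HTTP GET: endpoint + "?updated_since=" + updatedSince,
        // then parse the JSON body into records (elided in this sketch).
        Map<String, Object> sourceOffset = Collections.singletonMap("updated_since", pollTime);

        // Attaching the offset to each record lets Connect persist it
        // once the record has been delivered.
        SourceRecord record = new SourceRecord(
                sourcePartition, sourceOffset,
                "uni:rpi.src:catalog.type:course",  // topic, per the format below
                Schema.STRING_SCHEMA, "record-id",  // key: the record's unique identifier
                Schema.STRING_SCHEMA, "{}");        // value: the record's contents
        updatedSince = pollTime;
        return Collections.singletonList(record);
    }

    @Override
    public void stop() {}

    @Override
    public String version() {
        return "0.1.0";
    }
}
```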

The Connector Task should expect its endpoint to return a JSON response compliant with the Hamilton Ingestion Format Version 1 specification, defined in #3. Whether the Connector Task should also validate this data is yet to be decided.

Each parsed record should be added as a message to the appropriate topic. Topic names should be of the form: uni:<university>.src:<sourcename>.type:<recordtype>. The key for each message should be the unique identifier of the record, and the value should be the contents of the record (its attributes and relationships).
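As a small illustration of the naming scheme (one caveat worth noting: Kafka itself only allows alphanumerics, `.`, `_`, and `-` in topic names, so the `:` separator may need revisiting):

```java
public final class TopicNames {
    private TopicNames() {}

    // Builds a topic name of the form uni:<university>.src:<sourcename>.type:<recordtype>,
    // e.g. topicFor("rpi", "catalog", "course") -> "uni:rpi.src:catalog.type:course".
    public static String topicFor(String university, String sourceName, String recordType) {
        return String.format("uni:%s.src:%s.type:%s", university, sourceName, recordType);
    }
}
```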

We are aiming for exceptional test coverage in this repository, so all code must be well tested. No untested code will be merged.