# Datahub service for running dataflows
This service runs the flows for datasets that are frequently updated and maintained by Datahub. It uses Datapackage Pipelines, a framework for declarative stream-processing of tabular data, together with DataFlows, to run the flows through pipelines and process the datasets.
You will need Python 3.x:

```bash
pip install -r requirements.txt
```
Each "folder" in datasets
directory is named after publisher's username and each dataset in it is a standalone repository on GitHub and should be submoduled. To add a new datasets you will need to submodule your dataset repo into related directory (or create if not exists).
```bash
mkdir datasets/example
cd datasets/example
git submodule add https://github.com/example/my-awesome-dataset
```
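To verify what was recorded, you can inspect `.gitmodules` at the repo root; `git submodule add` writes an entry there (the path below follows the example names above and is illustrative):

```bash
# Expect an entry along these lines:
#   [submodule "datasets/example/my-awesome-dataset"]
#       path = datasets/example/my-awesome-dataset
#       url = https://github.com/example/my-awesome-dataset
cat .gitmodules
```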
Each dataset should have its flows written as Python scripts, plus a `pipeline-spec.yaml` pointing to the flows to run:

`annual-prices.py` - the script responsible for getting, tidying, and normalising the data:
```python
from dataflows import Flow, dump_to_path, load, add_metadata

def flow(parameters, datapackage, resources, stats):
    return Flow(load(load_source='http://www.exampel.com/my-data.csv'))
```
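The extra imports (`add_metadata`, `dump_to_path`) are typically used to set package metadata and to test the flow locally. A minimal sketch under that assumption (the metadata values and output directory are illustrative, not part of the original example):

```python
from dataflows import Flow, dump_to_path, load, add_metadata

def flow(parameters, datapackage, resources, stats):
    return Flow(
        load(load_source='http://www.exampel.com/my-data.csv'),
        # Illustrative metadata; use the dataset's real name and title
        add_metadata(name='annual-prices', title='Annual prices'),
    )

if __name__ == '__main__':
    # Local test run: load the same source and dump the datapackage to ./out
    Flow(
        load(load_source='http://www.exampel.com/my-data.csv'),
        dump_to_path('out'),
    ).process()
```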
`pipeline-spec.yaml` - metadata about the pipelines. Here you define exactly which flows to run and where the config file is saved:

```yaml
example-flow:
  pipeline:
    - flow: annual-prices
    - run: datahub.dump.to_datahub
      parameters:
        config: ~/.config/datahub/config.json.example
```
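With datapackage-pipelines installed, you can also dry-run a pipeline from the repo root before the factory picks it up. The pipeline id is the spec's directory path plus the pipeline name (the path below is illustrative and assumes the example dataset above):

```bash
# List all discovered pipelines
dpp
# Run one pipeline locally
dpp run ./datasets/example/my-awesome-dataset/example-flow
```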
The factory server will read `pipeline-spec.yaml` for each dataset and run the flows and processors stated there. In the example above it will run the `annual-prices` flow (`annual-prices.py`), load the data from http://www.exampel.com/my-data.csv, then run the `datahub.dump.to_datahub` processor and push the files to datahub.io.

To publish datasets on Datahub, each user has their own config file. We need this config file for each user who is subscribed to the factory in order to push datasets under the appropriate username.
Config files for Datahub are usually saved in `~/.config/datahub/config.json`. You will probably need to log in with your Datahub account if you can't find one. Log in and copy your config file into the secrets directory:

```bash
data login
cp ~/.config/datahub/config.json secrets/config.json.example
```
In order to add a new config file to the list, you will have to add `config.json.example` to `secrets/secrets.tar`, which is encrypted. Please contact us if you are not a member of the Datahub developers team; otherwise:

- get `secrets.tar.enc` from the private GitLab repository and decrypt it to `secrets.tar`
- extract it and add `config.json.example` to the `secrets/` directory
- archive `secrets.tar` again and encrypt it with `travis encrypt-file`
- commit and push back to GitHub

For example:
```bash
# Extract
tar xvf secrets.tar
# Add new config
cp ~/.config/datahub/config.json secrets/config.json.example
# Archive again
tar cvf secrets.tar secrets/
# Encrypt
travis encrypt-file secrets.tar
# Commit and push
git add secrets.tar.enc
git commit -m "example user's config"
git push
```
When working locally you will need to update all submodules:

```bash
git submodule init && git submodule update
```
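Equivalently, one command does both (add `--recursive` only if a dataset repo nests submodules of its own):

```bash
git submodule update --init --recursive
```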