ONSdigital / dp-data-pipelines

Pipeline specific python scripts and tooling for automated website data ingress.
MIT License

Configuration dictionary added #99

Closed: SarahJohnsonONS closed 4 months ago

SarahJohnsonONS commented 5 months ago

What

The pipeline-config.json is being replaced with a dictionary (CONFIGURATION in dpypelines/pipeline/configuration.py) containing the details required to run a transform on a specific dataset. Each key in CONFIGURATION is a regex pattern, and the details for a dataset are accessed by matching its dataset_id against those patterns.

The dataset_id should be specified as part of the input metadata. This value is used to select the matching CONFIGURATION entry, and the relevant dataset_ingress function (the entry's secondary_function value) can then be called with the correct configuration details.

from dpypelines.pipeline.dataset_ingress_v1 import dataset_ingress_v1
from dpypelines.pipeline.shared.transforms.sdmx.v1 import (
    sdmx_sanity_check_v1,
    sdmx_compact_2_0_prototype_1,
)

# Regex pattern matching `dataset_id` as dictionary key
CONFIGURATION = {
    # Default configuration (regex pattern matches any string of characters of length >= 0)
    "^.*$": {
        "config_version": 1,
        "transform": sdmx_compact_2_0_prototype_1,
        "transform_inputs": {"^data.xml$": sdmx_sanity_check_v1},
        "transform_kwargs": {},
        "required_files": ["^data.xml$"],
        "supplementary_distributions": ["^data.xml$"],
        "secondary_function": dataset_ingress_v1,
    },
    # `cpih` config details
    "^cpih$": {
        "config_version": 1,
        "transform": sdmx_compact_2_0_prototype_1,
        "transform_inputs": {"^data.xml$": sdmx_sanity_check_v1},
        "transform_kwargs": {},
        "required_files": ["^data.xml$"],
        "supplementary_distributions": ["^data.xml$"],
        "secondary_function": dataset_ingress_v1,
    },
}
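
For reviewers, the lookup described above could be sketched roughly as follows. This is only an illustration, not the code in this PR: get_dataset_config, EXAMPLE_CONFIGURATION and DEFAULT_PATTERN are hypothetical names, and string placeholders stand in for the real callables. Because "^.*$" also matches "cpih", the sketch assumes dataset-specific patterns should be checked before the default; how the pipeline actually resolves that overlap is not shown here.

```python
import re

# Trimmed stand-in for CONFIGURATION: string placeholders replace the real callables.
EXAMPLE_CONFIGURATION = {
    "^.*$": {"config_version": 1, "secondary_function": "default_ingress"},
    "^cpih$": {"config_version": 1, "secondary_function": "cpih_ingress"},
}

# The catch-all key used for the default configuration.
DEFAULT_PATTERN = "^.*$"


def get_dataset_config(dataset_id, configuration=EXAMPLE_CONFIGURATION):
    """Return the entry whose regex key matches dataset_id (hypothetical helper)."""
    # Check dataset-specific patterns first so the catch-all cannot shadow them.
    for pattern, details in configuration.items():
        if pattern != DEFAULT_PATTERN and re.match(pattern, dataset_id):
            return details
    # Fall back to the catch-all entry when nothing more specific matched.
    if DEFAULT_PATTERN in configuration:
        return configuration[DEFAULT_PATTERN]
    raise KeyError(f"No configuration matches dataset_id {dataset_id!r}")
```

With this ordering, get_dataset_config("cpih") returns the `cpih` entry, while any other dataset_id falls through to the default entry.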

How to review

Make sure you understand how the values in CONFIGURATION will be used to configure a pipeline, and that all fields required to run dataset_ingress_v1() are present.
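
One way to check the second point mechanically is sketched below. It assumes the seven fields shown in the entries above are exactly the ones dataset_ingress_v1() needs, which this PR does not confirm; REQUIRED_FIELDS and missing_fields are hypothetical names, not part of dpypelines.

```python
# Field names taken from the CONFIGURATION entries in this PR; treating all of
# them as required is an assumption.
REQUIRED_FIELDS = {
    "config_version",
    "transform",
    "transform_inputs",
    "transform_kwargs",
    "required_files",
    "supplementary_distributions",
    "secondary_function",
}


def missing_fields(configuration):
    """Map each pattern key to the sorted list of fields its entry lacks."""
    problems = {}
    for pattern, details in configuration.items():
        missing = sorted(REQUIRED_FIELDS - set(details))
        if missing:
            problems[pattern] = missing
    return problems
```

An empty result from missing_fields(CONFIGURATION) would mean every entry carries all seven fields.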

Who can review

Anyone.