ONSdigital / dp-data-pipelines

Pipeline specific python scripts and tooling for automated website data ingress.
MIT License

change approach to configuration #88

Closed mikeAdamss closed 4 months ago

mikeAdamss commented 5 months ago

What is this

Currently we have two points of configuration:

The pipeline config provided externally: https://github.com/ONSdigital/dp-data-pipelines/blob/sandbox/tests/fixtures/test-cases/test_pipeline_config_valid_id.json

The (internal to the repo) transform details that we get from here: https://github.com/ONSdigital/dp-data-pipelines/blob/sandbox/dpypelines/pipeline/shared/details.py (that uses the "transform_identifier")

We need to combine both sources of information so that they can be explicitly specified.

For our purposes here, assume you are given an explicit identifier for the source, e.g. cpih.

This makes the process a whole lot simpler and removes the burden of config from the providing systems (which, given it looks like that's someone trying to write a spreadsheet, is probably a good idea).

cpih example (I might be wrong, please investigate) follows:

dpypelines/pipeline/configuration.py


from dpypelines.pipeline.dataset_ingress_v1 import dataset_ingress_v1
from dpypelines.pipeline.shared.transforms.sdmx.v1 import (
    sdmx_compact_2_0_prototype_1,
    sdmx_sanity_check_v1,
)

# One entry per source identifier; everything needed to run that source lives here.
configuration = {
    "cpih": {
        "config_version": 1,
        "transform": sdmx_compact_2_0_prototype_1,
        # regex pattern for an expected input file -> sanity checker for that file
        "transform_inputs": {
            "^data.xml$": sdmx_sanity_check_v1,
        },
        "transform_kwargs": {},
        "supplementary_distributions": [
            "^data.xml$",
        ],
        "secondary_function": dataset_ingress_v1,
    }
}
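To make the intent a bit more concrete, here is a minimal sketch of how a dict shaped like the one above might be consumed. It is an illustration only, not the repo's implementation: the two sdmx functions are replaced with simple stand-ins so the snippet runs on its own, and `get_source_config`, `match_inputs` and `run_transform` are hypothetical helper names that do not exist in dp-data-pipelines.

```python
import re
from typing import Callable, Dict, List


# Stand-ins (assumptions) for the real dpypelines functions, so the sketch is runnable.
def sdmx_sanity_check_v1(path: str) -> None:
    if not path.endswith(".xml"):
        raise ValueError(f"{path} does not look like an xml file")


def sdmx_compact_2_0_prototype_1(path: str, **kwargs) -> str:
    return f"transformed:{path}"


CONFIGURATION = {
    "cpih": {
        "config_version": 1,
        "transform": sdmx_compact_2_0_prototype_1,
        "transform_inputs": {"^data.xml$": sdmx_sanity_check_v1},
        "transform_kwargs": {},
        "supplementary_distributions": ["^data.xml$"],
    }
}


def get_source_config(source_id: str) -> dict:
    """Look up the configuration for an explicit source identifier."""
    if source_id not in CONFIGURATION:
        raise KeyError(f"no configuration found for source id: {source_id}")
    return CONFIGURATION[source_id]


def match_inputs(source_config: dict, file_names: List[str]) -> Dict[str, Callable]:
    """Map each file name matching a transform_inputs pattern to its sanity checker."""
    matched = {}
    for pattern, checker in source_config["transform_inputs"].items():
        for name in file_names:
            if re.match(pattern, name):
                matched[name] = checker
    return matched


def run_transform(source_id: str, file_names: List[str]) -> List[str]:
    """Sanity check every matched input, then pass it to the configured transform."""
    cfg = get_source_config(source_id)
    inputs = match_inputs(cfg, file_names)
    if not inputs:
        raise ValueError(f"no expected input files found for source id: {source_id}")
    for name, checker in inputs.items():
        checker(name)
    return [cfg["transform"](name, **cfg["transform_kwargs"]) for name in inputs]
```

With this shape the providing system only has to supply the identifier (for example `run_transform("cpih", ["data.xml", "readme.txt"])`); everything else is resolved inside the repo.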

What to do

In a nutshell, this is "how (and can) we replace pipeline-config.json and the transform_details dict with a single dict of configuration?"

Acceptance Criteria

Note - do not remove the old system just yet.
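One way the two systems could sit side by side while this is being worked on is sketched below. This is an assumption about the approach rather than an agreed design: `NEW_CONFIGURATION`, `legacy_lookup` and `resolve_config` are hypothetical names, and the legacy branch is only a placeholder for the existing pipeline-config.json plus transform_details path.

```python
from typing import Optional

# Hypothetical new-style configuration, keyed by explicit source identifier.
NEW_CONFIGURATION = {"cpih": {"config_version": 1}}


def legacy_lookup(pipeline_config: dict) -> dict:
    """Placeholder for the existing path driven by the "transform_identifier"."""
    return {
        "legacy": True,
        "transform_identifier": pipeline_config["transform_identifier"],
    }


def resolve_config(source_id: Optional[str], pipeline_config: Optional[dict] = None) -> dict:
    """Prefer the new single dict; fall back to the old system when it has no entry."""
    if source_id is not None and source_id in NEW_CONFIGURATION:
        return NEW_CONFIGURATION[source_id]
    if pipeline_config is not None:
        return legacy_lookup(pipeline_config)
    raise ValueError("no source id match and no legacy pipeline config supplied")
```

Keeping the fallback in one function means the old path can later be removed in a single place once every source has an entry in the new dict.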