Urban-Analytics-Technology-Platform / popgetter

https://popgetter.readthedocs.io/en/latest/
Apache License 2.0

Refactor Belgium DAG for template to generalise to other countries #95

Open sgreenbury opened 6 months ago

sgreenbury commented 6 months ago

From discussion with @yongrenjie as part of #92

Currently the individual census tables are filtered through the use of needed datasets and a corresponding partition.

As begun in #92 (see this section) the config for derived columns can be expanded to include:

To enable the above, the type for derivation config (currently: dict[str, tuple[str, list[DerivedColumn]]]) can be updated to include the extra required items.

This could be something like:

from dataclasses import dataclass
from typing import Callable

import pandas as pd

# One per derived table
@dataclass
class DerivedColumn:
    hxltag: str
    aggregation_func: Callable[[pd.DataFrame], pd.DataFrame]
    output_column_name: str
    human_readable_name: str

# One per source table
@dataclass
class MetricDerivationInstructions:
    geography_level: str
    geo_id_col_name: str
    derived_columns: list[DerivedColumn]
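To make the proposal concrete, here is a minimal sketch of how such a config might be consumed. It is not taken from the popgetter codebase: the helper `derive_metrics`, the table name, the column names, the HXL tags and the geography level are all hypothetical, and it assumes each `aggregation_func` returns a one-column DataFrame indexed by the geography ID.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd


@dataclass
class DerivedColumn:
    hxltag: str
    aggregation_func: Callable[[pd.DataFrame], pd.DataFrame]
    output_column_name: str
    human_readable_name: str


@dataclass
class MetricDerivationInstructions:
    geography_level: str
    geo_id_col_name: str
    derived_columns: list[DerivedColumn]


def derive_metrics(
    source: pd.DataFrame, instructions: MetricDerivationInstructions
) -> pd.DataFrame:
    """Hypothetical helper: build one derived table from one source table."""
    parts = []
    for col in instructions.derived_columns:
        # Assumption: the result is a one-column DataFrame indexed by geo ID.
        result = col.aggregation_func(source)
        parts.append(result.set_axis([col.output_column_name], axis=1))
    # Align all derived columns on the shared geography index.
    return pd.concat(parts, axis=1)


# Illustrative aggregation functions (not from the real pipeline).
def total_population(df: pd.DataFrame) -> pd.DataFrame:
    return df.groupby("GEO_ID")[["count"]].sum()


def children(df: pd.DataFrame) -> pd.DataFrame:
    return df[df["age_band"] == "0-17"].groupby("GEO_ID")[["count"]].sum()


# Config keyed by source table name; all values here are made up.
derivation_config: dict[str, MetricDerivationInstructions] = {
    "hypothetical_source_table": MetricDerivationInstructions(
        geography_level="municipality",
        geo_id_col_name="GEO_ID",
        derived_columns=[
            DerivedColumn("#population+total", total_population,
                          "total_pop", "Total population"),
            DerivedColumn("#population+children", children,
                          "children", "Population aged 0-17"),
        ],
    )
}

source = pd.DataFrame({
    "GEO_ID": ["A", "A", "B", "B"],
    "age_band": ["0-17", "18+", "0-17", "18+"],
    "count": [10, 30, 5, 20],
})

derived = derive_metrics(source, derivation_config["hypothetical_source_table"])
print(derived)
```

Keying the config by source table and carrying `geography_level` and `geo_id_col_name` alongside the columns would mean the derivation step no longer needs country-specific knowledge, which is the property the refactor is after.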

Also see if needed_datasets + source_metrics assets can be skipped entirely.

Following any refactoring, this pattern should be readily applicable both to other countries that are to be updated in the pipeline (e.g. Scotland, NI, England/Wales, USA) and to new countries being added that conform to this DAG pattern for how the data is provided.

sgreenbury commented 5 months ago

The original aim of this issue is superseded by the porting of Northern Ireland in #98. Consider whether to keep it open for incorporating all other census tables as metrics (@andrewphilipsmith for reference).