code-controlled LA defined configuration for platform instances

dotloadmovie commented 1 year ago

The easiest way to expose configuration options to the platform is through the use of environment variables. Working backwards from that, how could a given configuration best be version controlled and pushed to the dagster pipelines?

cyramic commented 1 year ago

The following options come to mind:

Using AWS Parameter Store This would involve a script that would need to be run to link EC2 to the parameter store. Something like this:

#!/bin/bash
DB_PASSWORD=$(aws ssm get-parameter --name /myapp/DB_PASSWORD --with-decryption --query "Parameter.Value" --output text)
export DB_PASSWORD
# You can now use $DB_PASSWORD as an environment variable in your application

The advantage of this is that it uses something that could be replicated locally and in development environments pretty easily using .env files.

The disadvantage is that the above bash script would need to be run on the code server whenever it spins up so that it can link those values to the right environment variables though simpler methods may exist.

Using JSON in S3 This method would involve a separate step in dagster itself to load these values into the environment. It may be easier to implement, but we'd need to make sure that every pipeline loads these parameters in a universal way.

If we have A JSON or YAML or whatever structure for configurations in a secure S3 bucket or even as an AWS parameter, we can then instruct dagster to load this value first before any processing is done. It can then set these values in a way that the rest of the pipeline can use.

The advantage of this is that this puts the file into more easily version-controlled and audited territory. We can run checks on this more easily and there isn't much added complexity to implementation

The disadvantage of this is that this would likely mean that this would not be able to use the environment to define this and some changes may need to be made to make sure the values are passed throughout the pipeline once the configuration is loaded which may make development more complex.

MagicMiranda commented 9 months ago

London has in principle accepted the concept. Conceptually, people uploading and file and giving instructions could be seen to different things.. it is managed by the interface. As the user interface stands, the upload IS the instruction (TR) true for multiple pipelines? When it gets more complex, it will be separate but for right now it is one and the same. Not sure what that means for future interface dev. Could have a bunch of pipelines queued up and the only thing stopping the user from running pipeline on data that should not be run through the pipeline is the user interface.. and it feels like it shoudl be more than that Arguably we already have this... one set of data, 2 pipelines.... and user has to explicitly tell us if they want one or other or both run?

Environmental variable: could be an instruction from the LA/ London Councils listing X LA and Y data run under Z pipelines etc etc. We want to site security that the instructions sit with London Councils and it lines up with the DSA/ processing doc. Excel file if we have to (to get the list).

We have a pipeline that splits already; 903 - PanAgg to London Councils 903 - further descoped PanAgg to CA

Task: Spike on effort to assess what would need doing, does it work with the DSA and how and when we will do it. Schema for how they provide the instructions. MH has already seen his draft... and it just needs the finer details. And a specific change management process. there are grey areas re specificity of instructions... MH to RAG rate the bits we know and a concrete proposal to react to and MH to tag the folks to input So that we could consider it for next sprint (one after this one 21st Feb)

Matthew can deploy an EV in the client instance but that is not scalable.

MagicMiranda commented 9 months ago

Requirements need to be completed by MH and Co. then split to urgent and non urgent dev

tomrintoul-SF commented 9 months ago

I think that for any requirement we need someone who understands the users and the business need to take responsibility for requirement definition.Especially because this is often iterative.

I sugegst this to avoid situations where we cant move forward quickly because the requirement is unclear and it isnt clear wither who has the info to fix that, and the autority to decide that this version is the thing to develop a solution against.

tomrintoul-SF commented 9 months ago

And I think Michael is that right person to do that here, so I have taken myself off of the issue.

MichaelHanksSF commented 6 months ago

When pan-agg has been created; run pipeline processes based on processing instructions file, using @asset decorators to load the data from the csv and import into @op functions

SSDA903: -- filtered by LA and YEAR and made available to appropriate partner for 'Children's Services Insights' (see Workflow permissions row 2) -- filtered by LA and YEAR, processing with public data (Tambe's code), made available to appropriate partner for 'Sufficiency Analysis' (see Workflow permissions row 3)
CIN Census: -- filtered by LA and YEAR and made available to appropriate partner for 'Children's Services Insights' (see Workflow permissions row 2)
Annex A: -- filtered by LA and YEAR and made available to appropriate partner for 'Children's Services Insights' (see Workflow permissions row 2)

MichaelHanksSF commented 5 months ago

Priority tasks within this:

[ ] Don't write PlacedAdoption, AD1 and Social Worker cleanfiles files to hub
[ ] Don't write PlacedAdoption, AD1 and Social Worker la-agg files to hub
[ ] sufficiency-output retention period is 6 years, rather than 11 years

MichaelHanksSF commented 5 months ago

@patrick-troy As discussed, this is now being handled entirely through yaml files within the pipeline. There are two points in the flow where this intervention is made: cleanfile

[x] file type: config determines which file types should be processed based on whether any use cases use data from these file types e.g. for 903 we know that neither of our existing use cases use a couple of the files
[x] retention: config applies a retention period based on the maximum retention needed for any use case using data from the dataset e.g. for 903 the maximum is 11 out of 6 and 11

use case processing using la-agg files This is where the config specific to a use case is applied. This can therefore correspond directly with the instructions passed to us from the processor

[x] la: which LAs are currently signed up to this use case
[x] retention: what is the retention for this use case
[x] file type: which file types are required for this use case
[x] columns: any specific columns to be removed for this use case

We'll manage this through a documented change management process that will contain all of the information required here. So our job each time there is a change accepted from the processor will be to take that information and amend the config files based on that info.

MichaelHanksSF commented 5 months ago

@cyramic and @patrick-troy to deploy config files as json rather than yaml

SocialFinanceDigitalLabs / sf-fons-platform

code-controlled LA defined configuration for platform instances #37