SocialFinanceDigitalLabs / sf-fons-platform

https://github.com/SocialFinanceDigitalLabs/sf-fons

Load publicly available ONS/Ofsted files onto data platform to be accessible by pipeline code #47

Open dotloadmovie opened 6 months ago

dotloadmovie commented 6 months ago

Business Case:

For the Dynamic Sufficiency tool produced for London, Commissioning Alliance hosted the Power BI tool and built the data model for it as an Azure data pipeline. This data model takes data from three sources:

For the East of England version of Dynamic Sufficiency, Hertfordshire will host the Power BI tool. For this, they will need the data modelling to have been performed on the data platform, which requires:

a) the publicly available ONS data to be imported to the data platform
b) a publicly available version of the Ofsted data to be imported to the data platform
c) the data modelling itself (that results in fact and dim tables) to be written at the end of the sufficiency-output pipeline code

Additional benefits to achieving all of this on the data platform are:

This ticket relates to a) and b): the addition of the publicly available ONS data and Ofsted data to the data platform.

Problem Statement:

In order to develop a full data model for Dynamic Sufficiency, we need the datasets that we are currently missing. Two of these are publicly available ONS tables and the other is the annual Ofsted file for providers. These need to be imported to the platform and made available to the SSDA903 pipeline code. Some data transformation steps should be performed on the postcode directory file before it is incorporated into the 903 pipeline code, as they are cumbersome and should run only when the input file is updated, rather than every time the 903 pipeline runs.
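One way to make the cumbersome transformation run only when the input file changes is to keep a digest of the last processed version and skip the work when it matches. This is a minimal sketch of that idea; the function and marker-file names are illustrative, not part of the existing pipeline code.

```python
# Hypothetical sketch: run the expensive postcode-directory transform only
# when the input file's contents have changed since the last run.
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 digest of the file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def transform_if_updated(input_file: Path, marker: Path, transform) -> bool:
    """Run `transform(input_file)` only if the file changed since last run.

    `marker` stores the digest of the last processed version.
    Returns True if the transform actually ran.
    """
    digest = file_digest(input_file)
    if marker.exists() and marker.read_text() == digest:
        return False  # unchanged input: skip the cumbersome transform
    transform(input_file)
    marker.write_text(digest)
    return True
```

On a scheduled pipeline this means the heavy step costs nothing on the (frequent) runs where the postcode directory has not been refreshed.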

Data Sets In Scope:

Within the Ofsted file, the tab that needs to be saved is "Provider_level_at_31_Aug_2023".
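Extracting just that tab could look like the sketch below, assuming pandas (with an Excel engine such as openpyxl) is available on the platform. The input and output paths are placeholders; only the sheet name comes from the ticket.

```python
# Hedged sketch: pull the provider-level tab out of the Ofsted workbook
# and persist it in a pipeline-friendly format. Paths are placeholders.
import pandas as pd

SHEET = "Provider_level_at_31_Aug_2023"  # tab named in the ticket


def save_provider_sheet(xlsx_path: str, out_csv: str) -> pd.DataFrame:
    """Read only the provider-level tab and write it out as CSV."""
    df = pd.read_excel(xlsx_path, sheet_name=SHEET)
    df.to_csv(out_csv, index=False)
    return df
```

Reading a single named sheet avoids loading the rest of the workbook into the pipeline at all.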

Use Case(s):

1 Social Finance

1. East of England

I need this so that I may:

5. IG Considerations:

Does the current IG cover this? - Yes

Other IG notes and/or actions:

6. Technical Proposal:

Steps:

Estimated cost to deliver:

Development time: X Sprints (Y weeks), Z Developers (agile deployment of different Developers as skills are needed)

END

dotloadmovie commented 6 months ago

I believe this is a relatively simple task - we need to run this as a separate pipeline with Michael's additional processing code at the core, wrapped in something that cURLs the dataset from the static URL provided by the public providers.
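The "cURL wrapper" part could be as simple as the stdlib-only sketch below. The URL and destination directory are placeholders, not the real ONS/Ofsted endpoints.

```python
# Sketch of the fetch step, assuming each dataset lives at a static public
# URL. The URL passed in is a placeholder supplied by the caller.
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen


def dataset_filename(url: str) -> str:
    """Derive a local file name from the URL's last path segment."""
    name = Path(urlparse(url).path).name
    return name or "dataset.bin"  # fall back if the URL has no file name


def fetch_dataset(url: str, dest_dir: str) -> Path:
    """Download the file at `url` into `dest_dir` and return the local path."""
    dest = Path(dest_dir) / dataset_filename(url)
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(url) as resp, open(dest, "wb") as out:
        out.write(resp.read())
    return dest
```

Because the providers publish at static URLs, the wrapper needs no authentication; scheduling and retry policy would sit in the surrounding pipeline.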

dotloadmovie commented 6 months ago

Branch created - secondary cURL functionality will be scripted here

MagicMiranda commented 6 months ago

Will run into next sprint, but it's a simple task. Dave is comfortable with the task: connecting to the external data set and making it available. Will talk to MH very soon. All happening at the FE of the solution; no security issues.

Is this step 1 of many or 1 and done? Good for any external data set in the future. Will be part of the EV effort. Instance determined etc... for now DT is responsible for timings and frequency for pulling new files in.

MagicMiranda commented 6 months ago

All 3 files are publicly available and the tickets have been merged. File refreshes will be logged in the log file, and the pipeline will check whether the expected file is present. These files are updated very infrequently. To be discussed by MH and DT. DR effort required when time and resources allow.
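The "look to see if the expected file is there" check, with the result written to the log, could be sketched like this. The directory and file names are illustrative only.

```python
# Sketch of the refresh check described above: verify the expected input
# file is present and log the outcome. Names/paths are placeholders.
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("file_refresh")


def check_expected_file(data_dir: str, expected_name: str) -> bool:
    """Return True if the expected input file exists, logging either way."""
    path = Path(data_dir) / expected_name
    if path.exists():
        log.info("Found expected file: %s", path)
        return True
    log.warning("Expected file missing: %s", path)
    return False
```

Since the source files change very infrequently, a warning in the log on a missing or stale file is likely enough; no alerting machinery is implied by the ticket.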