AvoinGLAM / h4og-dashboard

Hack4OpenGLAM automations, data parsing, and visualization
MIT License

Decouple data cleaning from code deployment #43

Open brylie opened 3 years ago

brylie commented 3 years ago

Context

The raw data for this project are collected via Google Forms and end up in a Google Sheet. The data contain personally identifiable information, specifically email addresses, which people can opt out of sharing publicly. However, those email addresses are used as primary keys to link people to related projects.

https://github.com/AvoinGLAM/h4og-dashboard/blob/de066efc1c1cb87cacd22fd7f4655d5724ce1b05/importer/src/import/import.js#L97-L100

This project uses a pre-processing script to prepare a data.json file that respects user privacy preferences.

https://github.com/AvoinGLAM/h4og-dashboard/blob/de066efc1c1cb87cacd22fd7f4655d5724ce1b05/importer/src/import/import.js#L110-L111

The pre-processing script is currently invoked as part of a build process that runs when there are changes to the codebase.

Since the data processing is coupled to the development process, the data may go stale if new submissions arrive without any accompanying code changes.

Proposal

Decouple the software development process from the data processing.

A developer can currently run the data processing step independently of software development. However, this requires a manual step, and the deployment container would not contain the processed data.

Data storage/access

Ideally, the processed data would be made public without coupling it to the software container, for example:

Upload cleaned data to public URL (e.g. S3)

Consider modifying the processing script to parse and upload the processed data to a public location (such as S3 or DigitalOcean Spaces) that the SPA can access directly. This would make the backend server obsolete and could simplify the overall deployment process (e.g., the static SPA code could deploy automatically via GitHub Pages).

Automate the processing script (e.g., via a serverless function)

Once the SPA can access the cleaned data via a public URL, the cleaning process might be automated by a serverless function that can run on a schedule.
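A scheduled function of that kind could be shaped as follows. This is a hedged sketch: the handler shape mirrors common serverless runtimes (e.g. an async Lambda-style handler), and `fetchRows`, `clean`, and `publish` are hypothetical names injected as dependencies so the orchestration stays testable.

```javascript
// Sketch of a scheduled serverless handler that refreshes the public data.
// The three steps are injected rather than hard-coded, so the same handler
// can run against the real Google Sheet or against test stubs.
function makeHandler({ fetchRows, clean, publish }) {
  return async function handler() {
    const rows = await fetchRows(); // read raw submissions (Google Sheet)
    const data = clean(rows);       // apply privacy preferences
    await publish(data);            // write data.json to the public bucket
    return { count: data.length };  // small summary for logs/monitoring
  };
}

module.exports = { makeHandler };
```

The platform's scheduler (cron trigger) would then invoke the handler periodically, so the published data tracks new submissions without any code deployment.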

Alternative: manually process data (periodically)

A developer can run the data processing script manually in a local development environment. This opens up some options, such as committing the processed data to the Git repository so it can be served from within the SPA (e.g., on GitHub Pages) rather than from an external service (such as S3).

Consideration(s)

The volume of data and the rate of change are fairly low for this project, so the manual work involved in refreshing the data should be weighed against the effort required to automate the process.