GSA-TTS / FAC

GSA's Federal Audit Clearinghouse

Develop strategies for data loading #3447

Open jadudm opened 4 months ago

jadudm commented 4 months ago

Story

We need to be able to repeatably load data into all of our environments. For example:

  1. We need to be able to load historic (public) data into our dev and lower environments for performance testing. This is hundreds of megabytes of SQL.
  2. Our API is specified as SQL. The API needs to be torn down before migrations and stood back up after them. We currently do this with a Django command that executes the SQL; it could also be done with plain psql calls.
  3. We have static data that we need for computation or meeting other regulatory needs.
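The tear-down/stand-up flow in item 2 can be sketched generically. This is a minimal sketch only, using Python's built-in sqlite3 as a stand-in for Postgres; the view and table names (api_general, general) are made up for illustration and are not the real FAC schema:

```python
import sqlite3

# Hypothetical SQL for the API layer; the real FAC SQL lives elsewhere.
TEARDOWN_SQL = "DROP VIEW IF EXISTS api_general;"
STANDUP_SQL = "CREATE VIEW api_general AS SELECT report_id FROM general;"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE general (report_id TEXT)")
conn.execute("INSERT INTO general VALUES ('2023-GSAFAC-0000000001')")

conn.executescript(TEARDOWN_SQL)   # tear down the API before migrations
# ... schema migrations would run here ...
conn.executescript(STANDUP_SQL)    # stand the API back up after migrations

print(conn.execute("SELECT report_id FROM api_general").fetchone()[0])
```

The same shape works whether the SQL is executed from a Django command (via the DB connection) or handed to psql with `-f`.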

Unfortunately, it is unclear how we should do this:

  1. In a documented, controlled manner
  2. Reliably
  3. Repeatably
  4. With a way to observe the process/results
  5. With a way to test the process/results
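For items 4 and 5, one concrete option is for the load step to emit row counts that a test can assert against. A minimal sketch, again using sqlite3 as a stand-in and a made-up table name:

```python
import sqlite3

def verify_load(conn, expected_counts):
    """Compare actual row counts against expected ones; return any mismatches."""
    mismatches = {}
    for table, expected in expected_counts.items():
        actual = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        if actual != expected:
            mismatches[table] = {"expected": expected, "actual": actual}
    return mismatches

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE general (report_id TEXT)")
conn.executemany("INSERT INTO general VALUES (?)", [("a",), ("b",)])

print(verify_load(conn, {"general": 2}))  # {} -> everything matched
```

An empty result means the load is observed to match expectations; a non-empty one is both a test failure and a useful log line.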

Background

https://github.com/GSA-TTS/FAC/issues/3262 is an exploration of some of this.

Should this be...

  1. A GH repo we pull and execute code from?
  2. Multiple repos?
  3. Something else?

We need a solution that works well in all of our environments. Whether it is a container, a repo, or something else, it needs to work up and down the stack, and it needs to be easily extended to future data loading needs.

gsa-suk commented 3 months ago

Disseminated public prod data, combined with 100 fake non-public records, is now available in https://drive.google.com/drive/folders/1gUsqD31Pkd17CruE4PWwwPKJVUssYNnI?usp=drive_link.

gsa-suk commented 3 months ago

A potential procedure for automated data loading: https://docs.google.com/document/d/1KlLpbVZr4JY3MdnxqNMuKjet4oTzCvzzMpY51n6RIBY/edit.