Propose strategy for developing with realistic-looking data pre-ATO

carjug commented 3 years ago

The general rule of thumb is: no production data until you have an ATO. So as we're building Unicorn, how do we do work and develop the solution pre-ATO, and have confidence in the solution?

We do have dummy data from the development server.

Given that the database migration is one of the key things we want to de-risk, how would we do this?

Do it on the development server. BUT we wouldn't be able to run the migration on staging because it has a copy of prod data. The concern is that the development server has a lot less data (only a few rows per table), and the data may look a little different, so we may not be able to iron out potential complications until we have to do the migration on prod.

How will we know if Unicorn is working properly if we can't compare the data retrieval outputs to MAPS?

We could ask to run scripts against prod to run checks but not copy data over.

Ultimately, what matters is what OCIO is ok with.

Options

Write a script to generate dummy data
Use development data and augment to make it more reflective of real data
Strip prod data of PII (e.g., copy of invoice from vendor, contract where they bid -- these are uploaded to the system but not retrievable)

Sheila prefers Option 3, because that would enable her to test the system better. Sheila can also easily prepare test data for us, inc a spreadsheet that we can bulk upload.

Tasks

[ ] Document what we'd like to do that would ideally involve production data/databases.
[ ] Document risks to project if we don't have access to prod data/dbs.
[ ] Prepare one or more potential strategies to reduce risks of using prod data.
[ ] Understand implications of recommended strategy on data migration testing path.
[ ] Discuss if we'd want to set up a dedicated testing environment with a stable test data set.
[ ] Reach out to OCIO to schedule a meeting to discuss, with intent of getting their sign-off (unless we haven't started intake process yet).

amymok commented 3 years ago

There are couple stages and work streams that would require a data strategy. It is best to talk about them separately because their needs of data are different.

New Application Development In this work stream, we would start minimally and continue to add more functionalities. We should only have the data that is necessary for the functionality being developed at any time in the cloud.gov database.

Database Migration In this work stream, we would want to migrate all data at once, and preserve the structures and data as-is first.

Merge Point When Database Migration is completed, we would want the data in New Application Development to comes from data in Database Migration through ETL. At this point, we will integrate the two work streams together.

After Merge Point This will mean that we have a clear working path from vendor's database to New Application Development's cloud.gov database. This will be done at the pace of the New Application Development: as data is needed to be ETL, it will then move from the completed Database Migration work stream to New Application Development work stream.

Strategy for data in New Application Development Use Prod data without PII, but only add data as needed

Milestone 1 - At this point, it doesn’t matter how this data looks like since the point is to make sure we bring up the application with some table/data. Pick out a table from current MAPS, figure out what columns we want, and create a few rows of dummy data manually.
In preparation for Milestone 2 and on: We pick sample table(s) from current prod MAPS (align with the user stories needs), and pick out a sample set of data and check with Sheila to make sure we strip out all the PII and replace with dummy data we agree on, and use these data. These data should be in the form of spreadsheets (a format that can let us transform and load into a database). Do we need to check with OCIO if we decide to use these prod data without PII? I would also recommend to write ETL scripts to load these data as if we are doing that from AWS PostgreSQL database to the cloud.gov database in New Application Development. This can mitigate some risks to integrate before the Merge Point (will need to flush out what this means)
We will continue to iterate on 2 until Merge Point.

Strategy for data in Database Migration Use data exist in current development MAPS database

Our plan is to first use sample database, so that is not relevant at this point.
The next step is to use MAPS development database Use whatever is in current database. The focus is to figure out if we can migrate the data correctly.

Strategy for Merge Point Use data used in New Application Development so far

From the table spreadsheets we created for the New Application Development:
- Wipe out the development MAPS database data for those tables (need to find out how we can do that, is that something the vendor needs to do)
- Load the table spreadsheets data into the development MAPS database (vendor's action?)
- Put the corresponding ETL scripts that we developed when doing New Application Development in place. This probably will need some fixing and changing for it to work.

Iterate until we have all we need.

After Merge Point is completed Use Prod data without PII, but only add data as needed

The process should still be:
- Check with Sheila to add new table spreadsheets
- Wipe out dev MAPS database with the new data
- Load table spreadsheets data into the dev MAPS database
- Develop ETL scripts to move data from dev MAPS db to New Application Development cloud.gov Database

amymok commented 3 years ago

Team, would like to see your thoughts on ☝️ . I think this is more of a high level view and align with what we are talking about, and @ninamak the budget lines discussion you are working on is more the details of which tables we want to start with.

@carjug @alexsoble love to hear from you on this, and feel free to suggest if anything doesn't sound right and if anything is missing.

alexsoble commented 3 years ago

[@amymok I gave this an initial read-through, there are tons of helpful ideas and strategies in what you wrote up! I want to spend more time with what you wrote // make sure I understand all the details // formulate any questions I might have.]

carjug commented 3 years ago

@amymok Thanks for all this information! It's super helpful to get it all out on "paper".

I may have more questions as we continue with this work, but my first question that popped up while I was reading through this was, under the "Strategy for Merge Point" heading there is a bullet that says

Wipe out the development MAPS database data for those tables (need to find out how we can do that, is that something the vendor needs to do)

And I wasn't quite sure what was meant by this. Can you expound on what "wipe out" means in this context?

amymok commented 3 years ago

@alexsoble yes, take your time. We can also talk thru it if that's helpful. It is just an initial thoughts, definitely nee to flush it out if this is somewhat the direction we want to go with.

@carjug, yes I have a lot of questions about how much we can do vs. we need from the vendor to do too. But let's say in the ideal world, I am envisioning as we are moving forward with getting more realistic data during our New Application Development work stream. As an example, we have created realistic data for the Budget, IAA, Direct Obligation tables, and that's all the data we need to use at that point. We can just work on it from the spreadsheet we created with Sheila and load that data into the cloud.gov database for development. And on the Database Migration work stream, we just keep on testing without worrying about the need to have realistic data. But once we are done with the Database Migration work stream, we want to complete the path from vendor's database to cloud.gov database. So in order to complete that, we will also have the ETL in place that will transform the data from vendor's development database and feed into cloud.gov database. So in order to have the same new realistic data we has been working with, we need to "wipe out" the tables in the vendor's development database, in this case, Budget, IAA, Direct Obligation tables, and replace the data in these tables to use the more realistic data we developed. So that path can be completed, and we can move forward with our development having the realistic data. So everything above is for development database, not staging, production database. Because we don't want to wipe those out.

The reason to have the complete path is that we want to test out that path before we test out on staging and production database. The path is needed because there may be new data coming in from the vendor's production application->vendor's database, so we want to make sure the data replication is in place for those environment until we are ready to cut the cord. And this complete path will allow the data replication to work with the ETL scripts to the cloud.gov database.

I am going to draw some diagrams to make is easier to understand than all these texts.

alexsoble commented 3 years ago

@amymok Ooooh, yes, diagrams I think will be very helpful! My thoughts/feedback/questions so far:

The concept of a "Merge Point" when the two workstreams meet is super helpful!
From your notes so far it's clear we'll need to use at least 2 if not all 3 of the options listed in the issue body
You mention both "sample database" and "development database", are these different things?
If we want to chop this discussion into smaller parts, the merge point could be a natural cutting point. For example, we could decide on our pre-merge-point approach and potentially defer some decisions about post-merge-point til later.

amymok commented 3 years ago

@alexsoble

From your notes so far it's clear we'll need to use at least 2 if not all 3 of the options listed in the issue body

Yes! I am trying to say that we probably need to use 2 to 3 of the options in the issue body, but detailing out when to use what, and which work streams to use which one.

You mention both "sample database" and "development database", are these different things?

For context, when I was working on #34, I introduced the concept on starting with "sample database" first before "development database". Here's the updated plan in Google Drive right now, it is still evolving as we find out here. The "sample database" is to use a database that AWS provided, so we can try things out on our own to identify the actual needs for doing the migration, so we can have a more targeted ask for OCIO and the vendor. And the "development database" is the one we actually have access right now from the vendor. In case there may be additional asks for the vendor, we want to be more informed about what that is we are asking for and don't want it to be a blocker to try things out because we are waiting for the vendor.

If we want to chop this discussion into smaller parts, the merge point could be a natural cutting point. For example, we could decide on our pre-merge-point approach and potentially defer some decisions about post-merge-point til later.

Yes, I think there is just so much to even test out with what we have right now up to that point, and plan can change because there are so much assumptions and unknowns leading there.

carjug commented 3 years ago

@amymok ok yes, I think a diagram would really help me! So far I'm having trouble understanding the utility in modifying the vendor's development database if we're moving away from their databases in general but I think I may be missing something.

ninamak commented 3 years ago

@amymok Thanks for laying this all out. I was a little confused about the "wipe out" part, but your above comment clarifies that for me a lot (I think). Instead of wiping out the vendor's developer db tables, could we instead wipe out the copy we'll create in ACF/HHS AWS? That way, we're not as dependent on the vendor? And instead of "wipe out," could we augment with the realistic-looking dummy data that Sheila helps create? That way, we still have data that is in both the new system and in MAPS, so we can ensure things we expect to work similarly in both systems are indeed doing that?

As for "Strategy for data in New Application Development," I agree that the Budget Lines work I'm doing will help inform the details of that strategy, so we should hold off on more specifics until we've had a chance to talk through that. The general principle seems to be that we'll generate our own data via spreadsheet, keeping things relatively simple yet still related to existing tables and data -- is that correct?

And I also agree that the Merge Point is a good place to cut. It's super helpful to get an early look into what you're thinking, but it seems like we'd get diminishing returns if we tried to flesh out many more post-Merge details at this time.

amymok commented 3 years ago

@ninamak I think the wipe out part we need to still flush out because we don't know if it is possible. The reason I am doing that is that is there is going to be a long time users are using both vendors MAPS1.0 app and our prototyping Unicorn app. And we need to sync up the data continuously, so we need to make sure that path to transform to our database is defined and working. If we use the ACF/HHS AWS, then we lost the data replication portion from vendor's to ACF/HHS. I think there may be other strategy to work around it. But yes, we should discuss more.

Yes, the reason to use spreadsheets that mimics the existing tables is because at one point we will need to connect from the actual MAPS1.0 PostgreSQL DB to our cloud.gov DB, and we want to develop the scripts to transform beforehand, so when that happens, it would be faster and easier to integrate.

@carjug @alexsoble @ninamak, I put my thoughts in some diagrams here. Probably need some walk thru, but feel free to take a look at it and comments if you have time.

ninamak commented 3 years ago

@amymok The diagrams make sense to me and I'm following what you propose! I can see how we'd be testing different aspects of the migration at various stages, working within the constraints we'll have. I left a couple comments -- one providing some thoughts on the likelihood of loading more realistic data into the dev DB, and one as a curiosity question.

A thing I'm noting for myself is that, when I work with Sheila to create the spreadsheets of realistic looking data, I think I'd like us to include some/all of the dev data, too, so that we can have more confidence that that data shows up in our new app similarly to how it's showing up in MAPS (assuming the dev data isn't total junk, which maybe it is).

This issue is currently assigned to me, but what is it you're expecting me to do with it next?

ninamak commented 3 years ago

Plan is to use eng co-work on 7/25 to walk through diagram

amymok commented 3 years ago

@carjug @alexsoble I added another version of a path we can take below the original one (inside dev environment). This one will load the spreadsheets to the ACF/HHS AWS Oracle DB. In this path, I think we need to turn off the data replication from vendor's Oracle to ACF/HHS AWS Oracle because I am unsure if we can pick and choose which table to do ongoing replication. It may only be at the schema level. And we actually do not need that anyways since we are in dev environment and we already tested out the path to do data replication from vendor's Oracle to ACF/HHS AWS Postgres.

FYI: @ninamak

ninamak commented 3 years ago

Thanks for the update! Makes sense to me. The only thing I was wondering about, though, is this part: "we already tested out the path to do data replication from vendor's Oracle to ACF/HHS AWS Postgres." This will be true, but we will have only tested with the data in the dev db, right? I recall the point being raised that that data may not be terribly representative of real world data, in terms of quantity, shape, and variability. So if we go with this alternate path, is there anything additional we'd want to do to more fully test out the data replication?

amymok commented 3 years ago

@ninamak Good question. We are still trying to mimic that from the ACF/HHS AWS Oracle database, so if there's any abnormalities or weirdness, it should still be caught there when we load the spreadsheets. It is just not at the vendor's database. The Vendor's database and ACF/HHS AWS Oracle database should be a direct copy.

amymok commented 3 years ago

@ninamak @carjug @alexsoble I pulled out the tasks list in the issue and answered them here. Anything I should add or change here before I ask Sheila to take a look?

Details are documented in the mural

Tasks

Document what we'd like to do that would ideally involve production data/databases.
- In summary, our strategy is broken down into two lanes:
- New RPM Development: Work with Sheila to create production-like data without PII as needed.
  - For each user story, we will pick out tables from MAPS1.0 to create these data in spreadsheets for us to use.
- Database migration from vendor's dev database to ACF/HHS AWS PostgreSQL database: Use what dummy data already in the vendor's dev database.
  - Since we are only looking at if the schema and data got moved and translated correctly, we do not need to use realistic data, and this also eliminate the need to load realistic data into vendor's dev database.
Document risks to project if we don't have access to prod data/dbs.
- We do not need access to the actual prod data/dbs, we will be creating production-like data without PII for new development.
Prepare one or more potential strategies to reduce risks of using prod data.
- We are deliberately creating the production-like data with Sheila, so these data will be audited and reviewed before we use it in dev environment.
Understand implications of recommended strategy on data migration testing path.
- We are separating the new RPM development path (production-like data without PII) and the data migration testing path (dummy data as-is from vendor's dev database). So we will not be loading any production data into vendor's dev database.
Discuss if we'd want to set up a dedicated testing environment with a stable test data set.
- We will develop and expand the data set as we need more production-like data. These production-like data will first be developed in spreadsheets, and will then be loaded into cloud.gov development environment.
Reach out to OCIO to schedule a meeting to discuss, with intent of getting their sign-off (unless we haven't started intake process yet)
- We have not started the intake process. But since we are only using production-like data without PII, we may not need to discuss prior to use.

ninamak commented 3 years ago

LGTM!

Anything I should add or change here before I ask Sheila to take a look?

How are you imaging Sheila will review all this? Are you thinking you'd send her over some links to review async, or schedule time to walk her through what's there? In the past, she's preferred something closer to you walking her through what she should pay attention to. Let me know if you want any help ID'ing the parts we'd want to call out for her attention.

amymok commented 3 years ago

@ninamak I am thinking about sending her this post, and just specifically the summary comment. I thought she wanted to know how she can help. I think there are too much details in the actual approach that is not necessary for her to know all of that, but more importantly about how we will involve her every step as we are doing new application development to create those production-like data without PII. I am happy to walk her thru at the high level what that means, unless she is interested in all the details.

carjug commented 3 years ago

LGTM, Amy!! 🚀

amymok commented 3 years ago

Closing this. Will update Sheila with our approach. In summary, we will ask Sheila to help us with creating production like data as we are implementing user stories. We will let her know what we need as we go, no action for her right now.

HHS / OPRE-OPS

Propose strategy for developing with realistic-looking data pre-ATO #43