dwyl / postgres-schema-migration-checker

[WiP] So you have a production PostgreSQL Database that you want to update and be sure you haven't lost any data...?

Why? What? How? #1


nelsonic commented 7 years ago

@iteles Please add the photos you took of the sketches from today into this description of the issue (or send them to me!) thanks! ⭐️

[Photos of the whiteboard sketches: img_0695, img_0698, img_0696, img_0697]

Note: @alexeyklyukin has good experience with PostgreSQL and gave a presentation on this problem at PGConf 2014: https://docs.google.com/presentation/d/1TV0bExFwVy-_d6C7A8Z2JL9Z9tvtkuZv3D58fkC3GWQ
It might be worth reaching out to him to understand what the "state of the tools" is in 2017...?

nelsonic commented 7 years ago

@des-des could this be useful:

des-des commented 7 years ago

OK, this exists for SQL Server: it finds the difference between two DBs and builds a migration script. This is the kind of thing we are interested in, but for PostgreSQL. https://opendbdiff.codeplex.com/

https://www.npmjs.com/package/dbdiff

des-des commented 7 years ago

OK, this is some JS that diffs two DBs: https://github.com/gimenete/dbdiff/blob/master/dbdiff.js.

It works for PostgreSQL!

des-des commented 7 years ago

What are we proposing to build?

The title here is postgres-schema-migration-checker. To me this suggests that we are building a CI tool to check whether a schema migration will work.

I.e. given a migration script, can we apply it to the database without breaking anything? This seems different from my impression of our discussion, but it makes more sense.

Given this outlook, I see the possible steps as (sketch after the list):

  1. Create a dummy database with the existing schema. Fill it with either anonymised or generated dummy data.
  2. Attempt to apply the migration; if it works, we are okay to apply the migration to production.
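
A minimal sketch of step 2's dry run, assuming a hypothetical migration that converts a users.signup_date column from varchar to date (the table and column names are made up for illustration):

```sql
-- Run the candidate migration inside a transaction on the dummy DB.
-- If any existing row cannot survive the change, the ALTER fails;
-- either way, the ROLLBACK leaves the dummy DB untouched.
BEGIN;

ALTER TABLE users
  ALTER COLUMN signup_date TYPE date
  USING signup_date::date;

ROLLBACK;  -- we only wanted to know whether the migration *would* succeed
```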

To outline what I think @nelsonic was suggesting (the whole migration process):

  1. Push up a feature branch with the schema change.
  2. Build a new DB from the schema.
  3. Try to transfer anonymised / mock data into the new database.
  4. Run tests on the new, populated DB.
  5. Success.
  6. Merge.
  7. Build a new DB from the schema.
  8. Transfer data from the old DB to the new DB.
  9. Switch production databases.

In this process, I am not sure we can depend on the reliability of transactions that happen during the copy in step 8.
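
For step 3, a rough sketch of the anonymisation pass, run on the copy and never on production (the users table and its columns are invented for illustration):

```sql
-- Replace personally identifiable values with deterministic stand-ins,
-- keeping row counts, types and uniqueness intact so the migration test
-- remains realistic.
UPDATE users
SET email     = md5(email) || '@example.com',
    full_name = 'user_' || id;
```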

What problem are we trying to solve?

Oxford Abstracts has many migration scripts. Each one is run on server startup (I assume this happens whenever new code is deployed). These migration scripts need to work even if they have been run before, as they get applied many times to a production DB.
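
For context, the usual way to make a script safe to re-run is idempotent guards; a minimal sketch with hypothetical table/column names (ADD COLUMN IF NOT EXISTS needs Postgres 9.6+):

```sql
-- Every statement is a no-op if it has already been applied,
-- so the whole script can run on every deploy.
CREATE TABLE IF NOT EXISTS submissions (
  id         serial PRIMARY KEY,
  title      text NOT NULL,
  created_at timestamptz NOT NULL DEFAULT now()
);

ALTER TABLE submissions
  ADD COLUMN IF NOT EXISTS reviewed boolean NOT NULL DEFAULT false;

CREATE INDEX IF NOT EXISTS submissions_reviewed_idx
  ON submissions (reviewed);
```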

As the number of migration scripts grows, this process is becoming hard to manage. The DB schema becomes harder to understand, since working out the current schema means taking the effect of many migration scripts into account.

I think another problem here is testing time. Building a test DB takes a long time, as all the migration scripts need to be applied.

@Conorc1000 @roryc89 Do you feel my description of your problem is accurate? Is there anything you can add?

Conorc1000 commented 7 years ago

@des-des I think this is a good description of the problem. (Yes, we currently run our migrations each time we deploy to Heroku.)

roryc89 commented 7 years ago

@des-des I agree that this is an accurate description of our most pressing schema migration issues. Thanks for helping out! :)

des-des commented 7 years ago

@nelsonic It seems to me that normally you would apply a migration script to a DB, rather than build a script that moves data between two DBs with different schemas. I think this might be confusing me.

This is also related to my previous question: during the actual deployment step, how can we safely keep the client/server live while the copy is in progress? E.g. if we send a POST to both the live DB and the DB being created, can we be sure the data will not get added twice?

@iteles Do you have the photos?

iteles commented 7 years ago

@des-des So sorry, I remember uploading them and checking the preview to make sure they were there but must have forgotten to hit 'Update comment' after that 😭

I've updated the top comment with them now. Apologies again.

nelsonic commented 7 years ago

Hi @des-des, yes, "normally" a migration would be applied to the DB. What we need to know is whether the schema migration will "break" any existing data in the database.

More "advanced" or "mature" schema migrators like Active Record will attempt to do this for you. We need a similar process for our hand-written SQL table definitions. It could be that there is already a tool out there that looks at a two versions a database with slightly different schemas e.g. changing a date field's type from varchar to date which then Tests if the existing data will be converted without corruption/loss.

Our idea was to investigate the option of using an existing schema migration script/tool, or, if we don't find one, to write one that does exactly what we need.

If we need to clarify the requirements further, I'm happy to do so. I agree that knowing exactly what we are trying to achieve before we set out on the journey will save huge amounts of time and frustration, so thanks for asking questions and please keep them coming! 👍

Also, to be clear: if we can make this project/work reasonably generic, i.e. it can be used to test whether the data in any existing Postgres database will be affected by a schema migration (a change in the structure of tables), we will be able to re-use it for any project that has a PostgreSQL DB. That includes many of our existing projects and several future ones, and crucially there are tens of thousands of projects out there where this would be useful, because Postgres is not "going away" any time soon... https://github.com/dwyl/how-to-choose-a-database/issues/4

des-des commented 7 years ago

@nelsonic awesome. That makes sense! Will try to put something together soon!

des-des commented 7 years ago

Ok, so I see there being three parts here:

1: Managing the application of migration scripts in production.

This is the focus of most of the tools @nelsonic has linked to. The main idea here is to ensure migration scripts are only applied once: the target DB holds a table of previously applied migrations. @Conorc1000 @roryc89 this provides a partial solution to your problem. Deploying a tool like these would help you, since your migration scripts would no longer need to guard against being applied twice. You could be confident that, for any DB, the migrations are applied only once, and in order.
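
The bookkeeping behind such tools is small; a minimal sketch of the idea (the table layout and version string here are hypothetical, not any particular tool's format):

```sql
CREATE TABLE IF NOT EXISTS schema_migrations (
  version    text        PRIMARY KEY,           -- e.g. a timestamped filename
  applied_at timestamptz NOT NULL DEFAULT now()
);

-- The runner applies each script and records it in the same transaction,
-- so a migration can never be half-applied, and the PRIMARY KEY stops it
-- from ever being recorded (or applied) twice:
BEGIN;
-- ...the migration script itself runs here...
INSERT INTO schema_migrations (version) VALUES ('20170401120000_add_users');
COMMIT;
```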

2: Automatic creation of migration scripts

Given a pull request with a schema change, can we automate the creation of a migration script? Yes, we can use a tool like dbdiff. Q: What is the UX when this fails? Q: This tool does not provide a rollback; is this a problem? Q: How can we ensure that

GetSchemaFromLiveDb(ApplyMigration(dbN-1, createMigrationFromDiff(schemaN-1, schemaN)))

and

GetSchemaFromLiveDb(createDbFromSchema(schemaN))

are the same? Q: Where are the migration scripts kept? Does the CI tool commit them to the repo? I think we just run the creation tool locally.
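
One hedged way to test that property without trusting either tool: run the same catalog query against the migrated DB and the freshly-built DB, then diff the two outputs. A sketch:

```sql
-- Run against each database; the deterministic ORDER BY makes the two
-- result sets directly diffable line-by-line.
SELECT table_name,
       column_name,
       data_type,
       is_nullable,
       column_default
FROM   information_schema.columns
WHERE  table_schema = 'public'
ORDER  BY table_name, ordinal_position;
```

(This only covers columns; indexes and constraints would need similar queries against pg_catalog, or a diff of pg_dump --schema-only output.)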

3: Testing that a migration does not lose data.

To me, although it may be tangential to the problem at hand, this project needs a way of being confident that migrations are not losing data before it leaves beta.

This is the problem Nelson's drawings describe. We can break it into two parts:

  1. Create a test DB as a playground to test the migration. We can do this in three ways: (a) copy the DB (may not keep client data secure); (b) create mock data by inspecting the schema (see the generate_series sketch after this list); (c) anonymise data from the prod DB and use that as the mock data. Q: Our tool should not access the data for security reasons, which means we cannot access the data in order to anonymise it. Would we need to build two separate tools? One has access to the data and can spit out what we want; the other (the CI tool we are discussing) does not have access. I am not 100% sure of the benefit here.
  2. Run tests on the mock DB:
    1. Build a tool to automate the creation of these tests. Q: Is this the way we want to go? This seems like a hard problem; should I focus my research here?
    2. Run tests built by the user. I.e. we have a staging DB with lots of mock data in it, and integration tests pulling data from this DB; we can check whether the migration makes them fail.
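
A sketch of option (b) from point 1 above: generating mock rows purely from the column types, never from real data. The users table and its columns are hypothetical, and signup_date is stored as text to match the pre-migration schema used in the earlier examples:

```sql
-- 10,000 synthetic rows shaped only by the schema, containing no real data.
INSERT INTO users (id, email, signup_date)
SELECT n,
       'user' || n || '@example.com',
       (date '2016-01-01' + (n % 365))::text  -- varchar column pre-migration
FROM   generate_series(1, 10000) AS n;
```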
nelsonic commented 7 years ago

@des-des yes, this would be a good approach. 👍

des-des commented 7 years ago

@nelsonic could you be more specific?

nelsonic commented 7 years ago

@des-des the steps you have described don't appear logical to me. The UX when the migration step fails is: exit with a non-zero code, with terminal output consisting of the reason why the script failed.

Migration scripts would be automatically created on the developer's machine by the script, i.e. on localhost, and then preferably included in the commit/PR so that we can version-control them.

As for having access to production data: we can simulate valid records for both schemas based on the column data types. Then it should be straightforward to attempt to insert a row that is valid for one schema into the same table(s) in the revised schema and check whether it worked. 👍
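
A sketch of that last idea, using hypothetical users_old / users_new tables where the only difference is signup_date being varchar vs. date:

```sql
-- Old schema: signup_date is varchar, so any string is a valid record.
INSERT INTO users_old (email, signup_date)
VALUES ('test@example.com', '2017-13-45');  -- nonsense date, but valid varchar

-- Revised schema: signup_date is date. The same simulated record is now
-- rejected by Postgres's date parsing, flagging that real data of this
-- shape would be corrupted or lost by the migration.
INSERT INTO users_new (email, signup_date)
VALUES ('test@example.com', '2017-13-45');
```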

des-des commented 7 years ago

Sorry to have left this so long. FAC1N happened and kind of took over for a couple of weeks, and then updating back here got lost in my todo list.

About a month ago I spoke to the OA team; I'll quickly summarise the outcome of that conversation. OA have actually gone ahead and solved part of their problem: their DB now holds a table of applied migrations, which means that, for any instance of their DB, they can make sure migrations are run once, in order.

Anyway, I think we need to step back a little and think about this problem in the context of dwyl's new stack...

iteles commented 7 years ago

@roryc89 @Conorc1000 @naazy Is this still required for your project?

Postgres is still part of our stack, so this functionality may still be required.

des-des commented 7 years ago

@iteles

@roryc89 @Conorc1000 @naazy and if the answer is yes, can you be explicit about what is needed?

roryc89 commented 7 years ago

@iteles and @des-des: as our main DB pain point, the running of migrations multiple times, has been addressed, I don't think we require a data migration tool in our project at the moment, although we may need it in the future.

nelsonic commented 7 years ago

@roryc89 thanks for replying and confirming. 👍 If that's the case, and given that we are moving most of our other projects to "PETE", which has really good schema migrations, should we put this project on the "back burner" for now? 💭

khairnarTK commented 5 years ago

Is it possible to write a Python script to migrate production DB data to a staging DB in PostgreSQL?

nelsonic commented 5 years ago

@khairnarTK it's definitely possible, Django has migrations: https://docs.djangoproject.com/en/2.2/topics/migrations/