coral-erm / coral

CORAL ERM main repository

http://coral-erm.org/

Other

52 stars 64 forks source link

Database Versioning #285

Closed jcuenod closed 1 year ago

jcuenod commented 7 years ago

I was chatting to @jeffnm about the Installer modifier and I'm thinking about how to gather information about what's installed. One of the issues is that knowing what version we're sitting on for the DB (and specific tables within the DB) is pretty important. Apparently @t4k has been thinking about this as well - please weigh in if you have thoughts!

Right now one thing I'm thinking is to store details in the config file but I'm reluctant to do that because it's not "configuration." I don't know how rails does it but it's pretty awesome (their migrations thing). If anyone has ideas about versioning the db, that would be great. Also, I don't think I've completely problematised the issue developers face as sql files are built to provide for upgrades so please weigh in on that score as well.

PaulPoulain commented 7 years ago

The way we do it in Koha: The SQL table systempreferences contains many parameters and preferences (more than 500 today !!!!) One of them is "KohaVersion". This syspref contain the database number. The script updatedatabase check again the DB number, and applies new changes. Example here : http://git.koha-community.org/gitweb/?p=koha.git;a=blob;f=installer/data/mysql/updatedatabase.pl;h=134280678e8b46db3cc5a2a8267f8dafb17728d9;hb=HEAD

Retrieve the cuttent version : 81 my $original_version = C4::Context->preference("Version"); Applies the changes for version 3.00.00.001 if needed

  83 # Deal with virtualshelves
  84 my $DBversion = "3.00.00.001";
  85 if (C4::Context->preference("Version") < TransformToNum($DBversion)) {
  86     # update virtualshelves table to
  87     #
  88     $dbh->do("ALTER TABLE `bookshelf` RENAME `virtualshelves`");
  89     $dbh->do("ALTER TABLE `shelfcontents` RENAME `virtualshelfcontents`");
  90     $dbh->do("ALTER TABLE `virtualshelfcontents` ADD `biblionumber` INT( 11 ) NOT NULL default '0' AFTER shelfnumber");
  91     $dbh->do("UPDATE `virtualshelfcontents` SET biblionumber=(SELECT biblionumber FROM items WHERE items.itemnumber=virtualshelfcontents.itemnumber)");
  92     # drop all foreign keys : otherwise, we can't drop itemnumber field.
  93     DropAllForeignKeys('virtualshelfcontents');
  94     $dbh->do("ALTER TABLE `virtualshelfcontents` ADD KEY biblionumber (biblionumber)");
  95     # create the new foreign keys (on biblionumber)
  96     $dbh->do("ALTER TABLE `virtualshelfcontents` ADD CONSTRAINT `virtualshelfcontents_ibfk_1` FOREIGN KEY (`shelfnumber`) REFERENCES `virtualshelves` (`shelfnumber`) ON DELETE CASCADE ON UPDATE CASCADE");
  97     # re-create the foreign key on virtualshelf
  98     $dbh->do("ALTER TABLE `virtualshelfcontents` ADD CONSTRAINT `shelfcontents_ibfk_2` FOREIGN KEY (`biblionumber`) REFERENCES `biblio` (`biblionumber`) ON DELETE CASCADE ON UPDATE CASCADE");
  99     $dbh->do("ALTER TABLE `virtualshelfcontents` DROP `itemnumber`");
Print on the console, reporting the update to the sysadmin
 100     print "Upgrade to $DBversion done (virtualshelves)\n";

Set the version. Thus, if there's a ctrl-c before the end, we've stored that we've already bumped to 3.00.00.001

 101     SetVersion ($DBversion);
 102 }

Now apply 3.00.00.002

 105 $DBversion = "3.00.00.002";
 106 if (C4::Context->preference("Version") < TransformToNum($DBversion)) {
 107     $dbh->do("DROP TABLE sessions");
 108     $dbh->do("CREATE TABLE `sessions` (
 109   `id` varchar(32) NOT NULL,
 110   `a_session` text NOT NULL,
 111   UNIQUE KEY `id` (`id`)
 112 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;");
 113     print "Upgrade to $DBversion done (sessions uses CGI::session, new table structure for sessions)\n";
 114     SetVersion ($DBversion);
 115 }

116 We're using this for more than 10 years now.

Before version 3.0, we had another script that was checking each table/index/FK/ ... to see if it was here and/or has to be updated. It became a nightmare because checking everything is VERY hard when the software evolves a lot.

See http://git.koha-community.org/gitweb/?p=koha.git;a=blob;f=installer/data/mysql/update22to30.pl;h=f10ace95c5ab624adb9490304e7cb30a67434ee7;hb=HEAD if you're curious

t4k commented 7 years ago

Thanks for prompting me to finish my thoughts on this, @jcuenod. I have thought a lot about this and never finished the write-up that I promised @jeffnm.

Problems

Database schema updates are proposed and accepted at various points in the development cycle. Now that we have a development branch in the project, developers working on the project are expected to be working off that branch to make new improvements.

In the preparation for the 2.0 release the compilation of *.sql files became a little unwieldy. Some of the files were not included in both the install.sql file and the update_2.0.0.sql file. Maintaining two copies of the same SQL statements will likely lead to this problem recurring.

Manual application of *.sql files is quite tedious when trying to test pull requests or, more importantly, when doing version updates for end users.

Ideas

Similar to @PaulPoulain’s example above, I propose adding a System table to all the databases (for now, until we consolidate) that contains at least a schemaVersion column. This could be the same or different for each CORAL module, I don’t think it matters too much.

We could consolidate our install.sql and update_X.X.X.sql files into single system. The installer would read only one set of instructions, but they would be ever growing as we continue to update the code. For each schema change in every pull request we would assign a sequential number, independent of the CORAL release version. This could be in an update function name, or a variable, or whatever mechanism we use to add additional schema changes. The installer would then read through every update number in sequence while it is doing the a fresh install, and then on an update, it would look up the schema version for the db table and only run the schema updates with numbers greater than the current one.

Developers would then have a way to check out a version of the code, which have built-in schema numbers in the SQL update file. If the newest code requires a schema update it is as easy as running the same update script every time (a UI would need to be made for end users).

I think I’ve covered all my thoughts on this. I’m glad this discussion is happening!

PaulPoulain commented 7 years ago

continuing with schema update thoughts and ideas:

install & update in a single system: yes, but we can also do something like "install is updated only when releasing a major version, and update is used for all versions". It means that someone installing (from scratch a version 2.6 would run install.sql (2.0) then update.sql (2.0 -> 2.6). When releasing 3.0, install.sql is immediately 3.0 compliant, and update.sql will be run, but do nothing. The install then update can be automated during installation process.
about the sequential number independent from the release version I'm not in favor of it. It's very useful to be able to answer easily this question: "in which version has field resource.blablaNewField been added" ? looking at update.sql, with a purely sequential number, you can't say. With a version-sequential, you know. In Koha, you can see in updatedatabase that a given change has been made in 3.22.00.003 => the 4th part is the sequential part. It's the 3rd DB change in version 3.22.00 (1st release of version 3.22)
about the system table, there's a larger question: where (and how) do we want to store parameters. Currently, some are stored in config.php files. I don't think it's the best option. If a library want to switch enhancedCostHistory from N to Y, it must be done by an IT person. This kind of change should be doable by a librarian, from the interface. So I think that parameters must clearly be splitted between technical parameters (like DB login/password that must be stored in config.php) and more functional parameters that could be managed from the interface. My feeling here is that most parameters in the current [settings] section of config.php could be moved to a "librarian driven" interface.

As a consequence, those parameters must be stored in the database (we don't want librarians to have permission to update config.php, don't we ?) . Several options/ideas here:

a system table with a column for each parameter. pro: easy to read, cons: lot of columns, only one line
a system table with 2 columns ("parameterName" and "parameterValue"). pro: easier to extend, cons: getting a parameter requires a "SELECT parameterValue from system WHERE parameterName=?"
- a system table with a single blob field, containing a json containing all parameters. pro: easy to extend, easy to read, cons: harder to manage (impossible to update directly a parameter with an UPDATE ...) [Can be really painful for developers]

I'm aware that this change would be large & not that easy to implement... (maybe it's something for the roadmap ???)

Last topic: how to deal with developers setups ? In the Koha project, we've something that is not 100% satisfactory but we have not found a better way: if a developer does a patch/PR that changes something in the DB, (s)he creates a file named update_XX.sql (with XX being XX). The installer/upgrader, if such a file exist, automatically runs it. When merging the PR, the update_XX.sql file is removed and his content is put into the updatedatabase script, with the final official number (this is done by the "Release Manager")

jeffnm commented 7 years ago

These are some great suggestions for ways we can improve our database versioning, particularly during development.

Database version comments

I like the idea of marking each DB change with a sequential tracker and I see Paul's point about integrating CORAL's version into that number. However, I see that we don't always know which version of CORAL a DB change will be released in, or there may be a bug fix release that occurs after a DB changes have been merged into the development branch, but before a new release has occurred. The goal is that each database change is always applied after the previous one and should never be run again if it's already been applied to the database, but with version tagged sequences.

The fact that we have multiple databases makes this harder. I think adding a System table to all databases with schemaVersion is appropriate. I would like to see us find a way to use the same schemaVersion value for all modules, but since a merged database will probably be a version 3 release, I'm ok with using Paul's suggestion of including the major release version in the schemaVersion. That way, if each module has a different schema version in coral 2.1, those different schema version numbers will not be a major hurdle when merging into a single database. We can simply start with 3.sequential db change.

@jcuenod would merging install.sql and update_x_x.sql files like @t4k suggested cause major problems for the installer?

Configuration comments

@PaulPoulain I agree entirely that we should move many of our config file settings into the database and provide UI options for the users to change them in CORAL.

In addition to the significant database changes that you point out would be needed, I think we also need to consider having an overhauled settings/administration section that looks at what modules are enabled and provides a consistent and centralized place to modify relevant settings for each of those modules.

The biggest place I see this need is in User management. We should be able to toggle permissions for a single user across all modules in one place.

As @jcuenod is looking at ways to allow users to enable or disable modules after installation, having a central place to put configuration UI would be very helpful.

PaulPoulain commented 7 years ago

DB version comment of Jeff comments

about " I see that we don't always know which version of CORAL a DB change will be released in". Correct. And that's why the Koha project use update_XXX.pl trick: the DB version is set when merging the PR. Additional info: that's also a reason to use odd/even numbers: when working (merging), Koha DB updates use odd number. At the time of release, the Release manager creates an even version number, that's this number used as release #. Example: we had 3.20, next stable will be 3.22. DB changes between 3.20 and 3.22 release are numbered 3.21.00.000, 3.21.00.001, 3.21.00.002 ...
I had completely forgotten the multiple DB problem ! I agree that we must have a System table for each module with shemaVersion field. preparing the future, what about having a shemaVersionResource field for the Resources module, a schemaVersionLicense for licensing, ... ? when merging DB, that will help us (OTOH, we will probably have only 1 schemaVersion field once DB merged. So maybe not a good idea)

Configuration comments

What about putting module enable/disable in the UI as well. It's not a very technical topic, more a functional one. a librarian may want to enable the licensing module, he should not need the IT to do that.
+1000 for user/permission management all in one place. ATM, if I'm not wrong, a given user can have a different userID across modules. That's one of the biggest problem for merging DB (IIRC what @veggiematts told me a while ago)

t4k commented 7 years ago

A few more comments.

install & update in a single system: yes, but we can also do something like "install is updated only when releasing a major version, and update is used for all versions". It means that someone installing (from scratch a version 2.6 would run install.sql (2.0) then update.sql (2.0 -> 2.6). When releasing 3.0, install.sql is immediately 3.0 compliant, and update.sql will be run, but do nothing. The install then update can be automated during installation process.

This seems overly complicated to me. Why wouldn’t the installer just run every SQL statement that is available so that a fresh install is immediately up-to-date with the latest schema? A two-step, install then update, process seems like extra work.

Also, I may not have been clear in my ideas section above. I imagine it would be best to not run *.sql files directly, but include the queries in functions/methods that are part of the sequentially named updates. That way the installer and the updater can both run the same code.

Last topic: how to deal with developers setups ? In the Koha project, we’ve something that is not 100% satisfactory but we have not found a better way: if a developer does a patch/PR that changes something in the DB, (s)he creates a file named update_XX.sql (with XX being XX). The installer/upgrader, if such a file exist, automatically runs it. When merging the PR, the update_XX.sql file is removed and his content is put into the updatedatabase script, with the final official number (this is done by the “Release Manager”)

To me, the fact that an update_XX.sql is temporary is inefficient. Why not have developers make a PR that updates the schema file (schema.php or something) directly? If the concern is that schema version numbers might conflict when multiple people make PRs at the same time, I don’t see that as a big problem because so few PRs deal with schema changes as it is. The idea of a “final official number” for a schema change is not as important as simply having a unique incrementing number so that the update is included when running the installer or updater. It could be dealt with when resolving merge conflicts instead of imposing inefficiencies on all developers and release managers.

The benefit of this would be that developers have an easy way to test schema changes that behaves the same way as our end users. Simply run the installer or updater after pulling the latest code and you will be good to go. If everything tests well then the PR could be merged straight away without mucking about removing files and adding code to another file.

jcuenod commented 7 years ago

Okay, just to document, I was thinking about the way rails handles this (which is pretty cool IMHO) and I found this cool explanation: https://stackoverflow.com/a/8236416

PaulPoulain commented 7 years ago

@jcuenod thanks for the link, very interesting. Unclear to me whether it's a solution we could use for Coral. My feeling is that it's a little bit too complex, but I may be too shy...

@t4k about the 2 steps (install + update) I don't understand your question: we already have an existing schema (the install), and we need to update existing users database, so I don't see how we could have only 1 file for everything.

@t4k temporary _xx.sql inefficient: I agree (that's why I wrote 'not 100% satisfactory'). But before we used this "trick", we had a lot of problems with conflicting PR. The Koha project is much larger than Coral (for now, at least ;-) ), we've 200+ [1] PR waiting at anytime, so it was a nightmare for developers to rebase & resubmit their patches everytime someone else patch is merged and create a conflict with my pending patch. Maybe the Coral project is small enough to discard this risk. Or maybe we plan to become a large project, and it's easier to build long term foundations. TBD !

[1] today, it's 205: https://bugs.koha-community.org/bugzilla3/buglist.cgi?cmdtype=dorem&list_id=212244&namedcmd=Needs%20Signoff&remaction=run&sharer_id=1, plus 94 tested and waiting for QA control https://bugs.koha-community.org/bugzilla3/buglist.cgi?cmdtype=dorem&list_id=212245&namedcmd=Signed%20Off&remaction=run&sharer_id=1 !!!

jeffnm commented 7 years ago

Correct me if I'm wrong, but I think we are all saying we want to have a sequence of SQL statements that are run to take an instance of CORAL from whatever version it's at (including not yet existing) to the most recent version.

It would make sense for the installer/upgrader to actually be one tool that looks at all the SQL statements (written in PHP like Paul's Koha example) and compares it to the database to determine which ones still need to be run.

Using time stamps like @jcuenod's Rails example would provide almost guaranteed conflict free sequential updates, but a Major.Minor.DBSequence schema numbering system like @PaulPoulain's example would probably be fine too.

The key is that each database change is entered discretely and is uniquely identifiable so that the database install/upgrade process can run at any time and not fail with errors because some statement had already been run. And if it's stored in a PHP file rather than an SQL file the different steps could all be stored in a single file.

I'm convinced that keeping this information in the database is necessary. The only question in my mind is whether we keep a single schemaVersion or if we keep a table full of schemaMigrations values. I think a schemaMigrations table would be less error prone.

PaulPoulain commented 7 years ago

@jeffnm having a schemaMigration table with all SQL to run can be an option. I just want to point that it can be the source of some problems when some users could run A/B/C and others A/C/B: in some circumstances, running B + C don't have the same result as running C + B. Quite uncommon, but can happen.

Example:

A= CREATE TABLE XXX (with field A1 and A2)
B= ALTER TABLE XXX ADD field B
C= ALTER TABLE XXX ADD field C Depending on the order, the table will have the columns: A1,A2,B,C or A1,A2,C,B Which should be harmless in theory, but you know, theory and practice is the same, but only in theory, not in practice...

t4k commented 7 years ago

I just want to point that it can be the source of some problems when some users could run A/B/C and others A/C/B: in some circumstances, running B + C don’t have the same result as running C + B. Quite uncommon, but can happen.

I don’t understand how this would ever happen if we have an installer/updater running the database changes. They would be run in the order specified in the numbering sequence. No human would ever apply individual updates.

@t4k about the 2 steps (install + update) I don’t understand your question: we already have an existing schema (the install), and we need to update existing users database, so I don’t see how we could have only 1 file for everything.

We would have to label all existing installs in the database with a particular version first, before the updater can run.

The Rails example is informative. Here is another example of how this can be done and how they do it in Drupal: https://api.drupal.org/api/drupal/modules%21system%21system.api.php/function/hook_update_N/7.x http://cgit.drupalcode.org/drupal/tree/modules/user/user.install?h=7.x

In the second link, look at the bottom for the update functions. There is also a user_update_dependencies() functions that explicitly spells out which updates require others to come before and a whole system of checking behind it.

PaulPoulain commented 7 years ago

@t4k if you don't understand what I'm saying, then probably there's something I don't understand in your proposal. A Proof of Concept could help me...

@t4k : about the labeling of all existing install, OK, got it now, thanks

fondrenlibrary commented 7 years ago

The article found at the following link is interesting along with its comments. https://www.infoq.com/articles/db-versioning-scripts#anch100760

jcuenod commented 7 years ago

@fondrenlibrary agreed, interesting read. Are the comments making you think in favour of using a 3rd party tool for the job?

fondrenlibrary commented 7 years ago

No clue until some test can be tried~

jeffnm commented 6 years ago

@jcuenod As you work on #302 will you add the version tracking tables? If so, can you describe your proposed database changes to accomplish that?