A proposal for Improving the Safety of Extension Upgrades

lithp commented 8 years ago

It being well-known that updates require two parts; loading of the shared-library and updating of the extension; and understanding that these independent operations, despite their independence, require a certain ordering lest Postgres experience some error, or worse, the sufferance of segfault; I do humbly offer it to consideration that this unfortunate potential be reduced through such means as will be briefly outlined.

Firstly, that the shared-object, the industrious engine of Magick powering Citus, shall be provided some means of determining its version.

Secondly, Postgres providing no mechanism for the ordering of version numbers, preferring instead to wander through the Graph of versions by means of paths it dynamically constructs, that the Regex \d+\.\d+-\d+ (ex. 6.0-13) shall be strictly enforced and the shared-object taught a total ordering of versions using the obvious piecewise-numeric method.

Thirdly, that the upgrade-scripts, which currently operate much like Goethe's famous broomstick: without any regard for Expediency, shall be given such self-awareness as required to prevent the aforementioned segfaults, namely: that they shall refuse to run unless the shared-object reports being of a greater or equal version than the script itself.

I have duly submitted a request-for-pull, recorded here, and hope the remainder of these tasks may be soon addressed.

anarazel commented 8 years ago

(hm, replied via email, didn't take so far)

I don't think this is quite the right approach. We need to essentially make sure of the following:

The extension version may not be different than the "effective" shared library version. Effective meaning it's the one that's currently loaded, which will often be different from the one that's currently installed. If that's not the case, we need to refuse to use most of citus, but we still need to allow upgrading citus. Note that we cannot run queries etc with a newer loaded shared library either, as that'd potentially assume a different schema being in effect and thus crashing or worse do something returning invalid data or corrupting the schema.

I think this basically means that we need to check the extension version somewhere around where we do CitusHasBeenLoaded() checks. There we need to figure out the shared library and extension versions, and enforce that they match exactly. We don't need any SQL callable function for that - although probably useful independently - we can just check the corresponding variable in C code, and error out if that's not corresponding.

We might additionally want to have a check in one of the newer extension upgrade scripts that enforce that we're not run with a shared library that doesn't have the enforcement mechanism - that's easy, e.g. the creation of the function returning the loaded extension's "shared library version" will do that, because older functions wont have the correct symbol.

Let's chat in a bit in person?

marcocitus commented 8 years ago

I think this basically means that we need to check the extension version somewhere around where we do CitusHasBeenLoaded() checks.

Doesn't that mean that we will error out on basically everything? e.g. even \dx?

anarazel commented 8 years ago

@marcocitus That's why I'm thinking we have to do it around the places we do HasBeenLoaded(), not necessarily in it. But there's really not much we can allow - even metadata for checking whether a table is distributed is potentially unsafe with a divergent schema. As that's required for our planner hook, I don't think there's much choice :(

lithp commented 8 years ago

Enforcing exact matches means every Citus upgrade will require downtime. The downtime will usually be small but becomes arbitrarily large should the upgrade script fail for whatever reason. It sounds much better to make the code backwards-compatible with the SQL. You need to run UPDATE to take advantage of new features, but until you run UPDATE all the old features will keep working.

We've had difficulties with divergent schema but that's because we haven't been keeping the problem in mind when writing code. What if we add test infrastructure to check that Citus still works before the new schema has been applied?

anarazel commented 8 years ago

Enforcing exact matches means every Citus upgrade will require downtime.

I think that's pretty fundamentally required. You can't do much in an extension upgrade script in the first place with the old shared library loaded, e.g. you can't create new functions and such, because the symbols won't be visible/existing.

It sounds much better to make the code backwards-compatible with the SQL.

I think that's much more work than a) it sounds at first b) it's worth. Consider e.g. the nodeport thing - a very simple case - we need to add checks for the schema state to several places (reading, writing, and lookups). In other, more complicated, cases you need significant parallel infrastructure. Consider e.g. placementId or colocationId's not existing. Life gets complicated.

marcocitus commented 8 years ago

I quite like Brian's proposal for this. There is no way to reliably handle the schema being ahead of the code, but in many cases we could even operate normally when the code is ahead of the schema. If we error out by making the upgrade script check the version number, then the former can never occur. The latter can be handled on a case-by-case basis. For example, introducing a new UDF should almost never lead to issues.

anarazel commented 8 years ago

I quite like Brian's proposal for this. There is no way to reliably handle the schema being ahead of the code, but in many cases we could even operate normally when the code is ahead of the schema.

Neither placementId, nor colocationId introduction, nor the int8->int4 conversion, all discussed in this release, would be helped by having that. So it certainly doesn't help all that much.

The latter can be handled on a case-by-case basis.

Given that there were multiple bugs around this in this release alone, I think that's too optimistic. It also requires cluttering up the code with compatibility assumptions, which we essentially never can remove, as we need to support upgrading from an older version.

marcocitus commented 8 years ago

Neither placementId, nor colocationId introduction, nor the int8->int4 conversion, all discussed in this release, would be helped by having that. So it certainly doesn't help all that much.

So far we've mainly discussed altering after restart, but the user could just as easily alter before a restart which could lead to all sorts of bad things, including corrupting the metadata. I'd rather prevent that, independent of the altering after a restart problem.

anarazel commented 8 years ago

Neither placementId, nor colocationId introduction, nor the int8->int4 conversion, all discussed in this release, would be helped by having that. So it certainly doesn't help all that much.

So far we've mainly discussed altering after restart, but the user could just as easily alter before a restart which could lead to all sorts of bad things, including corrupting the metadata. I'd rather prevent it outright.

Completely agreed. But I don't think that really addresses the concern in the comment you quoted? Running new code without placementId e.g. requires checking for nulls in TupleToShardPlacement() which in turn will yield the correct but hard to understand:

        ereport(ERROR, (errmsg("unexpected null in pg_dist_shard_placement_tuple")));

that'll already prevent all queries from running. For the int8->int4 thing we need code checking the current schema state. And we can't really remove either anytime soon. I quite strongly suspect that most upgrades will have changes like this, we're pretty damn far from having a perfect schema ;)

The complications of having to think about compatibility in these cases, the likelihood of getting it wrong and the difficulty of reliably testing this, all seems to suggest that just forbidding running in such a configuration is a much better idea.

marcocitus commented 8 years ago

I think doing a version check instead of individual feature checks is fine. I'm mostly saying that a) I think erroring out on a premature alter extension is a good idea. b) I'd rather check schemaVersion >= 6.0-1 than schemaVersion == 6.0-2 and bump the version number in cases where we change metadata tables.

anarazel commented 8 years ago

b) I'd rather check schemaVersion >= 6.0-1 than schemaVersion == 6.0-2 and bump the version number in cases where we change metadata tables.

ISTM that that's a step that's often going to be forgotten, and over the course of a release there'll likely be a such a change, so there simply seems little benefit in the added complexity.

marcocitus commented 8 years ago

ISTM that that's a step that's often going to be forgotten, and over the course of a release there'll likely be a such a change, so there simply seems little benefit in the added complexity.

There's no benefit for developers, but it simplifies deployment and minimizes downtime for users.

For example, we typically make several connection attempts before erroring out, meaning we can usually survive a worker restart (though some of the retry logic could be improved). However, we cannot survive a worker version error. For modifications, we'd currently mark placements as inactive if we saw a version error, while a loose version check would have allowed the write to proceed. In a heavily loaded system this means an upgrade will more often require expensive shard repairs. The version errors would also be quite visible to applications which might become very annoying.

anarazel commented 8 years ago

There's no benefit for developers, but it simplifies deployment and minimizes downtime for users.

I'm not convinced. How's getting an obscure error message any better than a proper one?

For example, we typically make several connection attempts before erroring out, meaning we can usually survive a worker restart (though some of the retry logic could be improved). However, we cannot survive a worker version error.

For that it'd be nice if we could figure out a way to only error out if distributed tables are accessed. That'd mean we need some limited form of compatibility for pg_distshard, but not for anything else. We'd not necessarily error out in worker and some other functions not using the metadata cache and related infrastructure, but that's probably ok. If needed we can add checks to the individual places. That'd also allow us to run non-citus queries, which seems good.

For modifications, we'd currently mark placements as inactive if we saw a version error, while a loose version check would have allowed the write to proceed.

That could partially be addressed by using a recognizable error code, like e.g. the constraint error checks in router executor.

marcocitus commented 8 years ago

I'm not convinced. How's getting an obscure error message any better than a proper one?

Not sure what the obscure message is. We should have a version-related error if your code is below the minimum required version and no error message otherwise.

There are plenty of other reasons why not requiring the versions to be strictly in sync simplifies deployment. For example, client libraries also try to establish connections multiple times or restore a previously established connection, so they can survive a restart, but they won't expect a version error.

When upgrading, not having explicit version checks allows you more flexibility around how you update. For example, you could first restart all your workers before altering extension on all of them without harm. You could promote a secondary and move traffic simultaneously and run alter extension afterwards. Etc.

How to upgrade is a big concern for many users and we definitely shouldn't make that harder.

lithp commented 8 years ago

Enforcing exact matches means every Citus upgrade will require downtime.

I think that's pretty fundamentally required.

There's a real difference between these options:

1) Stop sending all queries to the database. Restart the database. Run UPDATE. Start sending queries again. 2) Restart the database. Run UPDATE. Start using new features 3) Restart the database. Start using new features, but sometimes an error has happened and you take unexpected downtime.

Technically the restart also counts as downtime but libraries usually retry can't connect errors and usually surface "the database gave you this weird error" errors, it's easier to integrate with a database if you can assume it's never in a state where it accepts queries but returns errors for all of them.

Neither placementId, nor colocationId introduction, nor the int8->int4 conversion, all discussed in this release, would be helped by having that. So it certainly doesn't help all that much.

You're right that there's likely no code we can write now which will save us from thinking about this problem going forward. However, all those PRs would fit under this framework. We've already done the work to make int8->int4 cleanly handle partial upgrades.

The complications of having to think about compatibility in these cases, the likelihood of getting it wrong and the difficulty of reliably testing this, all seems to suggest that just forbidding running in such a configuration is a much better idea.

For testing we can add a schedule which runs through all the upgrades, not including the current version, then runs all the usual tests.

There will be some tests which rely on the new features so those won't be part of the upgrade_check_schedule, and part of the release process is moving the last versions' tests into it.

anarazel commented 8 years ago

Not sure what the obscure message is. We should have a version-related error if your code is below the minimum required version and no error message otherwise.

I was referring to

ereport(ERROR, (errmsg("unexpected null in pg_dist_shard_placement_tuple")));

added with you looking over my shoulder ;)

Looking at this realistically, there's no way we're going to work properly without pg_dist_node present or without the placementId column existing.

There are plenty of other reasons why not requiring the versions to be strictly in sync simplifies deployment. For example, client libraries also try to establish connections multiple times or restore a previously established connection, so they can survive a restart, but they won't expect a version error.

That doesn't change if we error out in only some cases, when still required.

How to upgrade is a big concern for many users and we definitely shouldn't make that harder.

I agree, but adding complexity that ends up not helping, doesn't address that. I'm not going to protest further, but I'll bet you a bottle of good scotch that we'll forget updating the minimum required version in one of the next two releases, and that we'll have multi-roundtrip discussions about when to increase it over the next releases.

anarazel commented 8 years ago

Neither placementId, nor colocationId introduction, nor the int8->int4 conversion, all discussed in this release, would be helped by having that. So it certainly doesn't help all that much.

You're right that there's likely no code we can write now which will save us from thinking about this problem going forward. However, all those PRs would fit under this framework.

How does placementId and/or pg_dist_node introduction fall under that if we don't error out?

We've already done the work to make int8->int4 cleanly handle partial upgrades.

Not quite, it's missing the case where the loaded shared library version is too old. But that's something we're planning on handling under this umbrella.

lithp commented 8 years ago

For pg_dist_node you can change the upgrade script to create the table and add the rows in a single transaction. The rest of the code either reads from the file or from the table depending on whether the table exists. It's a little ugly but the code is only redundant until we cut a release, then we can remove the part which reads from the file.

anarazel commented 8 years ago

@lithp Huh? Unless we keep that code for a fair while, we've gained absolutely nothing. What we're trying is to make the upgrade easier, that's not going to be achieved if we remove the code at the release. We can't even remove it at release+1, because users often skip releases. Nor will this actually be reasonably testable. I'm going to protest VERY HARD against trying to support that. I can live with not making all upgrades "required hard", but I think it'd be a bad bad mistake trying to always have compatibility. It'd e.g. certainly delay this release.

anarazel commented 8 years ago

I think this basically means we want to:

Add code making extension version available from the shared library. I think for the purpose of this issue returning the version numerically (e.g. 5022,6013) would be best
declare the minimum required schema version supported by the shared library, that probably needs to be a define somewhere
during/after extension upgrades verify that the extension's version is not newer than the shared libraries. This prevents the scenario of doing an ALTER EXTENSION UPDATE without a restart after the installation of the shared library.
In strategic places, e.g. before loading distributed table metadata, add an error check about the minimum required schema version.

anarazel commented 8 years ago

It'd e.g. certainly delay this release.

To clarify: It'd do so because we'd need to re-add code that populates the worker hash from the file, in InitializeWorkerNodeCache(). And we'd need to keep and test that code until we don't support upgrading from a version that still supported pg_worker_list.conf.

lithp commented 8 years ago

Re it'd delay this release, you're absolutely right. I think it's too late to implement for this release. I'm thinking about your other points.

lithp commented 8 years ago

Okay, if users skip releases I agree with you that complete backwards compatibility is likely quite difficult. I still think aiming for zero downtime should be our goal though.

Just realized that this whole thread I've been arguing against something from a private conversation but which you've never mentioned here: checking that the version matches and automatically running UPDATE if it doesn't. I don't like that there's the potential for the UPDATE to fail for some reason but in most cases (like adding functions or columns) there's no danger of that happening.

@marcocitus @anarazel What do you think of taking the above list, changing it back to an exact match instead of a minimum version match, and in a startup hook running UPDATE if the version is lower than the code's version?

ozgune commented 7 years ago

We talked about our new minor release upgrade policy in an internal meeting. I think the following part is relevant to this issue (copy/paste from @sumedhpathak's email):

Minor release: e.g. 6.1 -> 6.2. This release contains new functionality, and can contain catalog table or API changes as long as they are additive and don't break previous functionality. A change which we consider as breaking this policy would be removal of a catalog table or GUC.

User's will need to explicitly update via:

yum install citus-6.2
restart backend
'alter extension citus update'.

To make sure that users don't run into issues after installing a new version, we have to make sure that they restart their backend and run an ALTER EXTENSION CITUS UPDATE. Starting with 6.2, we'd like to have a mechanism to enforce that once the user installs a new Citus version, they also run ALTER EXTENSION.

@lithp -- I realize that we missed you in the Citus versioning meeting. Does this sound good overall?

lithp commented 7 years ago

Okay, if I can restate it in my own words:

We add a fail-safe preventing you from running ALTER EXTENSION before you've restarted
For minor releases we allow the user to restart and then ALTER at their leisure. (And restrict ourselves to the kinds of changes which allow such)
For major releases we fail-safe and prevent the user from running any distributed queries until ALTER has been run.

If that's what you decided, it sounds like a reasonable trade-off.

sumedhpathak commented 7 years ago

@lithp

We also use the fail-safe for minor releases. The conclusion primarily has been that we expect an ALTER EXTENSION even for minor releases, and that without that we may crash or exhibit other bad behavior. The packaging changes also re-inforce this, in that we don't automatically upgrade someone from 6.1 to 6.2, and thus make it a conscious decision to do so.

lithp commented 7 years ago

I don't want to argue too much if we've already gone through all the effort of coming to a decision.

It's a little unfortunate that we're forcing users to take downtime even for minor releases though, what's the point of restricting ourselves to additions if we're still going to require that users run alter extension before accepting any queries?

byucesoy commented 7 years ago

We had a meeting about this, here is the summary of what we decide;

We will error out when binary version and available version are different. (scenario 1 above)
- In our discussions, we decided to favor date safety over convenience.
We will error out when binary version and catalog version are different. (scenario 2 above)
- In this scenario, binaries may expect to see some metadata which is not created yet. Thus causing crash or data corruption. Erroring out here is must.
We will only check for scenario 1 when new connection is opened and cache the result.
- citus.control file can change any time and checking it at each query execution is costly. We will check it only at new connection.
As implementation detail, we will use DirectFunctionCall instead of SPI.
- Otherwise we would intercept the SPI itself and try to do version check (with again SPI).

jasonmp85 commented 7 years ago

Curveball question: do we support downgrading at all? How do people roll back from 6.2 to 6.1 if they have deployment problems?

byucesoy commented 7 years ago

We don't support downgrading. User has to DROP EXTENSION and re-create it. I am not aware of better way if there is one.

citusdata / citus

A proposal for Improving the Safety of Extension Upgrades #907