Doc changes for Metadata sync transactional/non-transactional mode

pinodeca commented 1 year ago

Why are we implementing it? (sales eng)

Customers with very large clusters have problems adding new nodes to their clusters and upgrading above Citus 11.0, which introduced query from any node, due to memory problems during metadata sync.

What are the typical use cases?

Adding a new node to a cluster
Upgrading clusters >= Citus 11.0

Communication goals (e.g. detailed howto vs orientation)

Good locations for content in docs structure

We might use here to explain the GUC citus.metadata_sync_mode
We miss some of the udfs related to metadata here . Maybe, we can add the udf start_metadata_sync_to_all_nodes there and mention about how it performs differently according to the GUC. Or just skip this and explain only the GUC.

How does this work? (devs)

We added an alternative non-transactional mode to the current metadata sync which performs inside a single transaction. Single transaction mode causes issues since PG has a hard memory limit related to cache invalidations. But now, we can switch into non-transactional mode, which syncs the metadata via many transactions, if we have such a memory error.

Example sql

To add a new node:

SET citus.metadata_sync_mode TO 'nontransactional';
SELECT citus_add_node(<ip>, <port>);

To sync all the nodes:

SET citus.metadata_sync_mode TO 'nontransactional';
SELECT start_metadata_sync_to_all_nodes();

Corner cases, gotchas

Default sync mode is transactional.
When we hit hard memory limit by PG, we can switch into non-transactional mode. Only super users can switch into this mode. It is operators' responsibility to handle such cases.
Non-transactional mode is designed to be idempotent, which means we can run it over and over again even if the operation fails in the middle.
We set the flag metadatasynced as false at start of the non-transactional sync so that we would know that something went wrong during the sync.
Non-transactional mode cannot be run inside a transaction block.

Are there relevant blog posts or outside documentation about the concept/feature?

Not yet. But I plan to publish a blog post.

Link to relevant commits and regression tests if applicable

You can see the related PR https://github.com/citusdata/citus/pull/6728 Regression tests:

pinodeca commented 1 year ago

@aykut-bozkurt Can you please fill in each of the template sections in the description above?

jonels-msft commented 1 year ago

@aykut-bozkurt thanks for the details, I have a few follow-up questions.

It sounds like non-transactional sync can fail but can be re-run safely. Is it the user's responsibility to check for failure? On failure does the user need to manually retry their calls to functions like citus_add_node() and start_metadata_sync_to_all_nodes()?
Should we advise users to try transactional mode first and see if they get an error, then switch to non-transactional and try again? Or is the OOM problem pretty severe, so that we should advise people with clusters greater than a certain size to always go non-transactional? If so, what size?
If transactional sync can die and non-transactional can't, why is our default transactional? I want to present the pros and cons of both options clearly in the docs.

aykut-bozkurt commented 1 year ago

It is user who should manually rerun after any failure,
Default mode is transactional which is safer in case of a failure since we perform inside single transaction, which means either commit or rollback all the changes to end up at a consistent state, (They only should switch into non-transactional mode when memory failure hits)
As I said transactional mode performs atomically, but non-transactional sync can fail and end up at inconsistent state even if user can manually rerun it safely later.

citusdata / citus_docs