ansible / django-ansible-base

Apache License 2.0
12 stars 43 forks source link

Reverse-syncing of resources to a resource server #548

Closed relrod closed 1 month ago

relrod commented 1 month ago

This implements reverse syncing from services to a resource server.

In the interest of trying to prevent as many failure scenarios as possible (though this is NOT failproof as @AlanCoding points out below in the comments), we wrap the REST call and the local commit in a transaction (unless a transaction is already present).

For modules which sync (users, orgs, teams), their save() method is monkeypatched to wrap the whole save in a transaction, if necessary.

A pre_delete and post_save signal is used for doing the actual sync.

In order to prevent sync loops, we sneak an internal attribute on instances that we create/fetch on requests from a resource server. In the syncing function, we look for this attribute and abort syncing if it's there and True.

Unit tests (somewhat gnarly ones) are provided. They attempt to ensure that we're calling the right REST client methods when we need to, bailing out early when we need to, rolling back when we need to, and committing when we need to. Coverage should be pretty darn close to 100%.

A new setting, DISABLE_RESOURCE_SERVER_SYNC works to disable the reverse syncing. We use it in test_app to avoid syncing on all but the specific reverse syncing tests (the ones written in this PR).

AlanCoding commented 1 month ago

Can we talk out the specific scope of this?

This is strictly for CRUD of a specific shared item, meaning POST/PATCH/DELETE of an org/team/user in a service.

When that CRUD happens, this is expected to fire off a request to POST/PATCH/DELETE to the corresponding endpoint under the /service-index/resources/ endpoint in the resource provider? Good. It's really important to write that out and get everyone on board. One sentence, we should repeat it many times over.

To wrap up the scope, what I need to know is then, what is expected of the resource provider? Is it expected that the resource provider then syncs that change to other services? Either way, you have more explaining to do. This can't be left unsaid.

I'll continue with "no", the resource provider will not sync that change to the other services. This will leave the cluster out-of-sync until those other services run the 15-minute periodic resource sync, originally by @rochacbruno. I would agree that's the best answer, but it's not obvious, and to the best of my knowledge, this is the first description of such a plan.

relrod commented 1 month ago

@AlanCoding yes, this all needs to be documented, this is still a prototype and WIP, that's why it's still a draft :smile:

You're correct that, to the best of my knowledge, there's no plan to trigger an immediate sync from the resource server to the other services, so then yes there would be a period of time where the cluster will be out of sync until the services all check in for an update.

AlanCoding commented 1 month ago

@newswangerd to document part of my hesitation here, AWX has a well-known category of bugs that only happen on commit. Example:

https://github.com/ansible/awx/issues/11165

tools_awx_1     | awx-uwsgi stdout |   File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 262, in commit
tools_awx_1     | awx-uwsgi stdout |     self._commit()
tools_awx_1     | awx-uwsgi stdout |   File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 240, in _commit
tools_awx_1     | awx-uwsgi stdout |     return self.connection.commit()
tools_awx_1     | awx-uwsgi stdout |   File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
tools_awx_1     | awx-uwsgi stdout |     raise dj_exc_value.with_traceback(traceback) from exc_value
tools_awx_1     | awx-uwsgi stdout |   File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 240, in _commit
tools_awx_1     | awx-uwsgi stdout |     return self.connection.commit()
tools_awx_1     | awx-uwsgi stdout | django.db.utils.IntegrityError: insert or update on table "main_host" violates foreign key constraint

This kind of bug might be racy, but it's absolutely not at all uncommon. I mention it because it's a time when errors happen between the sync_to_resource_server(self, action) call and the closing of the transaction, and because it's intractably unavoidable. So it's reasonable to imagine that the sync is successful, and then the transaction still gets rolled back, leading to an inconsistency.

I just don't want promises that this is going to be fully transactionally robust, because it simply can never be.

newswangerd commented 1 month ago

@AlanCoding This is a pretty big bummer. I had naively assumed that the transaction would check for foreign key constraints before commit.

Over all, I am more concerned about bugs that could cause the resource to get created in the services and NOT on the resource server. Anything that is created on the resource server will ultimately get pulled down to each service, so it will be eventually consistent. The big problem with this is that if you try to create the resource in the service, and it already was created in the resource server, the uniqueness constraint will fail on the resource server and prevent you from creating it locally.

I was thinking we could resolve this problem by switching the sync_to_resource_server to be a get_or_create operation. That way if the resource already exists on the resource server, the function will resolve in a sync instead of a create. There may be problems here that we have not considered, but I'd like the two of you (@AlanCoding and @relrod) to spend some time considering downsides for this approach.

One less common, but possibly nastier class of problems that could result from this is failures during updates. This will only matter in cases where names (or usernames) are updated, which is not a common thing to do, but the result would be that the resource is renamed in Gateway, but not on the service. I think this will get resolved during the periodic sync, but there will be a maximum of 15 minutes where the names will be out of sync on the two systems and potentially causing confusion.

AlanCoding commented 1 month ago

Over all, I am more concerned about bugs that could cause the resource to get created in the services and NOT on the resource server. Anything that is created on the resource server will ultimately get pulled down to each service, so it will be eventually consistent.

Yes, I agree. The scenario I mentioned is actually relatively okay. That scenario is:

In that case we would just wait for the periodic_resource_sync. And in the case of AWX, we could just schedule the sync task async in the request failure handler (hopefully based on some flag indicating that the Gateway resource was created). So this is actually all-okay.

I was thinking we could resolve this problem by switching the sync_to_resource_server to be a get_or_create operation.

It doesn't do this now? Are you saying it doesn't create resources? Agree, I'd prefer it to create items from the resource provider if not locally present.

possibly nastier class of problems that could result from this is failures during updates. This will only matter in cases where names (or usernames) are updated

I'd again point to the option I mentioned here, scheduling periodic_resource_sync in finalize_response. We probably can't do this in DAB shared base views or anything like that, because the means of scheduling an async task is completely service-specific.

Or maybe, instead of finalize_response, hook into

https://docs.djangoproject.com/en/5.0/ref/signals/#got-request-exception

Using signals might give us a better shot at sharing data to know whether or not the request has made a change in Gateway.

AlanCoding commented 1 month ago

I would like to merge https://github.com/relrod/django-ansible-base/pull/1 in with the rest of this. Because that will avoid nested transaction in practice for eda-server requests specifically, which I'm pretty specific about not wanting.

AlanCoding commented 1 month ago

Whenever you get into full-cycle integration testing, what I'm predicting is:

Any logic connected to post_migrate signals that modify users/teams/orgs will try to sync any of the changes that it makes to the resource provider. Same for any logic ran manually after migration. This is a problem for installation ordering. Components must finish their migrations and then the resource provider must merge their data (migrate_service_data). What is expected run, order-wise, before the migration in the install story? This, for one:

https://github.com/ansible/awx/blob/devel/awx/main/management/commands/create_preload_data.py

Should we turn off the syncing while that is running?

sonarcloud[bot] commented 1 month ago

Quality Gate Passed Quality Gate passed

Issues
12 New issues
0 Accepted issues

Measures
0 Security Hotspots
98.3% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud