WeblateOrg / weblate

Web based localization tool with tight version control integration.
https://weblate.org/
GNU General Public License v3.0

perform_push task locks repository for a long time #9128

Closed · jorrit closed this issue 1 year ago

jorrit commented 1 year ago

Describe the issue

I update the units in my Docker-hosted Weblate installation via the API by uploading partial gettext files. Sometimes these requests fail due to a timeout. I've extracted a part of the logs from around the time this happens:

The perform_push task is created at 16:46:27,952. The do_update part is executed and acquires the lock at 16:46:27,981. The repository is up to date at 16:46:28,342. However, the lock is only released at 16:48:28,103, two minutes later. It is immediately acquired again, I assume for the push_repo part of the task.

It is hard to debug this problem because the individual steps in component.do_update are not logged. Perhaps debug-level logging could be added to this method and the methods it calls. Also, I think it would be beneficial to add debug-level logging to the execute() method of vcs/base, in order to understand which commands are executed and how long they take.
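As an illustration, a timing wrapper of roughly this shape would already answer the "which command, how long" question (a hypothetical sketch, not Weblate's actual vcs/base code):

    import logging
    import subprocess
    import time

    logger = logging.getLogger("weblate.vcs")

    def execute(args, cwd):
        # Hypothetical sketch: log the command before it runs and the elapsed
        # time afterwards, so slow git invocations show up at DEBUG level.
        logger.debug("running %s in %s", args, cwd)
        start = time.monotonic()
        result = subprocess.run(args, cwd=cwd, capture_output=True, text=True, check=True)
        logger.debug("finished %s in %.2f seconds", args, time.monotonic() - start)
        return result.stdout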

Thank you for your consideration.

[2023-04-20 15:28:36,771: DEBUG/50] Acquiring Lock('lock:lock:gitlock') ...
[2023-04-20 15:28:36,773: INFO/50] Acquired Lock('lock:lock:gitlock').
[2023-04-20 15:28:36,775: DEBUG/50] Starting renewal thread for Lock('lock:lock:gitlock'). Refresh interval: 40.0 seconds.
[2023-04-20 15:28:36,777: DEBUG/50] Signaling renewal thread for Lock('lock:lock:gitlock') to exit.
[2023-04-20 15:28:36,777: DEBUG/50] Exiting renewal thread for Lock('lock:lock:gitlock').
[2023-04-20 15:28:36,777: DEBUG/50] Renewal thread for Lock('lock:lock:gitlock') exited.
[2023-04-20 15:28:36,778: DEBUG/50] Releasing Lock('lock:lock:gitlock').
gunicorn stderr | [2023-04-20 16:46:23,691: INFO/381] mycompany/productdata: running source checks for 0 strings
gunicorn stderr | [2023-04-20 16:46:23,694: DEBUG/381] Signaling renewal thread for Lock('lock:component-update-lock-5') to exit.
gunicorn stderr | [2023-04-20 16:46:23,695: DEBUG/381] Exiting renewal thread for Lock('lock:component-update-lock-5').
gunicorn stderr | [2023-04-20 16:46:23,696: DEBUG/381] Renewal thread for Lock('lock:component-update-lock-5') exited.
gunicorn stderr | [2023-04-20 16:46:23,696: DEBUG/381] Releasing Lock('lock:component-update-lock-5').
celery-notify stderr | [2023-04-20 16:46:24,145: INFO/MainProcess] Task weblate.accounts.tasks.notify_change[8ff7fb40-bcf2-429e-831c-da7ce697a3da] received
gunicorn stderr | [2023-04-20 16:46:24,147: INFO/381] mycompany/productdata: updating stats caches
celery-celery stderr | [2023-04-20 16:46:24,175: INFO/MainProcess] Task weblate.trans.tasks.update_component_stats[1c0f6f38-cc03-4703-8944-a6836d31fc39] received
celery-celery stderr | [2023-04-20 16:46:24,215: INFO/MainProcess] Task weblate.glossary.tasks.sync_glossary_languages[441fab83-0304-4d8b-95eb-0fc4af3fe81b] received
celery-celery stderr | [2023-04-20 16:46:24,245: INFO/MainProcess] Task weblate.checks.tasks.batch_update_checks[7d748580-e24e-43fe-8f67-cd66191eb000] received
gunicorn stderr | [2023-04-20 16:46:24,916: DEBUG/381] Acquiring Lock('lock:lock:repo:5') ...
gunicorn stderr | [2023-04-20 16:46:24,946: INFO/381] Acquired Lock('lock:lock:repo:5').
gunicorn stderr | [2023-04-20 16:46:24,948: DEBUG/381] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
gunicorn stderr | [2023-04-20 16:46:24,966: INFO/381] mycompany/productdata/nl: committing pending changes (pending unit)
gunicorn stderr | [2023-04-20 16:46:26,651: DEBUG/381] mycompany/productdata/nl: running store_post_load add-on: weblate.gettext.customize
gunicorn stderr | [2023-04-20 16:46:26,771: DEBUG/381] mycompany/productdata/nl: completed store_post_load add-on: weblate.gettext.customize
gunicorn stderr | [2023-04-20 16:46:27,219: INFO/381] mycompany/productdata/nl: committed ['/app/data/vcs/mycompany/productdata/nl.po'] as Anonymous <noreply@weblate.org>
gunicorn stderr | [2023-04-20 16:46:27,371: INFO/381] mycompany/productdata/nl: committed ['/app/data/vcs/mycompany/productdata/nl.po'] as "Camper.nl admin" <bot-mycompany-campernl-admin@bots.noreply.weblate.org>
gunicorn stderr | [2023-04-20 16:46:27,380: INFO/381] mycompany/productdata/de: committing pending changes (pending unit)
gunicorn stderr | [2023-04-20 16:46:27,483: DEBUG/381] mycompany/productdata/de: running store_post_load add-on: weblate.gettext.customize
gunicorn stderr | [2023-04-20 16:46:27,484: DEBUG/381] mycompany/productdata/de: completed store_post_load add-on: weblate.gettext.customize
gunicorn stderr | [2023-04-20 16:46:27,766: INFO/381] mycompany/productdata/de: committed ['/app/data/vcs/mycompany/productdata/de.po'] as Anonymous <noreply@weblate.org>
gunicorn stderr | [2023-04-20 16:46:27,920: INFO/381] mycompany/productdata/de: committed ['/app/data/vcs/mycompany/productdata/de.po'] as "Camper.nl admin" <bot-mycompany-campernl-admin@bots.noreply.weblate.org>
gunicorn stderr | [2023-04-20 16:46:27,926: DEBUG/381] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
gunicorn stderr | [2023-04-20 16:46:27,926: DEBUG/381] Exiting renewal thread for Lock('lock:lock:repo:5').
gunicorn stderr | [2023-04-20 16:46:27,926: DEBUG/381] Renewal thread for Lock('lock:lock:repo:5') exited.
gunicorn stderr | [2023-04-20 16:46:27,926: DEBUG/381] Releasing Lock('lock:lock:repo:5').
gunicorn stderr | [2023-04-20 16:46:27,928: INFO/381] mycompany/productdata: checking triggered alerts
gunicorn stderr | [2023-04-20 16:46:27,943: INFO/381] mycompany/productdata: scheduling push
celery-celery stderr | [2023-04-20 16:46:27,952: INFO/MainProcess] Task weblate.trans.tasks.perform_push[1e4022d0-9de6-4ffb-bdc6-13dae9f9cec4] received
celery-celery stderr | [2023-04-20 16:46:27,981: DEBUG/399] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:46:27,981: DEBUG/ForkPoolWorker-1] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:46:27,984: INFO/399] Acquired Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:46:27,984: INFO/ForkPoolWorker-1] Acquired Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:46:27,985: DEBUG/399] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
celery-celery stderr | [2023-04-20 16:46:27,985: DEBUG/ForkPoolWorker-1] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
celery-celery stderr | [2023-04-20 16:46:27,991: INFO/399] mycompany/productdata: updating repository
celery-celery stderr | [2023-04-20 16:46:27,991: INFO/ForkPoolWorker-1] mycompany/productdata: updating repository
gunicorn stderr | [2023-04-20 16:46:28,013: DEBUG/381] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:46:28,336: INFO/399] mycompany/productdata: update took 0.34 seconds
celery-celery stderr | [2023-04-20 16:46:28,336: INFO/ForkPoolWorker-1] mycompany/productdata: update took 0.34 seconds
celery-celery stderr | [2023-04-20 16:46:28,342: INFO/399] mycompany/productdata: repository up to date at a18c3e1ade744565307e832a192fb70157d72680
celery-celery stderr | [2023-04-20 16:46:28,342: INFO/ForkPoolWorker-1] mycompany/productdata: repository up to date at a18c3e1ade744565307e832a192fb70157d72680
gunicorn stderr | [2023-04-20 16:46:37,147: INFO/382] received Azure notification on repository mycomponent, URL http://tfs/tfs/TourOperators/mycompany/_git/mycomponent, branch refs/tags/mycomponent_master_1.0.0.407, 0 matching components, 0 to process, 0 linked
celery-celery stderr | [2023-04-20 16:47:07,988: DEBUG/399] Refreshing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:47:07,988: DEBUG/ForkPoolWorker-1] Refreshing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:47:21,514: INFO/MainProcess] Task weblate.utils.tasks.heartbeat[e22899ba-e120-4350-995b-6b6f04a7ee2e] received
celery-celery stderr | [2023-04-20 16:47:48,009: DEBUG/399] Refreshing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:47:48,009: DEBUG/ForkPoolWorker-1] Refreshing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:21,516: INFO/MainProcess] Task weblate.utils.tasks.heartbeat[6752586c-becb-4ab1-ae45-c44c67058985] received
celery-celery stderr | [2023-04-20 16:48:28,010: DEBUG/399] Refreshing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,010: DEBUG/ForkPoolWorker-1] Refreshing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/399] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/ForkPoolWorker-1] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/399] Exiting renewal thread for Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/ForkPoolWorker-1] Exiting renewal thread for Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/399] Renewal thread for Lock('lock:lock:repo:5') exited.
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/ForkPoolWorker-1] Renewal thread for Lock('lock:lock:repo:5') exited.
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/399] Releasing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,103: DEBUG/ForkPoolWorker-1] Releasing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,114: DEBUG/399] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:48:28,114: DEBUG/ForkPoolWorker-1] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:48:28,116: INFO/399] Acquired Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,116: INFO/ForkPoolWorker-1] Acquired Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,116: DEBUG/399] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
celery-celery stderr | [2023-04-20 16:48:28,116: DEBUG/ForkPoolWorker-1] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
celery-celery stderr | [2023-04-20 16:48:28,118: INFO/399] mycompany/productdata: pushing to remote repo
celery-celery stderr | [2023-04-20 16:48:28,118: INFO/ForkPoolWorker-1] mycompany/productdata: pushing to remote repo
gunicorn stderr | [2023-04-20 16:48:28,210: WARNING/381] Upload error: WeblateLockTimeout: Lock could not be acquired in 120s
gunicorn stderr | [2023-04-20 16:48:28,216: ERROR/381] Upload error
celery-celery stderr | [2023-04-20 16:48:28,685: DEBUG/399] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
celery-celery stderr | [2023-04-20 16:48:28,685: DEBUG/ForkPoolWorker-1] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
celery-celery stderr | [2023-04-20 16:48:28,685: DEBUG/399] Exiting renewal thread for Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,685: DEBUG/ForkPoolWorker-1] Exiting renewal thread for Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,686: DEBUG/399] Renewal thread for Lock('lock:lock:repo:5') exited.
celery-celery stderr | [2023-04-20 16:48:28,686: DEBUG/ForkPoolWorker-1] Renewal thread for Lock('lock:lock:repo:5') exited.
celery-celery stderr | [2023-04-20 16:48:28,686: DEBUG/399] Releasing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:28,686: DEBUG/ForkPoolWorker-1] Releasing Lock('lock:lock:repo:5').
celery-notify stderr | [2023-04-20 16:48:28,698: INFO/MainProcess] Task weblate.accounts.tasks.notify_change[e9bf3da6-7138-40d0-86bc-8e0f59c22f2f] received
gunicorn stderr | [2023-04-20 16:48:36,251: INFO/381] received Azure notification on repository productdata-translations, URL http://tfs01.citar.local:8080/tfs/TourOperators/mycompany/_git/productdata-translations, branch main, 1 matching components, 1 to process, 0 linked
gunicorn stderr | [2023-04-20 16:48:36,251: INFO/381] Azure notification will update mycompany/Product data
celery-notify stderr | [2023-04-20 16:48:36,262: INFO/MainProcess] Task weblate.accounts.tasks.notify_change[c32b4272-4a87-4be3-88d0-32ad38e856f5] received
celery-celery stderr | [2023-04-20 16:48:36,276: INFO/MainProcess] Task weblate.trans.tasks.perform_update[cc63c220-95ca-438a-887f-19b5c250264d] received
celery-celery stderr | [2023-04-20 16:48:36,324: DEBUG/399] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:48:36,324: DEBUG/ForkPoolWorker-1] Acquiring Lock('lock:lock:repo:5') ...
celery-celery stderr | [2023-04-20 16:48:36,327: INFO/399] Acquired Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:36,327: INFO/ForkPoolWorker-1] Acquired Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:36,328: DEBUG/399] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
celery-celery stderr | [2023-04-20 16:48:36,328: DEBUG/ForkPoolWorker-1] Starting renewal thread for Lock('lock:lock:repo:5'). Refresh interval: 40.0 seconds.
celery-celery stderr | [2023-04-20 16:48:36,332: INFO/399] mycompany/productdata: updating repository
celery-celery stderr | [2023-04-20 16:48:36,332: INFO/ForkPoolWorker-1] mycompany/productdata: updating repository
celery-celery stderr | [2023-04-20 16:48:36,740: INFO/399] mycompany/productdata: update took 0.41 seconds
celery-celery stderr | [2023-04-20 16:48:36,740: INFO/ForkPoolWorker-1] mycompany/productdata: update took 0.41 seconds
celery-celery stderr | [2023-04-20 16:48:36,747: INFO/399] mycompany/productdata: repository up to date at d1f543db50e2fd0ff561d4fe608cb76ab16c76da
celery-celery stderr | [2023-04-20 16:48:36,747: INFO/ForkPoolWorker-1] mycompany/productdata: repository up to date at d1f543db50e2fd0ff561d4fe608cb76ab16c76da
celery-celery stderr | [2023-04-20 16:48:36,775: DEBUG/399] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
celery-celery stderr | [2023-04-20 16:48:36,775: DEBUG/ForkPoolWorker-1] Signaling renewal thread for Lock('lock:lock:repo:5') to exit.
celery-celery stderr | [2023-04-20 16:48:36,775: DEBUG/399] Exiting renewal thread for Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:36,775: DEBUG/ForkPoolWorker-1] Exiting renewal thread for Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:36,776: DEBUG/399] Renewal thread for Lock('lock:lock:repo:5') exited.
celery-celery stderr | [2023-04-20 16:48:36,776: DEBUG/ForkPoolWorker-1] Renewal thread for Lock('lock:lock:repo:5') exited.
celery-celery stderr | [2023-04-20 16:48:36,776: DEBUG/399] Releasing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:48:36,776: DEBUG/ForkPoolWorker-1] Releasing Lock('lock:lock:repo:5').
celery-celery stderr | [2023-04-20 16:49:21,519: INFO/MainProcess] Task weblate.utils.tasks.heartbeat[0a55df0b-4ee7-4943-8e2e-10faf2adf4e5] received

I already tried

Steps to reproduce the behavior

Invoking the /file/ endpoint of a single component multiple times with small updates and little time in between seems to trigger this problem.
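For reference, such an upload is a plain POST to the translation file endpoint, roughly like the sketch below (host, token, file name and the method value are placeholders; see the Weblate REST API docs for the exact options):

    import requests

    # Hypothetical sketch of the repeated partial upload described above.
    url = "https://weblate.example.com/api/translations/mycompany/productdata/nl/file/"
    headers = {"Authorization": "Token <api-token>"}

    with open("partial-nl.po", "rb") as handle:
        response = requests.post(
            url,
            headers=headers,
            files={"file": handle},
            data={"method": "translate"},  # controls how uploaded strings are merged
            timeout=300,
        )
    response.raise_for_status()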

Expected behavior

The repository lock should not be held for two minutes.

Screenshots

No response

Exception traceback

No response

How do you run Weblate?

Docker container

Weblate versions

weblate@3d2e89b9ce7d:/$ weblate list_versions

Weblate deploy checks

weblate@3d2e89b9ce7d:/$ weblate check --deploy
System check identified some issues:

INFOS:
?: (weblate.I028) Backups are not configured, it is highly recommended for production use
        HINT: https://docs.weblate.org/en/weblate-4.17/admin/backup.html

Additional context

No response

nijel commented 1 year ago

If you want more fine-grained insight into performance, I suggest using Sentry – it collects timing of most of the operations (command executions, SQL queries, cache accesses).

Looking at our Sentry performance data, it turns out that alert handling might be problematic here with many linked components. Is that what you are using? I've created https://github.com/WeblateOrg/weblate/issues/9139 to cover this.

nijel commented 1 year ago

Are you using the GitHub pull requests integration (or the equivalent for another code hosting service)? In that case, lowering VCS_API_DELAY might improve the situation.
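One way to apply that in a Docker deployment is through a custom settings override, assuming your image loads one; a minimal sketch (the value is illustrative, check the VCS_API_DELAY documentation for its meaning and default):

    # Settings override sketch: shorten the delay between API calls to the
    # code hosting service (illustrative value).
    VCS_API_DELAY = 5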

jorrit commented 1 year ago

Thanks for your answers. I am not using linked components and the VCS integration is with a local MS DevOps installation.

I have enabled Sentry integration. However, I did not set SENTRY_TRACES_SAMPLE_RATE, so I didn't see any performance statistics. I'll set it now and wait until I get this error again. Maybe it is worthwhile to document SENTRY_TRACES_SAMPLE_RATE.
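A minimal sketch of what enabling tracing amounts to, assuming Sentry is configured through Weblate's SENTRY_* settings (the DSN is a placeholder):

    SENTRY_DSN = "https://<key>@sentry.example.com/<project-id>"
    # Fraction of transactions sampled for performance tracing (1.0 = all of them).
    SENTRY_TRACES_SAMPLE_RATE = 0.5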

Does the Sentry integration also log the performance of git invocations? I am curious to see those statistics.

nijel commented 1 year ago

It traces the performance of all operations; see https://docs.sentry.io/product/performance/ for their documentation.

jorrit commented 1 year ago

I got a result. For some reason the code hangs on an SQL query:

[screenshot: Sentry performance trace showing the request stalled on an SQL query]

jorrit commented 1 year ago

Another instance of this slowdown happened at a later moment. Again, this seemingly trivial query took two minutes. My trans_component table contains 5 rows.

[screenshot: Sentry trace of the later occurrence, again showing the query taking about two minutes]

nijel commented 1 year ago

Maybe tweaking PostgreSQL configuration will help? See https://docs.weblate.org/en/latest/admin/install/docker.html#configuring-postgresql-server

jorrit commented 1 year ago

I'll try to find out whether something is locking the table. When I run the same kind of UPDATE query while the lock is held, it also takes more than a minute to execute. Outside of those times, it takes just a few milliseconds.

I use a separate postgresql server, but it is also in a Docker container.

jorrit commented 1 year ago

I used some queries from https://wiki.postgresql.org/wiki/Lock_Monitoring to find the source of the locks.
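A simplified equivalent of those queries as a small standalone script, using pg_blocking_pids() instead of the wiki's pg_locks join (connection parameters are placeholders; pg_blocking_pids() needs PostgreSQL 9.6 or newer):

    import psycopg2

    conn = psycopg2.connect("dbname=weblate user=weblate host=postgres")
    with conn, conn.cursor() as cur:
        # List every backend that is currently blocked, together with the
        # pids blocking it and the statement it is waiting to run.
        cur.execute(
            """
            SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, query
            FROM pg_stat_activity
            WHERE cardinality(pg_blocking_pids(pid)) > 0
            """
        )
        for pid, blocked_by, state, query in cur.fetchall():
            print(pid, blocked_by, state, query)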

These are the processes at the moment when the timeouts occur: pg activity.csv

The query from the link above gives 7526 as the blocked pid and 7524 as the blocking pid. The blocked statement was UPDATE "trans_component" SET "remote_revision" = '570f4d9cf7ca79c483a2d924bb720c8ca8f0d3c0' WHERE "trans_component"."id" = 5. The blocking statement was SELECT "trans_component"."id", "trans_component"."name", "trans_component"."slug", "trans_component"."project_id", "trans_component"."vcs", "trans_component"."repo", "trans_component"."linked_component_id", "trans_component"."push", "trans_component"."repoweb", "trans_component"."git_export", "trans_component"."report_source_bugs", "trans_component"."branch", "trans_component"."push_branch", "trans_component"."filemask", "trans_component"."template", "trans_component"."edit_template", "trans_component"."intermediate", "trans_component"."new_base", "trans_component"."file_format", "trans_component"."locked", "trans_component"."allow_translation_propagation", "trans_component"."enable_suggestions", "trans_component"."suggestion_voting", "trans_component"."suggestion_autoaccept", "trans_component"."check_flags", "trans_component"."enforced_checks", "trans_component"."license", "trans_component"."agreement", "trans_component"."new_lang", "trans_component"."language_code_style", "trans_component"."manage_units",.

I suspect that this is a classic race condition: there are two kinds of locks, the Weblate lock in Redis and the table lock in PostgreSQL. The Celery task holds the Redis lock and needs the PostgreSQL lock, while another process holds the PostgreSQL lock and wants the Redis lock. That other process might be another perform_push task.
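To make the suspected interleaving concrete, here is a toy illustration of the lock ordering, with two in-process locks standing in for Redis and PostgreSQL (purely illustrative, not Weblate code; the 2-second timeouts play the role of the 120 s WeblateLockTimeout):

    import threading
    import time

    repo_lock = threading.Lock()  # stands in for the Weblate repository lock in Redis
    row_lock = threading.Lock()   # stands in for the PostgreSQL lock on trans_component

    def perform_push():
        # Celery task: takes the repository lock first, then needs the database row.
        with repo_lock:
            time.sleep(0.1)
            got = row_lock.acquire(timeout=2)
            print("perform_push got the database lock:", got)
            if got:
                row_lock.release()

    def api_upload():
        # Upload request: holds an open transaction (the row lock), then wants the
        # repository lock and eventually gives up -- the observed timeout.
        with row_lock:
            time.sleep(0.1)
            got = repo_lock.acquire(timeout=2)
            print("api_upload got the repository lock:", got)
            if got:
                repo_lock.release()

    threads = [threading.Thread(target=perform_push), threading.Thread(target=api_upload)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()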

I hope you can find a solution.

nijel commented 1 year ago

Thanks for the detailed analysis, I think I now see where the issue is.

github-actions[bot] commented 1 year ago

Thank you for your report; the issue you have reported has just been fixed.

jorrit commented 1 year ago

Děkuji! (Thank you!)