Closed jorrit closed 1 year ago
If you want more fine-grained insight into performance, I suggest using Sentry – it collects timing of most of the operations (command executions, SQL queries, cache accesses).
Looking at our Sentry performance data, it turns out that alert handling might be problematic here with many linked components. Is that what you are using? I've created https://github.com/WeblateOrg/weblate/issues/9139 to cover this.
Are you using the GitHub pull requests integration (or the equivalent for another code hosting service)? In that case, lowering VCS_API_DELAY might improve the situation.
Thanks for your answers. I am not using linked components and the VCS integration is with a local MS DevOps installation.
I have enabled Sentry integration. However, I did not set SENTRY_TRACES_SAMPLE_RATE, so I didn't see any performance statistics. I'll set it now and wait until I get this error again. Maybe it is worthwhile to document SENTRY_TRACES_SAMPLE_RATE.
Does the Sentry integration also log the performance of git invocations? I am curious to learn those statistics.
It traces the performance of all operations, see https://docs.sentry.io/product/performance/ for their docs.
I got a result. For some reason the code hangs at an SQL query:
Another instance of this slowdown happened at a later moment. Again, this seemingly trivial query took 2 minutes. My trans_component table contains 5 rows.
Maybe tweaking PostgreSQL configuration will help? See https://docs.weblate.org/en/latest/admin/install/docker.html#configuring-postgresql-server
I'll try to see if I can find out whether something is locking the table. When I run the same kind of UPDATE query while the lock is held, it also takes more than a minute to execute. Outside those times, just a few ms.
I use a separate PostgreSQL server, but it is also in a Docker container.
I used some queries from https://wiki.postgresql.org/wiki/Lock_Monitoring to find the source of the locks.
These are the processes at the moment when the timeouts occur: pg activity.csv
The query from the link above gives 7526 as the blocked pid and 7524 as the blocking pid.
The blocked statement was UPDATE "trans_component" SET "remote_revision" = '570f4d9cf7ca79c483a2d924bb720c8ca8f0d3c0' WHERE "trans_component"."id" = 5.
The blocking statement was SELECT "trans_component"."id", "trans_component"."name", "trans_component"."slug", "trans_component"."project_id", "trans_component"."vcs", "trans_component"."repo", "trans_component"."linked_component_id", "trans_component"."push", "trans_component"."repoweb", "trans_component"."git_export", "trans_component"."report_source_bugs", "trans_component"."branch", "trans_component"."push_branch", "trans_component"."filemask", "trans_component"."template", "trans_component"."edit_template", "trans_component"."intermediate", "trans_component"."new_base", "trans_component"."file_format", "trans_component"."locked", "trans_component"."allow_translation_propagation", "trans_component"."enable_suggestions", "trans_component"."suggestion_voting", "trans_component"."suggestion_autoaccept", "trans_component"."check_flags", "trans_component"."enforced_checks", "trans_component"."license", "trans_component"."agreement", "trans_component"."new_lang", "trans_component"."language_code_style", "trans_component"."manage_units",
.
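For anyone repeating this diagnosis, the lookup of blocked and blocking pids can be sketched as below. This is not one of the queries from the wiki page; it is a shorter variant that assumes PostgreSQL 9.6 or newer (for pg_blocking_pids) and any DB-API cursor, for example one from psycopg2. The function name find_blocked is hypothetical.

```python
# Sketch only: lists sessions that are currently waiting on a lock and
# the pids that block them. Assumes PostgreSQL >= 9.6.
BLOCKED_QUERY = """
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0
"""


def find_blocked(cursor):
    """Return one dict per blocked session using a DB-API cursor."""
    cursor.execute(BLOCKED_QUERY)
    return [
        {"pid": pid, "blocked_by": list(blocked_by), "query": query}
        for pid, blocked_by, query in cursor.fetchall()
    ]
```

Running it in a loop while the timeouts occur should show rows such as pid 7526 blocked by 7524, together with the waiting statement.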
I suspect that this is a classic race condition. There are two kinds of locks: the Weblate lock in Redis and the table lock in PostgreSQL. The Celery task holds the Redis lock and needs the PG lock, while another process has the PG lock and wants the Redis lock. It seems that the other process might be another perform_push task.
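To make the suspicion concrete, here is a small simulation of the pattern. It is not Weblate code: two threading.Lock objects stand in for the Redis repository lock and the PostgreSQL lock, and all names are made up for illustration. When the two workers take the locks in opposite order, each waits for the other until its timeout expires; when both use the same order, both finish.

```python
import threading


def attempt(first, second, barrier, results, name, timeout=0.2):
    """Hold `first`, then try to take `second` with a timeout."""
    with first:
        barrier.wait()  # both workers now hold their first lock
        got = second.acquire(timeout=timeout)
        if got:
            second.release()
        results[name] = got
        barrier.wait()  # keep `first` until both attempts are over


def simulate_inverted_order():
    """Workers take the two locks in opposite order: both stall."""
    repo_lock, row_lock = threading.Lock(), threading.Lock()
    barrier, results = threading.Barrier(2), {}
    workers = [
        threading.Thread(target=attempt,
                         args=(repo_lock, row_lock, barrier, results, "task")),
        threading.Thread(target=attempt,
                         args=(row_lock, repo_lock, barrier, results, "request")),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results


def simulate_consistent_order():
    """Workers take the locks in the same order: both succeed."""
    repo_lock, row_lock = threading.Lock(), threading.Lock()
    results = {}

    def work(name):
        with repo_lock:
            with row_lock:
                results[name] = True

    workers = [threading.Thread(target=work, args=(n,))
               for n in ("task", "request")]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

In the simulation the stall ends after the 0.2 s timeout; in my case the wait seems to end only after about two minutes.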
I hope you can find a solution.
Thanks for the detailed analysis, I think I now see where the issue is.
Thank you for your report; the issue you have reported has just been fixed.
Thank you!
Describe the issue
I update the units in my Docker-hosted Weblate installation via the API by uploading partial gettext files. Sometimes, these requests fail due to a timeout. I've extracted a part of the logs from around the time this happens:
The perform_push task is created at 16:46:27,952. The do_update part is executed and acquires the lock at 16:46:27,981. The repository is up to date at 16:46:28,342. However, the lock is only released at 16:48:28,103, 2 minutes later. It is immediately locked again, I assume for the push_repo part of the task.
It is hard to debug this problem because the individual steps in component.do_update are not logged. Perhaps debug level logging could be added to this method and the methods it calls. Also, I think it would be beneficial to add debug level logging to the execute() method of vcs/base, in order to understand which commands are executed and how long they took to execute.
Thank you for your consideration.
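As a rough illustration of the kind of debug logging I have in mind, the sketch below wraps a command execution and logs the command line, return code and duration. It is not the actual execute() method from vcs/base; run_logged and the logger name are hypothetical, and only the standard library is used.

```python
import logging
import shlex
import subprocess
import sys
import time

# Hypothetical logger name, chosen for this example only.
logger = logging.getLogger("vcs.debug")


def run_logged(args, cwd=None):
    """Run a command and log what was executed and how long it took."""
    start = time.monotonic()
    result = subprocess.run(
        args, cwd=cwd, capture_output=True, text=True, check=False
    )
    elapsed = time.monotonic() - start
    logger.debug(
        "exec %s -> rc=%d in %.3fs",
        shlex.join(args), result.returncode, elapsed,
    )
    return result


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    run_logged([sys.executable, "-c", "print('ok')"])
```

With something like this in place, a two-minute gap between two log lines would point directly at the step that is waiting.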
I already tried
Steps to reproduce the behavior
Invoking the /file/ endpoint of a single component multiple times, with small updates and a short time in between, seems to trigger this problem.
Expected behavior
The repository lock should not be held for two minutes.
Screenshots
No response
Exception traceback
No response
How do you run Weblate?
Docker container
Weblate versions
weblate@3d2e89b9ce7d:/$ weblate list_versions
Weblate deploy checks
Additional context
No response