dmwm / CMSRucio

7 stars 31 forks source link

Bug: delete-rule returns HTTP 504 (timeout) for some rules #787

Open haozturk opened 6 months ago

haozturk commented 6 months ago

Bug Description

UERJ site admin reported this problem and I was able to reproduce it with transfer_ops account. See the details below. Not all rule deletions fail:

[2024-04-29 12:45:46]    10.100.245.113    188.185.31.22    137.138.44.133    Zi-WegtBQFmMN72oWXvvMQAAAe8    200    676    0    9740    "DELETE /rules/8603bb9f74194f07a70e92906f5f10b8 HTTP/1.1"    "tier0_prod-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0-unknown-87178f21cbef497da6d3ca0bae992763"    "wmcore-client/1.29.10"    wmcoreD::--start
[2024-04-29 12:45:58]    10.100.245.113    10.100.61.128    152.92.255.251    Zi-WhgtBQFmMN72oWXvvOwAAAYg    504    682    247    60060086    "DELETE /rules/39a6afe366144e0db800f7bb67ba2cd8 HTTP/1.1"    "rvaladao-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=rvaladao/CN=783687/CN=Renan Bernardo Valadao-unknown-5f116761642e43dd98ac4b4e46ec9a36"    "rucio-clients/1.28.7"    rucio::delete-rule
[2024-04-29 12:49:44]    10.100.245.113    188.185.31.22    137.138.52.11    Zi-XaFHYoA_wJt9RrzO60AAAADE    200    678    0    28449    "DELETE /rules/295287aad56343d0a8c4d4caba783449 HTTP/1.1"    "tier0_replay-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=cmst0/CN=658085/CN=Robot: CMS Tier0-unknown-0adf0333b0104ac5a96e03881cf3dd97"wmcore-client/1.29.10"    wmcoreD::--start 

Reproduction Steps

[haozturk@lxplus936 dm]$ rucio delete-rule 39a6afe366144e0db800f7bb67ba2cd8
2024-04-29 15:08:49,887 WARNING Waiting 0.5s due to reason: server returned 504 
2024-04-29 15:11:50,393 WARNING Waiting 1.0s due to reason: server returned 504 
2024-04-29 15:11:51,394 ERROR   An unknown exception occurred.
$ rucio whoami
suspended_at : None
account    : transfer_ops
...

Expected Behavior

The rule deletion above should succeed.

Possible Solution

No response

Related Issues

No response

haozturk commented 6 months ago

Actually the expiry date of the rule is set properly, but the rule hasn't been cleaned even if it's expired 3 days ago:

$ rucio rule-info 39a6afe366144e0db800f7bb67ba2cd8
Id:                         39a6afe366144e0db800f7bb67ba2cd8
Account:                    rvaladao
Scope:                      user.rvaladao
Name:                       /UERJ/Datasets/USER
RSE Expression:             T2_BR_UERJ
Copies:                     1
State:                      STUCK
Locks OK/REPLICATING/STUCK: 3768/568949/685
Grouping:                   DATASET
Expires at:                 2024-04-26 19:21:33
Locked:                     False
Weight:                     None
Created at:                 2024-04-24 16:54:46
Updated at:                 2024-04-29 14:11:58
Error:                      RequestErrMsg.TRANSFER_FAILED:TRANSFER ERROR: Copy failed (3rd pull). Last attempt: Request cancellation was requested.
Subscription Id:            None
Source replica expression:  None
Activity:                   User Subscriptions
Comment:                    None
Ignore Quota:               False
Ignore Availability:        False
Purge replicas:             False
Notification:               NO
End of life:                None
Child Rule Id:              None
haozturk commented 6 months ago

This is what I see in judge-cleaner:

{"process": {"pid": 9}, "@timestamp": "2024-04-29T00:00:14.233Z", "message": "[7/8]: Deleting rule 39a6afe366144e0db800f7bb67ba2cd8 with expression T2_BR_UERJ", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"level": "INFO", "logger": "root"}}
{"process": {"pid": 9}, "@timestamp": "2024-04-29T00:00:14.242Z", "message": "[7/8]: Locks detected for 39a6afe366144e0db800f7bb67ba2cd8", "error": {"type": null, "message": null, "stack_trace": null}, "log": {"level": "WARNING", "logger": "root"}}

Checking further to understand what Locks detected means

haozturk commented 6 months ago

This is ORA-00054 db error whose message is resource busy and acquire with nowait specified or timeout expired. I suspect this is related to the size of the rule (~0.5M files). It might be that the DB call is timing out or there is an active query locking the table. Will investigate further.