Closed FernandoGarzon closed 1 year ago
@dciangot, since Eric is on holiday this week, can you take a look at this?
Sure, @FernandoGarzon. I'll need some details to reproduce the problem; in particular, I think the reason and the rse would be enough.
For sure.
The last couple of times I tried to run a script that called this function, I wrote this:
#!/usr/bin/env python
import sys
import traceback
import os
import multiprocessing as mp

from rucio.client import Client

rucio = Client()
rucio.ping()

def invalidate_file(rse, name, reason):
    rucio.declare_bad_did_replicas(rse=rse, dids=[{'scope': 'cms', 'name': name}], reason=reason)
    print('Done!')

if __name__ == '__main__':
    rucio.ping()
    name = sys.argv[1]
    rse = sys.argv[2]
    reason = sys.argv[3]
    print(name, rse, reason)
    invalidate_file(rse, name, reason)
In the console I run:
$ python3 invalidations.py /store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_7_0_4_START70_V7-v1/00000/FA2381FE-13CD-E311-B009-02163E00F2B6.root T1_US_FNAL_Disk 'Test'
This used to work fine.
Indeed, it looks like a bug of some sort (*). I'm diving into the code base to be more precise. brb
(*)
[Thu Jun 16 15:21:52.140898 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] Traceback (most recent call last):
[Thu Jun 16 15:21:52.140902 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/rucio/web/rest/flaskapi/v1/common.py", line 55, in dispatch_request
[Thu Jun 16 15:21:52.140906 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] return super(ErrorHandlingMethodView, self).dispatch_request(*args, **kwargs)
[Thu Jun 16 15:21:52.140909 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/flask/views.py", line 158, in dispatch_request
[Thu Jun 16 15:21:52.140911 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] return current_app.ensure_sync(meth)(*args, **kwargs)
[Thu Jun 16 15:21:52.140914 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/rucio/web/rest/flaskapi/v1/replicas.py", line 793, in post
[Thu Jun 16 15:21:52.140917 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] vo=request.environ.get('vo'),
[Thu Jun 16 15:21:52.140919 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/rucio/db/sqla/session.py", line 365, in new_funct
[Thu Jun 16 15:21:52.140922 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] result = function(*args, **kwargs)
[Thu Jun 16 15:21:52.140924 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/rucio/api/replica.py", line 462, in add_bad_dids
[Thu Jun 16 15:21:52.140927 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] return replica.add_bad_dids(dids=dids, rse_id=rse_id, reason=reason, issuer=issuer, state=state, session=session)
[Thu Jun 16 15:21:52.140929 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/rucio/db/sqla/session.py", line 379, in new_funct
[Thu Jun 16 15:21:52.140932 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] result = function(*args, **kwargs)
[Thu Jun 16 15:21:52.140934 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] File "/usr/local/lib/python3.6/site-packages/rucio/core/replica.py", line 359, in add_bad_dids
[Thu Jun 16 15:21:52.140937 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] list_replicas.append((scope, name, None))
[Thu Jun 16 15:21:52.140941 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] AttributeError: 'function' object has no attribute 'append'
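For context, this kind of AttributeError usually comes from a name collision: the traceback shows the code appending to list_replicas, which is also the name of a function in that module, instead of the local list (replicas_list, per the fix suggested below). A minimal sketch of the pattern; everything except the two names taken from the traceback and the fix is made up for illustration:

```python
# Minimal reproduction of the failure pattern seen in the traceback:
# a module-level function shadows the list the code meant to fill.

def list_replicas():
    # Stand-in for the real function of the same name in the module.
    return []

def add_bad_dids(dids):
    replicas_list = []  # the list the code actually intended to append to
    for scope, name in dids:
        # Buggy line (as in the traceback):
        #     list_replicas.append((scope, name, None))
        # -> AttributeError: 'function' object has no attribute 'append'
        replicas_list.append((scope, name, None))  # corrected target
    return replicas_list

print(add_bad_dids([("cms", "/store/test.root")]))
```

Calling .append on the function object is what produces the exact AttributeError in the server log; renaming the target to the local list is the whole fix.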
related to this: https://github.com/rucio/rucio/issues/5642
Still to be understood how we can mitigate it.
From Martin on slack support channel:
Will try to add it to the next release, in the meanwhile, you can also fix this manually in the deployment. It’s just changing list_replicas.append to replicas_list.append here
I'll give it a try on this asap
@ericvaandering I couldn't figure out where the current rucio-server image is built, can you share a pointer for reference?
I think @FernandoGarzon should have been able to work around this with a live patch on the running servers.
It's here: https://github.com/dmwm/CMSKubernetes/tree/master/docker/rucio-server but will be moved to CMSRucio repository "soon"
I don't know that editing the Python in the containers would work, since Python modules are compiled and cached when the process starts, so changes to the source don't get propagated to already-running code.
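Eric's point can be demonstrated in a few lines: once a module is imported, the running process keeps the loaded copy in memory, so editing the source file on disk has no effect until the module is reloaded or the process restarts. A self-contained sketch (the module name here is made up for the demo):

```python
import importlib
import os
import sys
import tempfile

# Create a throwaway module on disk (name is hypothetical).
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "livepatch_demo.py")
with open(path, "w") as f:
    f.write("VALUE = 'before patch'\n")

sys.path.insert(0, tmpdir)
import livepatch_demo
print(livepatch_demo.VALUE)       # the imported copy: 'before patch'

# Edit the source on disk, as a live patch inside a container would.
with open(path, "w") as f:
    f.write("VALUE = 'after patch'\n")

print(livepatch_demo.VALUE)       # still 'before patch': the process never re-read the file

importlib.reload(livepatch_demo)  # roughly what a server process restart achieves
print(livepatch_demo.VALUE)       # now 'after patch'
```

So a live patch in the container only takes effect after the wsgi/httpd workers are restarted (or the pod is recycled).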
If there is a PR I can add a patch to the code.
I added a PR: https://github.com/rucio/rucio/pull/5662/files
I hope it is of some use.
https://github.com/rucio/rucio/pull/5650/files is there too
The problem is that I can't get this to apply and run correctly, and I really don't understand why. I'll look again tomorrow.
Hello
The invalidation function is working fine, @ericvaandering. Thanks.
There is a very similar issue with the add_replicas() function: Link (I don't think it's necessary to create a new GH issue for this).
When running it I get:
client.add_replicas(rse = rse, files = [{'scope': 'cms', 'name': name}], ignore_availability=True)
RucioException Traceback (most recent call last)
<ipython-input-7-2032a908ce5c> in <module>
----> 1 client.add_replicas(rse = rse, files = [{'scope': 'cms', 'name': name}], ignore_availability=True)
~/RucioClientJup/rucjup/lib64/python3.6/site-packages/rucio/client/replicaclient.py in add_replicas(self, rse, files, ignore_availability)
265 return True
266 exc_cls, exc_msg = self._get_exception(headers=r.headers, status_code=r.status_code, data=r.content)
--> 267 raise exc_cls(exc_msg)
268
269 def delete_replicas(self, rse, files, ignore_availability=True):
RucioException: An unknown exception occurred.
Details: no error information passed (http status code: 500)
where
name = '/store/mc/RunIISummer20UL16NanoAODv9/NMSSM_XToYHTo2B2G_MX-1200_MY-550_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v3/80000/0CD0AC87-A28C-3A4F-B0D8-86145272742F.root'
rse = 'T2_US_Vanderbilt'
I tried ignore_availability set to both False and True; the result is the same.
That is similar to the error we got when attempting to invalidate.
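One thing worth checking (speculation on my side): add_replicas() normally expects each file dict to carry the replica's size and checksum, e.g. 'bytes' and 'adler32', in addition to 'scope' and 'name', and a server-side 500 with no error information could come from those missing fields. A sketch of a fuller call; the helper name, size, and checksum values below are placeholders, not the real metadata for this file:

```python
# Hypothetical helper: build a file dict with the fields add_replicas()
# usually expects. The 'bytes' and 'adler32' values must match the actual
# file; the ones used below are placeholders for illustration only.
def make_replica(name, nbytes, adler32, scope="cms"):
    return {
        "scope": scope,
        "name": name,
        "bytes": nbytes,
        "adler32": adler32,
    }

replica = make_replica(
    "/store/example/file.root",  # placeholder LFN
    nbytes=123456789,            # placeholder size in bytes
    adler32="deadbeef",          # placeholder adler32 checksum
)
print(replica)

# With a working client and the real metadata, the call would then be:
# client.add_replicas(rse="T2_US_Vanderbilt", files=[replica],
#                     ignore_availability=True)
```

If the real size/checksum are unknown, they can usually be read back from the DID metadata of an existing replica before re-registering it at another RSE.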
I'm using this in order to figure out what's happening at Nebraska: https://ggus.eu/index.php?mode=ticket_info&ticket_id=157804
There are many unique replicas at Nebraska according to rucio that are missing at the site. I found these replicas at Vanderbilt using:
$ xrdfs cms-xrd-global.cern.ch locate -d -m /store/mc/RunIISummer20UL16MiniAODAPVv2/GkkTogRadionTogVV_MGkk4000-MR3200_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v11-v3/80000/64AA333B-8B5D-4C4D-925A-B8B6A6EA8477.root
xrootd-se24-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite
xrootd-se1-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite
xrootd-hv1-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite
xrootd-se7-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite
xrootd-se27-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite
...
I also ran the same function with rse = 'T1_US_FNAL_Disk' and it returned True, meaning a replica at FNAL was registered in Rucio for this file. But when I check the file's replicas in the Rucio CLI, FNAL isn't shown, so I think the function is failing to actually add the replica there.
This doesn't solve the root problem, which could be a data loss at Nebraska, but it could help us retrieve the lost files; that's why I need the add_replicas() function.
Let me know please if I am interpreting things wrong here.
Thanks.
The functions client.declare_bad_file_replicas(pfns, reason) and client.declare_bad_did_replicas(rse, dids, reason) throw the following exception/error:
Similarly, using Jupyter Notebook:
Checking the logs of the following pods in rucio k8s cluster (rucio-server container):
server-rucio-server-fbb969f46-dk6q7
server-rucio-server-fbb969f46-mnv5g
server-rucio-server-fbb969f46-t78zd
We find that, for the 1st and 3rd pod:
[2022-06-15 16:33:18] 2001:4860::2:feca 188.185.87.84 188.185.17.172 YqoJznfvxgoDBY-JJALeHAAAAXM 200 726 21 2072 "GET /ping HTTP/1.1" "transfer_ops-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ogarzonm/CN=846074/CN=Oscar Fernando Garzon Miguez-unknown-3554076b155e44be8b86af05385326ba" "rucio-clients/1.28.3" invalidations.py::/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_7_0_4_START70_V7-v1/00000/FA2381FE-13CD-E311-B009-02163E00F2B6.root
and for the 2nd pod:
[2022-06-15 16:33:18] 2001:4860::4:57e0 10.100.23.0 188.185.17.172 YqoJzuRLuVmLlnYv3YMMRQAAAVo 500 957 43 20324 "POST /replicas/bad/dids HTTP/1.1" "transfer_ops-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ogarzonm/CN=846074/CN=Oscar Fernando Garzon Miguez-unknown-3554076b155e44be8b86af05385326ba" "rucio-clients/1.28.3" invalidations.py::/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_7_0_4_START70_V7-v1/00000/FA2381FE-13CD-E311-B009-02163E00F2B6.root
I don't find any error logs in the httpd-error-log container regarding this file or my account (ogarzonm).
In the first log, note the "GET /ping" request, whereas the second log shows "POST /replicas/bad/dids HTTP/1.1", which I think is the server's actual attempt to label the DID as bad. However, this change never propagates, and it keeps returning the 'no error information passed (http status code: 500)' message.
The POST request is issued inside both declare_bad_file_replicas() and declare_bad_did_replicas(), as seen here: Here, but not the GET request.
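For reference, the request line in the server log (POST /replicas/bad/dids) suggests the client serializes the call's arguments into a JSON body for that endpoint. A rough sketch of what that payload would look like; the field names are an assumption based on the client-side call signature and have not been verified against the server code:

```python
import json

# Approximate payload for the POST /replicas/bad/dids request seen in the
# server log. Field names are assumptions from the client call signature
# (rse, dids, reason), not verified against the Rucio server code.
payload = {
    "rse": "T1_US_FNAL_Disk",
    "dids": [{"scope": "cms",
              "name": ("/store/mc/HC/GenericTTbar/GEN-SIM-RECO/"
                       "CMSSW_7_0_4_START70_V7-v1/00000/"
                       "FA2381FE-13CD-E311-B009-02163E00F2B6.root")}],
    "reason": "Test",
}
body = json.dumps(payload)
print(body)
```

Since the 500 comes back with no error information, capturing this request/response pair (e.g. with client-side debug logging) would at least confirm whether the request reaches the server in the expected shape.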
Is there a way we can get this fixed so we can invalidate stuff again?
Thank you.