Problems with Local Invalidations

FernandoGarzon commented 2 years ago

The functions client.declare_bad_file_replicas(pfns, reason) and client.declare_bad_did_replicas(rse, dids, reason) throw the following exception/error:

rucio.declare_bad_did_replicas(rse=rse, dids=[{'scope':'cms', 'name':name}], reason=reason)
  File "/cvmfs/cms.cern.ch/rucio/x86_64/rhel7/py3/current/lib/python3.6/site-packages/rucio/client/replicaclient.py", line 63, in declare_bad_did_replicas
    raise exc_cls(exc_msg)
rucio.common.exception.RucioException: An unknown exception occurred.
Details: no error information passed (http status code: 500)

Similarly, using Jupyter Notebook:

~/RucioClientJup/rucjup/lib64/python3.6/site-packages/rucio/client/replicaclient.py in declare_bad_file_replicas(self, pfns, reason)
     63             return loads(r.text)
     64         exc_cls, exc_msg = self._get_exception(headers=r.headers, status_code=r.status_code, data=r.content)
---> 65         raise exc_cls(exc_msg)
     66 
     67     def declare_bad_did_replicas(self, rse, dids, reason):

RucioException: An unknown exception occurred.
Details: no error information passed (http status code: 500)

Checking the logs of the following pods in rucio k8s cluster (rucio-server container):

server-rucio-server-fbb969f46-dk6q7
server-rucio-server-fbb969f46-mnv5g
server-rucio-server-fbb969f46-t78zd

We find that, for the 1st and 3rd pod:

[2022-06-15 16:33:18] 2001:4860::2:feca 188.185.87.84 188.185.17.172 YqoJznfvxgoDBY-JJALeHAAAAXM 200 726 21 2072 "GET /ping HTTP/1.1" "transfer_ops-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ogarzonm/CN=846074/CN=Oscar Fernando Garzon Miguez-unknown-3554076b155e44be8b86af05385326ba""rucio-clients/1.28.3" invalidations.py::/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_7_0_4_START70_V7-v1/00000/FA2381FE-13CD-E311-B009-02163E00F2B6.root

and for the 2nd pod:

[2022-06-15 16:33:18] 2001:4860::4:57e0 10.100.23.0 188.185.17.172 YqoJzuRLuVmLlnYv3YMMRQAAAVo 500 957 43 20324 "POST /replicas/bad/dids HTTP/1.1" "transfer_ops-/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ogarzonm/CN=846074/CN=Oscar Fernando Garzon Miguez-unknown-3554076b155e44be8b86af05385326ba" "rucio-clients/1.28.3" invalidations.py::/store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_7_0_4_START70_V7-v1/00000/FA2381FE-13CD-E311-B009-02163E00F2B6.root

I don't find any error logs in the httpd-error-log container regarding this file or my account (ogarzonm).

In the case of the 1st error log, note the "GET" function there, whereas, for the second log, we can see the "POST /replicas/bad/dids HTTP/1.1" message, which I think is actually an attempt of the server to actually label the did as bad. However, this change never propagates, and it keeps showing the 'no error information passed (http status code: 500)' message.

The 'POST' command is actually called inside both functions declare_bad_file_replicas() and client.declare_bad_did_replicas(), as seen in here: Here, but not the GET command.

Is there a way we can get this fixed so we can invalidate stuff again?

Thank you.

klannon commented 2 years ago

@dciangot, since Eric is on holiday this week, can you take a look at this?

dciangot commented 2 years ago

sure, @FernandoGarzon I'll need some details to reproduce the problem. In particular I think that reason and rse would be enough

FernandoGarzon commented 2 years ago

For sure.

The last couple of times I tried to run a Script that called this function, I wrote this:

#!/usr/bin/env python

import sys
import traceback
import os
import multiprocessing as mp
from rucio.client import Client
rucio = Client()
rucio.ping()

def invalidate_file(rse, name, reason):
    rucio.declare_bad_did_replicas(rse=rse, dids=[{'scope':'cms', 'name':name}], reason=reason)
    print('Done!')

if __name__ == '__main__':
    rucio.ping()
    rse = sys.argv[2]
    reason = sys.argv[3]
    name = sys.argv[1]
    print(name, rse, reason)
    invalidate_file(rse, name, reason)

In the console I run:

$ python3 invalidations.py /store/mc/HC/GenericTTbar/GEN-SIM-RECO/CMSSW_7_0_4_START70_V7-v1/00000/FA2381FE-13CD-E311-B009-02163E00F2B6.root T1_US_FNAL_Disk 'Test'

This used to work fine.

dciangot commented 2 years ago

Indeed looks like a bug of some sort (*). I'm diving into the code base to be more precise. brb

(*)

[Thu Jun 16 15:21:52.140898 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] Traceback (most recent call last):
[Thu Jun 16 15:21:52.140902 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/rucio/web/rest/flaskapi/v1/common.py", line 55, in dispatch_request
[Thu Jun 16 15:21:52.140906 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     return super(ErrorHandlingMethodView, self).dispatch_request(*args, **kwargs)
[Thu Jun 16 15:21:52.140909 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/flask/views.py", line 158, in dispatch_request
[Thu Jun 16 15:21:52.140911 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     return current_app.ensure_sync(meth)(*args, **kwargs)
[Thu Jun 16 15:21:52.140914 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/rucio/web/rest/flaskapi/v1/replicas.py", line 793, in post
[Thu Jun 16 15:21:52.140917 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     vo=request.environ.get('vo'),
[Thu Jun 16 15:21:52.140919 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/rucio/db/sqla/session.py", line 365, in new_funct
[Thu Jun 16 15:21:52.140922 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     result = function(*args, **kwargs)
[Thu Jun 16 15:21:52.140924 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/rucio/api/replica.py", line 462, in add_bad_dids
[Thu Jun 16 15:21:52.140927 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     return replica.add_bad_dids(dids=dids, rse_id=rse_id, reason=reason, issuer=issuer, state=state, session=session)
[Thu Jun 16 15:21:52.140929 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/rucio/db/sqla/session.py", line 379, in new_funct
[Thu Jun 16 15:21:52.140932 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     result = function(*args, **kwargs)
[Thu Jun 16 15:21:52.140934 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]   File "/usr/local/lib/python3.6/site-packages/rucio/core/replica.py", line 359, in add_bad_dids
[Thu Jun 16 15:21:52.140937 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946]     list_replicas.append((scope, name, None))
[Thu Jun 16 15:21:52.140941 2022] [wsgi:error] [pid 142922:tid 140110672299776] [remote 10.100.20.128:54946] AttributeError: 'function' object has no attribute 'append'

dciangot commented 2 years ago

still to be understand how we can mitigate it

dciangot commented 2 years ago

From Martin on slack support channel:

Will try to add it to the next release, in the meanwhile, you can also fix this manually in the deployment. It’s just changing list_replicas.append to replicas_list.append  here

I'll give it a try on this asap

dciangot commented 2 years ago

@ericvaandering I couldn't figure out where the current rucio-server image is built, can you share a pointer for reference?

I know that @FernandoGarzon should have been able to workaroud this with a live patch on the running servers I think.

ericvaandering commented 2 years ago

It's here: https://github.com/dmwm/CMSKubernetes/tree/master/docker/rucio-server but will be moved to CMSRucio repository "soon"

I don't know that editing the python in the containers would work since generally python is compiled when it's run and changes don't get propagated to running code.

If there is a PR I can add a patch to the code.

FernandoGarzon commented 2 years ago

I added a PR: https://github.com/rucio/rucio/pull/5662/files

I hope it is of some use.

ericvaandering commented 2 years ago

https://github.com/rucio/rucio/pull/5650/files is there too

Problem is I can't get this to apply and run correctly. I really don't understand why. I'll look again tomorrow.

FernandoGarzon commented 2 years ago

Hello

The invalidation function is working fine, @ericvaandering. Thanks.

There is a very similar issue with the add_replicas() function: Link (I don't think is extremely necessary to create a new GH Issue for this).

When running it I get:

client.add_replicas(rse = rse, files = [{'scope': 'cms', 'name': name}], ignore_availability=True)

RucioException                            Traceback (most recent call last)
<ipython-input-7-2032a908ce5c> in <module>
----> 1 client.add_replicas(rse = rse, files = [{'scope': 'cms', 'name': name}], ignore_availability=True)

~/RucioClientJup/rucjup/lib64/python3.6/site-packages/rucio/client/replicaclient.py in add_replicas(self, rse, files, ignore_availability)
    265             return True
    266         exc_cls, exc_msg = self._get_exception(headers=r.headers, status_code=r.status_code, data=r.content)
--> 267         raise exc_cls(exc_msg)
    268 
    269     def delete_replicas(self, rse, files, ignore_availability=True):

RucioException: An unknown exception occurred.
Details: no error information passed (http status code: 500)

where

name = '/store/mc/RunIISummer20UL16NanoAODv9/NMSSM_XToYHTo2B2G_MX-1200_MY-550_TuneCP5_13TeV-madgraph-pythia8/NANOAODSIM/106X_mcRun2_asymptotic_v17-v3/80000/0CD0AC87-A28C-3A4F-B0D8-86145272742F.root'

rse = 'T2_US_Vanderbilt'

ignore_availability was set to False and True, and it gives the same result.

That is a similar error we got when attempting to invalidate.

I'm using this in order to figure out what's happening at Nebraska: https://ggus.eu/index.php?mode=ticket_info&ticket_id=157804

There are many unique replicas at Nebraska according to rucio that are missing at the site. I found these replicas at Vanderbilt using:

$ xrdfs cms-xrd-global.cern.ch locate -d -m /store/mc/RunIISummer20UL16MiniAODAPVv2/GkkTogRadionTogVV_MGkk4000-MR3200_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v11-v3/80000/64AA333B-8B5D-4C4D-925A-B8B6A6EA8477.root 
xrootd-se24-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite 
xrootd-se1-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite 
xrootd-hv1-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite 
xrootd-se7-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite 
xrootd-se27-vanderbilt.sites.opensciencegrid.org:1094 Server ReadWrite
...

I also run the same function with rse = 'T1_US_FNAL_Disk' and it gives 'True', meaning that a replica at FNAL for this file was added in rucio, but when checking the replicas of the file in the rucio cli, FNAL isn't shown there, so I think the function is failing to add the replica there.

This doesn't solve the root problem which could be a data loss at Nebraska but could probably help us retrieve the lost files, reason why I need the add_replicas() function.

Let me know please if I am interpreting things wrong here.

Thanks.

dmwm / CMSRucio

Problems with Local Invalidations #294