dmwm / CMSRucio


Declare file bad not working #530

Open KatyEllis opened 1 year ago

KatyEllis commented 1 year ago

Hi,

I've suspected this functionality was not working in the past but assumed it was fixed - perhaps not.

I found a corrupt file on RAL disk and deleted it manually (this was required due to the nature of the corruption). Rucio still thought the file was present (and we do not want to wait one week for the next consistency check to run - this is data being recalled by a User for Analysis).

So I wanted to declare the file as bad, and therefore force Rucio to retransfer from RAL Tape. I tried the following command, which completed without a message (or error):

rucio-admin replicas declare-bad --reason 'File corrupted on disk' davs://webdav.echo.stfc.ac.uk:1094/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root

But when I check (even the next day) the file still appears as available in Rucio (and had not been re-transferred in the meantime):

rucio list-file-replicas cms:/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root --all-states

(A) T1_UK_RAL_Tape: root://antares.stfc.ac.uk:1094//eos/antares/prod/cms//store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root        |
| cms     | /store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root | 2.439 GB   | 571066d9  |
(A) T1_UK_RAL_Disk: davs://webdav.echo.stfc.ac.uk:1094/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root 

Here you see the file is not currently on disk:

gfal-ls davs://webdav.echo.stfc.ac.uk:1094/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root
gfal-ls error: 2 (No such file or directory) - Result HTTP 404 : File not found  after 1 attempts

How do I force Rucio to see the file as unavailable and therefore attempt a re-transfer from tape?

dynamic-entropy commented 1 year ago

I have been able to pinpoint the source of the error to: https://github.com/rucio/rucio/blob/6eac5dcc30cb6dac427cbc5070d46c170ae7fedf/lib/rucio/core/replica.py#L273-L283

The scope ends up being extracted as store. Thus the cases where people reported it working were those where the file path started with the path prefix /cms/ rather than directly with /store.
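To illustrate why the scope comes out as store, here is a minimal sketch of a "first path component is the scope" heuristic. It is a simplified illustration under that assumption, not the code at the link above; the function name naive_scope_from_path is hypothetical.

```python
def naive_scope_from_path(path: str) -> str:
    """Hypothetical helper: treat the first path component as the scope."""
    components = [part for part in path.split("/") if part]
    if not components:
        raise ValueError("empty path")
    return components[0]


# Path part of the PFN from this issue (no experiment prefix on the RSE).
cms_path = (
    "/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/"
    "9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root"
)
# The same file on a site that happens to put everything under /cms/.
prefixed_path = "/cms" + cms_path

print(naive_scope_from_path(cms_path))       # store
print(naive_scope_from_path(prefixed_path))  # cms
```

With such a heuristic the lookup only succeeds on sites whose prefix happens to match the scope name, which is consistent with the reports of it working in some places and not in others.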

I will think of a solution soon. We will need to create an issue with rucio/rucio.

KatyEllis commented 1 year ago

Thanks for checking it out @dynamic-entropy ! So if I try the same command again with davs://webdav.echo.stfc.ac.uk:1094/cms:/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root then it might work?

dynamic-entropy commented 1 year ago

No, that is not what I meant. There are sites that have a /cms/ prefix for all files, e.g. CNAF_Disk. But that is just by chance (it seems ATLAS has a mandatory /atlas prefix at all their sites).

KatyEllis commented 1 year ago

Ah I see. Yes, this is not guaranteed in CMS.

belforte commented 1 year ago

yeah...

WARNING : this part is ATLAS specific and must be changed

But surely there can be a less fragile way to find the scope from a PFN. Like asking the user to provide it! (Just joking, there's no good way to get the scope from a PFN.)

I am saddened by my poor understanding of Rucio. I did not think that scope makes sense for a file replica. A file on disk... well, it is a file on disk. Why does scope matter here? We can put the same file in multiple DIDs which have different scopes. My head is exploding!

dynamic-entropy commented 1 year ago

I think the way we use scopes is incompatible with how ATLAS uses them, and Rucio still has traces of that idea. Also, for us the scope does not determine the namespace where the file must go on storage, so we have to enforce rules on the LFN to ensure that a user scope has a specific LFN prefix (i.e. /store/user/rucio/username).

So, even though it does not make sense to us, it does for other experiments that rely on it for path resolution.

Now, for declaring replicas bad we can use:

client.declare_bad_file_replicas([{'scope':scope, 'name':name, 'rse_id':rse_id}], reason="testing declare bad")

This does not put us in the "getting rse, scope etc. from pfn flow".
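For the RAL file in this issue, that call could be wrapped roughly as follows. This is a sketch under assumptions: a configured Rucio client environment, and that client.get_rse() returns a dict carrying the RSE id under 'id'. The helper names declare_replica_bad and main are hypothetical.

```python
def declare_replica_bad(client, scope, name, rse_name, reason):
    """Declare one replica bad by DID and RSE, bypassing PFN parsing.

    `client` is expected to behave like rucio.client.Client; this helper is
    hypothetical and only wraps the call quoted above.
    """
    # Assumption: get_rse() returns a dict with the RSE id under 'id'.
    rse_id = client.get_rse(rse_name)["id"]
    replicas = [{"scope": scope, "name": name, "rse_id": rse_id}]
    return client.declare_bad_file_replicas(replicas, reason=reason)


def main():
    # Imported lazily so the helper above can be exercised without Rucio installed.
    from rucio.client import Client

    result = declare_replica_bad(
        Client(),
        scope="cms",
        name=(
            "/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/"
            "9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root"
        ),
        rse_name="T1_UK_RAL_Disk",
        reason="File corrupted on disk",
    )
    # Print whatever the server returns so it can be inspected by hand.
    print(result)
```

Calling main() from an environment with valid Rucio credentials would submit the declaration; rucio list-file-replicas with --all-states can then be used to check whether the replica state changed.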

belforte commented 1 year ago

Thanks @dynamic-entropy. Yes, I had already noted client.declare_bad_file_replicas([{'scope', 'name', 'rse_id'}], reason=""), although this is not possible via the CLI as far as I could see.

This matter of scope to namespace to LFN path mapping appears critical, but also subtle and IMHO not well digested by myself at least ! It does not help that Rucio documentation suggests that "file" means a {scope,name} DID. Which goes against everybody's notion of what a file is.

So, do I understand correctly that a replica has a scope ? If so, what will happen when I create a container in my scope using file DID's from CMS scope ? Are new replicas created in Rucio ?

dynamic-entropy commented 1 year ago

Yes a replica has a scope.

No, because a container scope has nothing to do with a file scope. A container never maps to a physical path on the storage.

belforte commented 1 year ago

Now I guess I start making sense of this worrying sentence in Rucio documentation

Thus for files, the Logical File Name (LFN), a term commonly used in DataGrid terminology to identify files is equivalent to the DID in Rucio.

Whereas until now I was thinking of {scope,name} as {scope,LFN}. I suspect that it is mostly a matter of which kind of LFN's people are used to. If one were to start from scratch, a file organization like <rse-dependent-prefix>/scope/LFN surely makes sense.

OK let me try to write it in a different way.

In ATLAS, they decided that file DID {scope,LFN} is written on storage in /scope/LFN

In CMS we decided not to change PFN's when moving to Rucio, so {'cms',LFN} stays on disk in /LFN

That means that the first /../ token in the PFN is not the scope, simply it is always store

so we have file names like

/store/user/belforte/.. : Rucio knows nothing about
/store/user/rucio/belforte/... : should only be used to construct replicas in user.belforte scope
/store/anythingelse/.. : should only be used to construct replicas in cms scope
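The three cases above can be written down as a small lookup. This is only a sketch of the convention as described in this comment, not code from CMSRucio or Rucio; the function name cms_scope_for_lfn is hypothetical.

```python
from typing import Optional


def cms_scope_for_lfn(lfn: str) -> Optional[str]:
    """Hypothetical helper: map a CMS LFN to the scope its replicas belong to.

    Returns None for paths that Rucio is not expected to know about.
    """
    parts = lfn.split("/")
    # An absolute LFN splits into ['', 'store', ...].
    if len(parts) < 3 or parts[0] != "" or parts[1] != "store":
        raise ValueError(f"not a /store/ LFN: {lfn!r}")
    if parts[2] == "user":
        # /store/user/rucio/<username>/... maps to the user.<username> scope.
        if len(parts) > 4 and parts[3] == "rucio" and parts[4]:
            return f"user.{parts[4]}"
        # Any other /store/user/... path is outside Rucio.
        return None
    # Everything else under /store/ belongs to the cms scope.
    return "cms"


print(cms_scope_for_lfn("/store/user/belforte/file.root"))        # None
print(cms_scope_for_lfn("/store/user/rucio/belforte/file.root"))  # user.belforte
print(cms_scope_for_lfn("/store/data/Run2016D/BTagMu/file.root")) # cms
```

Note that the scope is derived from the LFN here, not from the first token of the path on storage, which is always store.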

QUESTION: why do we need to put replicas of files from /store/user/rucio/belforte in user.belforte scope ? Since we expect that those files will be fully managed by Rucio as far as copy/move/delete goes.. why not put them in cms scope ? That will make it very easy to determine scope for CMS file DID's, even simpler than for ATLAS !

ericvaandering commented 1 year ago

"QUESTION: why do we need to put replicas of files from /store/user/rucio/belforte in user.belforte scope ? Since we expect that those files will be fully managed by Rucio as far as copy/move/delete goes.. why not put them in cms scope ? That will make it very easy to determine scope for CMS file DID's, even simpler than for ATLAS !"

Because we have limited who can make things in CMS scope.

Actually the ATLAS situation is more complex than you lay out. First, they have many scopes (like a scope per campaign). We could not do that because we needed a simple way to translate (scope, did) to PFN. In fact, ATLAS has a one-way hash to translate (scope, did) to LFN, so it's not possible to calculate it backwards, only look it up.

belforte commented 1 year ago

I see. Indeed changes in CMS scope must be under control. Well... given that we transfer file ownership to the Rucio robot at some point, it still makes some sense to move replicas into CMS scope as well, even if it needs to be done by an authorized daemon. But I do not think that we gain anything at the moment. Let's gather experience first.

thanks

ericvaandering commented 1 year ago

Scope ownership and certificate ownership are totally unrelated. The latter is non-negotiable. There is no way in Rucio for Rucio managed data to not be owned by Rucio. But within Rucio, we can have user scopes which have relaxed rules as long as they are not writing in the CMS LFN/PFN namespace.

KatyEllis commented 1 year ago

Hi, is there any progress on the functionality I mention in my description? I have a file at RAL I would like to declare bad. Katy

belforte commented 1 year ago

Katy, can you use the Python client as Rahul indicated?

client.declare_bad_file_replicas([{'scope':scope, 'name':name, 'rse_id':rse_id}], reason="testing declare bad")

voetberg commented 10 months ago

I started working on a patch for this in https://github.com/rucio/rucio/commit/d8dc808a48c9b870ec1f7d72d8790e39683af6c8. It is just a policy package plugin to parse PFNs based on the config.