KatyEllis opened this issue 1 year ago
I have been able to pinpoint the source of the error to: https://github.com/rucio/rucio/blob/6eac5dcc30cb6dac427cbc5070d46c170ae7fedf/lib/rucio/core/replica.py#L273-L283
The scope ends up being extracted as `store`. That explains why the cases people reported as working were those where the file path started with the prefix `cms/*` instead of `/`.
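To make the failure mode concrete, here is a hypothetical illustration (not the actual Rucio code linked above) of what goes wrong when the first path component of an LFN is taken as the scope:

```python
# Hypothetical sketch of naive scope extraction from an LFN; the real Rucio
# code is linked above, this just shows why "store" comes out as the scope.

def naive_scope_from_lfn(lfn: str) -> str:
    """Take the first non-empty path component as the scope."""
    return [part for part in lfn.split("/") if part][0]

# CMS file paths start directly with /store, so the "scope" comes out wrong:
print(naive_scope_from_lfn("/store/data/Run2016D/file.root"))       # store
# With a site prefix like /cms/ it happens to look right:
print(naive_scope_from_lfn("/cms/store/data/Run2016D/file.root"))   # cms
```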
I will think of a solution soon. We will need to create an issue with rucio/rucio.
Thanks for checking it out @dynamic-entropy! So if I try the same command again with `davs://webdav.echo.stfc.ac.uk:1094/cms:/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root` then it might work?
No, that is not what I meant.
There are sites that have a `/cms/` prefix for all files, e.g. CNAF_Disk. But that is just by chance (it seems ATLAS has a mandatory `/atlas` prefix at all their sites).
Ah I see. Yes, this is not guaranteed in CMS.
yeah...
But surely there can be a less fragile way to find the scope from a PFN. Like asking the user to provide it! (just joking, there's no good way to get the scope from a PFN)
I am saddened by my poor understanding of Rucio. I did not think that scope made sense for a file replica. A file on disk... well, it is a file on disk. Why does scope matter here? We can put the same file in multiple DIDs which have different scopes. My head is exploding!
I think the way we use scopes is incompatible with how ATLAS uses them, and thus Rucio still carries traces of that idea.
But also because, for us, the scope does not determine the namespace where the file must go on storage, so we have to force rules on the LFN to ensure that a user scope has a specific LFN prefix (i.e. /store/user/rucio/username).
So, even though it does not make sense to us, it does for other experiments that rely on it for path resolution.
Now, for declaring replicas bad we can use:
client.declare_bad_file_replicas([{'scope':scope, 'name':name, 'rse_id':rse_id}], reason="testing declare bad")
This does not put us in the "getting rse, scope etc. from pfn flow".
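A minimal sketch of that dict-based flow, for anyone landing on this thread. The payload shape matches the call quoted above; the `rse_id` value shown is hypothetical (it would come from the server, e.g. via `client.get_rse(...)`), and the actual client call needs a configured Rucio server, so it is left commented out:

```python
# Sketch of the dict-based declare_bad_file_replicas flow quoted above.
# "some-rse-id" is a placeholder; the real id comes from the Rucio server.

def build_bad_replica_payload(scope: str, name: str, rse_id: str) -> list:
    """Build the replica list expected by declare_bad_file_replicas."""
    return [{"scope": scope, "name": name, "rse_id": rse_id}]

payload = build_bad_replica_payload(
    "cms",
    "/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root",
    "some-rse-id",
)

# Requires a configured client and server, so not executed here:
# from rucio.client import Client
# client = Client()
# client.declare_bad_file_replicas(payload, reason="File corrupted on disk")
```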
thanks @dynamic-entropy
Yes, about `client.declare_bad_file_replicas([{'scope': scope, 'name': name, 'rse_id': rse_id}], reason="")`: I noted that already, although this is not possible via the CLI as far as I could see.
This matter of the scope-to-namespace-to-LFN-path mapping appears critical, but also subtle and, IMHO, not well digested, by myself at least! It does not help that the Rucio documentation suggests that "file" means a {scope, name} DID, which goes against everybody's notion of what a file is.
So, do I understand correctly that a replica has a scope? If so, what will happen when I create a container in my scope using file DIDs from the cms scope? Are new replicas created in Rucio?
Yes a replica has a scope.
No, because a container scope has nothing to do with a file scope. A container never maps to a physical path on the storage.
Now I guess I am starting to make sense of this worrying sentence in the Rucio documentation:
"Thus for files, the Logical File Name (LFN), a term commonly used in DataGrid terminology to identify files is equivalent to the DID in Rucio."
While until now I was thinking of {scope, name} as {scope, LFN}. I suspect it is mostly a matter of which kind of LFNs people are used to. If one were to start from scratch, a file organization like <rse-dependent-prefix>/scope/LFN surely makes sense.
OK let me try to write it in a different way.
In ATLAS, they decided that a file DID {scope, LFN} is written on storage as <prefix>/scope/LFN. In CMS everything is written under /store, so we have file names like:
- /store/user/belforte/...: Rucio knows nothing about these
- /store/user/rucio/belforte/...: should only be used to construct replicas in the user.belforte scope
- /store/anythingelse/...: should only be used to construct replicas in the cms scope
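Reading those rules as code may help; this is only my sketch of the convention described above, not anything in the Rucio policy package:

```python
# Sketch of the CMS scope-from-LFN convention described above (illustrative
# only; the real logic would live in the CMS policy package).

def cms_scope_from_lfn(lfn):
    parts = [p for p in lfn.split("/") if p]
    if parts[:3] == ["store", "user", "rucio"] and len(parts) > 3:
        return "user." + parts[3]   # /store/user/rucio/<user>/...
    if parts[:2] == ["store", "user"]:
        return None                 # Rucio knows nothing about these
    if parts and parts[0] == "store":
        return "cms"                # everything else under /store
    return None

print(cms_scope_from_lfn("/store/user/rucio/belforte/f.root"))  # user.belforte
print(cms_scope_from_lfn("/store/data/Run2016D/f.root"))        # cms
```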
QUESTION: why do we need to put replicas of files from /store/user/rucio/belforte in the user.belforte scope? Since we expect those files to be fully managed by Rucio as far as copy/move/delete goes, why not put them in the cms scope? That would make it very easy to determine the scope for CMS file DIDs, even simpler than for ATLAS!
"QUESTION: why do we need to put replicas of files from /store/user/rucio/belforte in user.belforte scope ? Since we expect that those files will be fully managed by Rucio as far as copy/move/delete goes.. why not put them in cms scope ? That will make it very easy to determine scope for CMS file DID's, even simpler than for ATLAS !"
Because we have limited who can make things in CMS scope.
Actually the ATLAS situation is more complex than you lay out. First, they have many scopes (e.g. a scope per campaign). We could not do that because we needed a simple way to translate (scope, did) to a PFN. In fact, ATLAS uses a one-way hash to translate (scope, did) to an LFN, so it's not possible to calculate it backwards, only to look it up.
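For readers unfamiliar with that one-way mapping, here is a simplified sketch in the flavor of Rucio's deterministic "hash" convention (details may differ from the real implementation):

```python
import hashlib

# Simplified sketch of a deterministic hash-style (scope, name) -> path
# mapping, in the flavor of Rucio's default "hash" convention.

def hash_path(scope, name):
    """Map (scope, name) to a storage path via md5-derived subdirectories."""
    h = hashlib.md5(f"{scope}:{name}".encode("utf-8")).hexdigest()
    if scope.startswith("user.") or scope.startswith("group."):
        scope = scope.replace(".", "/")
    return f"{scope}/{h[0:2]}/{h[2:4]}/{name}"

# The hashed directory levels can only be recomputed if you already know
# both scope and name, so the mapping effectively runs one way.
print(hash_path("data16", "AOD.12345._000001.pool.root"))
```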
I see. Indeed changes in the cms scope must be under control. Well... given that we transfer file ownership to the Rucio robot at some point, it still makes some sense to move replicas into the cms scope as well, even if it needs to be done by an authorized daemon. But I do not think that we gain anything at the moment. Let's gather experience first.
thanks
Scope ownership and certificate ownership are totally unrelated. The latter is non-negotiable. There is no way in Rucio for Rucio managed data to not be owned by Rucio. But within Rucio, we can have user scopes which have relaxed rules as long as they are not writing in the CMS LFN/PFN namespace.
Hi, is there any progress on the functionality I mentioned in my description? I have a file at RAL I would like to declare bad. Katy
Katy, can you use the Python client as Rahul indicated?
client.declare_bad_file_replicas([{'scope':scope, 'name':name, 'rse_id':rse_id}], reason="testing declare bad")
I started working on a patch for this in https://github.com/rucio/rucio/commit/d8dc808a48c9b870ec1f7d72d8790e39683af6c8, which is just a policy package plugin to parse PFNs based on the config.
Hi,
I've suspected this functionality was not working in the past, but assumed it had been fixed - perhaps not.
I found a corrupt file on RAL disk and deleted it manually (this was required due to the nature of the corruption). Rucio still thought the file was present, and we do not want to wait one week for the next consistency check to run - this is data being recalled by a user for analysis.
So I wanted to declare the file as bad, and therefore force Rucio to retransfer from RAL Tape. I tried the following command, which completed without a message (or error):
rucio-admin replicas declare-bad --reason 'File corrupted on disk' davs://webdav.echo.stfc.ac.uk:1094/store/data/Run2016D/BTagMu/AOD/21Feb2020_UL2016_HIPM-v1/50000/9B2C3BCF-F64E-354A-8BBC-9956CA3B9F6A.root
But when I check (even the next day) the file still appears as available in Rucio (and had not been re-transferred in the meantime):
Here you see the file is not currently on disk:
How do I force Rucio to see the file as unavailable and therefore attempt a re-transfer from tape?