Closed dciangot closed 6 months ago
T2_FR_GRIF_LLR_Temp:
RSE does not support provided filename.\nDetails: Invalid prefix: provided /dpm/in2p3.fr/home/cms/trivcat/store/temp expected //dpm/in2p3.fr/home/cms/
T2_US_Vanderbilt_Temp:
RSE does not support provided filename.\nDetails: Invalid hostname: provided gridftp.accre.vanderbilt.edu
T2_ES_CIEMAT_Temp
Details: Invalid prefix: provided '/pnfs/ciemat.es/data/cms/scratch/user', expected '/pnfs/ciemat.es/data/cms/prod/store/temp'
pfn from phedex: srm://srm.ciemat.es:8443/srm/managerv2?SFN=/pnfs/ciemat.es/data/cms/scratch/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_61.root
T2_DE_DESY_Temp
Details: Invalid prefix: provided '/pnfs/desy.de/cms/tier2/temp', expected '/pnfs/desy.de/cms/tier2/store/temp'
pfn from Phedex: srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_50.root
T2_UK_London_Brunel_Temp
Details: Invalid prefix: provided '/dpm/brunel.ac.uk/home/cms/store/temp', expected '//dpm/brunel.ac.uk/home/cms/store/temp'
pfn from Phedex: gsiftp://dc2-grid-64.brunel.ac.uk/dpm/brunel.ac.uk/home/cms/store/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_75.root
T2_UK_SGrid_Bristol_Temp
Details: Invalid prefix: provided '/dpm/phy.bris.ac.uk/home/cms/store/temp', expected '//dpm/phy.bris.ac.uk/home/cms/store/temp'
pfn from Phedex: gsiftp://lcgse01.phy.bris.ac.uk/dpm/phy.bris.ac.uk/home/cms/store/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_47.root
T2_US_MIT_Temp
Details: Invalid prefix: provided '/cms/store/temp', expected '//cms/store/temp'
pfn from Phedex: gsiftp://se01.cmsaf.mit.edu:2811/cms/store/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New8/200821_095417/0000/tree_84.root
T1_IT_CNAF_Disk_Temp (used via htcondor overflow)
Details: RSE 'T1_IT_CNAF_Disk_Temp' cannot be found in vo 'def'
T2_CH_CERN_Temp
rucio.common.exception.RSEFileNameNotSupported: RSE does not support provided filename. Details: Invalid prefix: provided '/eos/cms/store/temp', expected '//eos/cms/store/temp'
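Most of the failures above fall into two groups: the provided path lacks the doubled leading slash that the RSE prefix expects, or the path sits under a different directory altogether. A minimal sketch for sorting a PFN into one of those groups is below. It is only an illustration and not Rucio's own validation code; the helper names `storage_path` and `check_prefix` are made up here, and the shortened file names in the usage example are placeholders.

```python
from urllib.parse import parse_qs, urlsplit


def storage_path(pfn: str) -> str:
    """Return the storage path of a PFN.

    SRM URLs carry the path in the ?SFN= query parameter; other
    protocols (gsiftp, davs, ...) carry it in the URL path itself.
    """
    parts = urlsplit(pfn)
    sfn = parse_qs(parts.query).get("SFN")
    return sfn[0] if sfn else parts.path


def check_prefix(pfn: str, expected_prefix: str) -> str:
    """Classify how the PFN path relates to the prefix the RSE expects."""
    path = storage_path(pfn)
    if path.startswith(expected_prefix):
        return "ok"
    # Same path, but the RSE prefix is configured with a doubled leading slash.
    if ("/" + path).startswith(expected_prefix):
        return "missing leading slash"
    return "different prefix"
```

For example, `check_prefix("gsiftp://se01.cmsaf.mit.edu:2811/cms/store/temp/user/x.root", "//cms/store/temp")` reports a missing leading slash, while the CIEMAT case (`scratch/user` versus `prod/store/temp`) is reported as a different prefix.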
I came back to this finally and wrote a script to validate uploading and getting back the correct PFN from rucio:
from rucio.client import Client
import gfal2

if __name__ == "__main__":
    client = Client()
    ctx = gfal2.creat_context()
    srcfile = "file:///afs/cern.ch/user/n/ncsmith/nicks.test.file"
    params = ctx.transfer_parameters()
    params.create_parent = True
    params.overwrite = True

    # Temp RSEs that do not have greedyDeletion=true yet, i.e. not yet validated
    rses = sorted(item["rse"] for item in client.list_rses(r"cms_type=temp\greedyDeletion=true"))
    for rse in rses:
        lfn = "user.ncsmith:/store/temp/user/ncsmith.upload01/nicks.test.file"
        pfn = client.lfns2pfns(rse.replace("_Temp", ""), [lfn])[lfn]
        fileinfo = {
            "scope": "user.ncsmith",
            "name": "/store/user/rucio/ncsmith/test/nicks.test.file.2",
            "bytes": 10485760,
            "adler32": "698c2948",
            "pfn": pfn,
        }

        # 1. The base directory must already exist at the site
        try:
            basedir = pfn.replace("/ncsmith.upload01/nicks.test.file", "")
            ctx.stat(basedir)
        except gfal2.GError as ex:
            print(f"Missing base dir at {rse}: {basedir}")
            continue

        # 2. Upload the test file with the user credential
        try:
            ctx.filecopy(params, srcfile, pfn)
        except gfal2.GError as ex:
            print(f"Failed to upload to {rse}\nPFN: {pfn}\nError: {ex.message}")
            continue

        # 3. Register the replica at the temp RSE
        try:
            client.add_replicas(rse, [fileinfo])
        except Exception as ex:
            print(f"Failed to register replica at {rse}\nError: {ex}")
            continue

        # 4. Rucio must hand back the same PFN that was uploaded to
        rep = next(client.list_replicas([fileinfo], rse_expression=rse))
        if pfn not in rep["pfns"]:
            print(f"For {rse} failed to roundtrip pfn: returned pfns: {rep['pfns']}")
            continue
        print(f"Success at {rse}")
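The script hard-codes `bytes` and `adler32` for one known test file. To reuse it with another file, those two values could be computed locally, roughly as sketched here with the standard library only. The function name `size_and_adler32` is hypothetical, and the sketch assumes the checksum should be written as an 8-character lowercase hex string, which is the form the hard-coded value above takes.

```python
import zlib


def size_and_adler32(path, chunk_size=1024 * 1024):
    """Return (size in bytes, adler32 as 8 hex digits) for a local file."""
    checksum = 1  # Adler-32 is defined to start from 1
    size = 0
    with open(path, "rb") as handle:
        # Read in chunks so that large files are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            checksum = zlib.adler32(chunk, checksum)
            size += len(chunk)
    return size, f"{checksum & 0xFFFFFFFF:08x}"
```

The two return values would then replace the literal `"bytes"` and `"adler32"` entries in `fileinfo`.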
Once validated, I set greedyDeletion=True (which we want anyway). The remaining temp RSEs can be queried with
$ rucio list-rses --rse "cms_type=temp\greedyDeletion=true"
T3_CH_PSI_Temp
T3_TW_NCU_Temp
T1_US_FNAL_Disk_Temp
T3_US_CMU_Temp
T2_BR_UERJ_Temp
What remains to be checked is that rucio can successfully delete these replicas. (i.e. the production account can delete what the user account creates)
For most sites rucio DN was able to delete the files I made. The following do not yet have it:
T2_BR_UERJ_Temp
T2_UK_London_IC_Temp
T2_IT_Legnaro_Temp
T2_AT_Vienna_Temp
T2_EE_Estonia_Temp
T2_US_Caltech_Temp
T3_IT_Trieste_Temp
T2_DE_DESY_Temp
T2_BE_UCL_Temp
T2_US_Florida_Temp
T2_FI_HIP_Temp
I've opened GGUS tickets for each. For the sites where it works I've set reaper=True for the temp RSEs, which completes their commissioning. So rucio list-rses --rse "cms_type=temp&greedyDeletion=true&reaper=true" should enumerate the functioning sites.
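The same bookkeeping can be done from Python. The sketch below lists temp RSEs that passed the upload validation (greedyDeletion=true) but are not yet fully commissioned (no reaper=true). It assumes a client object whose `list_rses` takes an RSE expression as its first argument and yields dicts with an `"rse"` key, as in the validation script above. The function name is made up, and the difference is taken in Python rather than in the expression to avoid relying on operator precedence.

```python
def temp_rses_awaiting_reaper(client):
    """Temp RSEs with greedyDeletion=true that do not have reaper=true yet."""

    def names(expression):
        return {item["rse"] for item in client.list_rses(expression)}

    validated = names("cms_type=temp&greedyDeletion=true")
    commissioned = names("cms_type=temp&greedyDeletion=true&reaper=true")
    return sorted(validated - commissioned)
```

With a real setup this would be called as `temp_rses_awaiting_reaper(Client())` after `from rucio.client import Client`.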
FYI @belforte
Thanks @nsmith-. Of course this is unrelated to the "ASO tries to cleanup" which we discussed in MatterMost. But I see that there's quite some code and operational hints here which we need to import.
Another change needed to allow rucio upload to work for users is a permissions patch:
diff --git a/lib/rucio/core/permission/cms.py b/lib/rucio/core/permission/cms.py
index 8f68cf9a9..a2c01051f 100644
--- a/lib/rucio/core/permission/cms.py
+++ b/lib/rucio/core/permission/cms.py
@@ -738,7 +738,11 @@ def perm_update_replicas_states(issuer, kwargs):
     :param kwargs: List of arguments for the action.
     :returns: True if account is allowed, otherwise False
     """
-    return _is_root(issuer) or has_account_attribute(account=issuer, key='admin')
+    is_root = _is_root(issuer)
+    is_temp = str(kwargs.get('rse', '')).endswith('_Temp')
+    is_admin = has_account_attribute(account=issuer, key='admin')
+
+    return is_root or is_temp or is_admin
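To make the effect of the patch concrete, here is a standalone sketch of the same decision with the two Rucio helpers replaced by plain callables. `is_root` and `is_admin` here are stand-ins supplied by the caller, not the real `_is_root` and `has_account_attribute`, and the function name is invented for the example.

```python
def may_update_replica_states(issuer, kwargs, is_root, is_admin):
    """Mirror the patched check: root, admin, or any issuer on a *_Temp RSE."""
    # The decision rests purely on the RSE name suffix, as in the patch.
    is_temp = str(kwargs.get("rse", "")).endswith("_Temp")
    return is_root(issuer) or is_temp or is_admin(issuer)


def nobody(_account):
    """Stand-in helper: treats every account as unprivileged."""
    return False
```

With these stand-ins, an ordinary account is allowed for `{"rse": "T2_CH_CERN_Temp"}` and refused for `{"rse": "T2_CH_CERN"}`, which is the intended change: users may update replica states only on temp RSEs.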
This is related to #343
I am very glad that this is not a forgotten issue and still gets updates one year after the last time. But it was still dormant for very long and it is not assigned to anybody. Is someone at least going to follow up on Nick's code snippet above and make a PR and merge it? Can the status of Temp RSEs be verified (again) and the initial point of this ticket be addressed and closed with an "all Temp RSEs work now and ops will follow up on any new problem"?
Sorry if I sound like asking busy people to do more work.... well, all of us are in that situation, aren't we? But as Lucas Taylor[1] taught me 18 years ago "if you want something done, ask the busy man" :-)
[1] an Englishman working in a US institution who did a lot of useful work in CMS Computing back in the early days, like creating the CMS remote control room in Bat 14, and somehow disappeared at some point, so I am not sure if anybody else who reads this ever knew him.
The impasse was that when I went to make the GGUS tickets to the 10 sites where the rucio DN could not delete files created by a user DN in /store/temp/user/... (necessary to be able to clean up after a successful transfer to the destination RSE), some sites said such behavior was either not in policy or not even possible with their storage.
Current policy (https://twiki.cern.ch/twiki/bin/viewauth/CMS/DMWMPG_Namespace) states:
Top level namespace | Description | VOMS role permissions for writing |
---|---|---|
/store/temp/user/ | Location for temporary storage of user data at sites that aren't "home" institutes for a user. Files older than 1 week can be deleted automatically (sites are advised to establish a cron'ed clean up) | no special role necessary |
I suppose we could rely on site cleanup and make rucio assume that if deletion fails for the temp RSE due to permission error that it does not matter since the sites will cleanup 1w old files automatically (quite the hack). The other option is to work with sites to implement a policy that allows rucio DN to delete these files in this area. Perhaps @stlammel would have some idea.
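For the first option, the "cron'ed clean up" in the policy amounts to selecting files by age. A minimal sketch is below, assuming the /store/temp/user area is visible as a POSIX directory tree, which will not hold for every storage technology. The function name and the 7-day default are illustrative, and the sketch only lists candidates rather than deleting anything.

```python
import os
import time


def files_older_than(root, max_age_days=7, now=None):
    """Return files under root whose modification time is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.stat(path).st_mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

A site would then feed the returned paths to whatever deletion tool its storage system provides.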
If this issue is resolved one way or another, then there still is a need to check the LFN-to-PFN situation at the sites $ rucio list-rses --rse "cms_type=temp\greedyDeletion=true", after which the temp RSEs can be considered ready for use by CRAB or rucio upload.
An additional consideration is to make sure users can't hijack the temp RSEs with long-lasting rules.
If I understand correctly this is a temporary problem, while we write CRAB temporary output via x509/Macaroon, right? (Once we write the output via IAM-issued tokens, the experiment owns the data and Rucio can clean things up just fine.) Relying on the site cleanup, as we do currently in the interim, sounds fine to me.
Yes, it makes sense to wait for the move to tokens to review the who-can-write-delete thing, though I presume that it will be quite some time before all sites' storage servers understand tokens. We certainly need to update policies.
But I see files in /store/temp/user written 5 months ago.. it is possible that some sites only clean when needed, and it is possible that the current CRAB attempt to clean up is largely successful (since it is done with the user X509 credential), so we leak very little disk space and some sites may just ignore it. So IMHO we need to worry about cleaning up with the current authorization setup. All in all, the CRAB do-ASO-via-Rucio script currently runs with user X509 authentication, so it could certainly keep issuing a gfal-rm with more or less the same success rate as now. Sounds like a good hack for the next few years. @dciangot what do you think?
I do not understand this part. Do I need to ?
If this issue is resolved one way or another, then there still is a need to check the LFN-to-PFN situation at the sites $ rucio list-rses --rse "cms_type=temp\greedyDeletion=true", after which the temp RSEs can be considered ready for use by CRAB or rucio upload.
An additional consideration is to make sure users can't hijack the temp RSEs with long-lasting rules.
?? Do we make/need/allow rules on TEMP RSE's ? CRAB's ASO-via-Rucio will only create rules targeting the local Rucio RSE at the site where the user has a quota.
So, 5-month-old files surprise me a bit. We tell sites not to remove files younger than two weeks but not when to clean up. I would assume most sites kept their old unmerged-policy cleanup with a >6 week grace period.
* Stephan
Did I confuse "Rucio says there's a replica of this old file" with "the file is actually there" ? well.. yes, but here's also a real file written on May 2nd and apparently still there.
belforte@lxplus8s05/~> gfal-ls -l davs://sbgse1.in2p3.fr:443/dpm/in2p3.fr/home/cms/phedex/store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/belforte/prova/GenericTTbar/Stefano-Test-220406/220502_162549/0000/kk_1.root
-rw-rw---- 0 3916 103 561301 May 2 18:29 davs://sbgse1.in2p3.fr:443/dpm/in2p3.fr/home/cms/phedex/store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/belforte/prova/GenericTTbar/Stefano-Test-220406/220502_162549/0000/kk_1.root
belforte@lxplus8s05/~>
I presume that storage consistency checks do not cover temp RSE's, which means that if we add a gfal-rm to CRAB's RUCIO_transfer.py we should also add some call to rucio to remove that replica.
I will sort this out with Diego.
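A rough sketch of that pairing, storage deletion followed by catalogue cleanup, is given below. It is not the actual RUCIO_transfer.py code. It assumes `ctx` is a gfal2 context (whose `unlink` removes a file by URL) and `client` is a Rucio client (whose `delete_replicas` takes an RSE name and a list of scope/name dicts); whether a user account is permitted to call `delete_replicas` on a temp RSE is exactly the kind of permission question discussed in this thread. The function name is hypothetical.

```python
def remove_temp_file(ctx, client, rse, scope, name, pfn):
    """Delete a file from temp storage, then drop its replica from Rucio.

    Returns a short status string instead of raising, so that a caller
    looping over many files can keep going.
    """
    try:
        ctx.unlink(pfn)
    except Exception as ex:
        # gfal2 raises gfal2.GError here; the generic catch keeps this
        # sketch free of a hard gfal2 dependency.
        return f"storage deletion failed: {ex}"
    # Only touch the catalogue once the file is really gone from storage.
    client.delete_replicas(rse, [{"scope": scope, "name": name}])
    return "removed"
```

Leaving the replica registered when the storage deletion fails keeps Rucio's view consistent with what is actually on disk.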
I agree with you @belforte
push for changing RUCIO policies for user scopes
Yes ! Do we track that somewhere already ? Is Eric convinced ? Martin ?
This one: https://github.com/dmwm/CMSRucio/issues/297, which at this point does not involve only declaring bad replicas.
I want to start a review of user and site admin permissions. Can you open issues in dmwm/CMSRucio for the replica issue and whatever else you want for user permissions so we can collect everything?
Hi Eric, that'd be https://github.com/dmwm/CMSRucio/issues/297 . But I agree that there's too much discussion in there, even if in the end simple conclusions were reached. Shall we make a new, terse, one ?
Concise comment on that one will do. 😄
Hello @dciangot @belforte Is this still relevant?
AFAIK all Temp RSE's work. Let's close this and open ad-hoc issues in case we find specific problems.
Thank you, Stefano
I'd like to collect in this ticket all the Temp RSEs that we find broken for user output registration. I think that, following the example config here(*), at least Stefano and Nick can start trying different RSEs for their workflows.
(*) https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsData