dmwm / CMSRucio

Temp RSE validation #159

Closed dciangot closed 6 months ago

dciangot commented 4 years ago

I'd like to collect in this ticket all the Temp RSEs that we find broken for user output registration. I think that, by following the example config here (*), at least Stefano and Nick can start trying different RSEs for their workflows.

(*) https://twiki.cern.ch/twiki/bin/view/CMSPublic/RucioUserDocsData

dciangot commented 4 years ago

T2_FR_GRIF_LLR_Temp: RSE does not support provided filename. Details: Invalid prefix: provided /dpm/in2p3.fr/home/cms/trivcat/store/temp expected //dpm/in2p3.fr/home/cms/

dciangot commented 4 years ago

T2_US_Vanderbilt_Temp: RSE does not support provided filename. Details: Invalid hostname: provided gridftp.accre.vanderbilt.edu

dciangot commented 4 years ago

T2_ES_CIEMAT_Temp Details: Invalid prefix: provided '/pnfs/ciemat.es/data/cms/scratch/user', expected '/pnfs/ciemat.es/data/cms/prod/store/temp'

pfn from phedex: srm://srm.ciemat.es:8443/srm/managerv2?SFN=/pnfs/ciemat.es/data/cms/scratch/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_61.root

dciangot commented 4 years ago

T2_DE_DESY_Temp Details: Invalid prefix: provided '/pnfs/desy.de/cms/tier2/temp', expected '/pnfs/desy.de/cms/tier2/store/temp'

pfn from Phedex: srm://dcache-se-cms.desy.de:8443/srm/managerv2?SFN=/pnfs/desy.de/cms/tier2/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_50.root

dciangot commented 4 years ago

T2_UK_London_Brunel_Temp

Details: Invalid prefix: provided '/dpm/brunel.ac.uk/home/cms/store/temp', expected '//dpm/brunel.ac.uk/home/cms/store/temp'

pfn from Phedex: gsiftp://dc2-grid-64.brunel.ac.uk/dpm/brunel.ac.uk/home/cms/store/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_75.root

dciangot commented 4 years ago

T2_UK_SGrid_Bristol_Temp

Details: Invalid prefix: provided '/dpm/phy.bris.ac.uk/home/cms/store/temp', expected '//dpm/phy.bris.ac.uk/home/cms/store/temp'

pfn from Phedex: gsiftp://lcgse01.phy.bris.ac.uk/dpm/phy.bris.ac.uk/home/cms/store/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New5/200821_073500/0000/tree_47.root

dciangot commented 4 years ago

T2_US_MIT_Temp

Details: Invalid prefix: provided '/cms/store/temp', expected '//cms/store/temp'

pfn from Phedex: gsiftp://se01.cmsaf.mit.edu:2811/cms/store/temp/user/dciangot.28635c9960121d1b61eae2ddf9426bf8701e7430/dciangot/NanoRucio/DYJetsToLL_1J_TuneCP5_13TeV-amcatnloFXFX-pythia8/NanoTestPost-New8/200821_095417/0000/tree_84.root

dciangot commented 4 years ago

T1_IT_CNAF_Disk_Temp (used via htcondor overflow)

Details: RSE 'T1_IT_CNAF_Disk_Temp' cannot be found in vo 'def'

dciangot commented 4 years ago

T2_CH_CERN_Temp

rucio.common.exception.RSEFileNameNotSupported: RSE does not support provided filename. Details: Invalid prefix: provided '/eos/cms/store/temp', expected '//eos/cms/store/temp'
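
The errors collected so far fall into a few patterns: several differ only by a doubled leading slash, while others point at a different directory altogether. A minimal sketch for triaging them, assuming the provided/expected prefixes are copied by hand from the messages above (the function name and the category labels are illustrative and not part of Rucio):

```python
import re


def classify_prefix_mismatch(provided: str, expected: str) -> str:
    """Classify how a provided PFN prefix differs from the expected one.

    The labels are illustrative only; they are not Rucio terminology.
    """

    def collapse(path: str) -> str:
        # Squash runs of slashes and drop any trailing slash.
        return re.sub(r"/+", "/", path).rstrip("/")

    if provided == expected:
        return "match"
    p, e = collapse(provided), collapse(expected)
    if p == e:
        return "slash-only"  # e.g. '/eos/...' vs '//eos/...'
    if p.startswith(e + "/") or e.startswith(p + "/"):
        return "nested"  # one prefix is a parent directory of the other
    return "different"  # different directory trees


# Prefix pairs copied from the error messages in this ticket.
REPORTED = {
    "T2_CH_CERN_Temp": ("/eos/cms/store/temp", "//eos/cms/store/temp"),
    "T2_US_MIT_Temp": ("/cms/store/temp", "//cms/store/temp"),
    "T2_DE_DESY_Temp": (
        "/pnfs/desy.de/cms/tier2/temp",
        "/pnfs/desy.de/cms/tier2/store/temp",
    ),
    "T2_FR_GRIF_LLR_Temp": (
        "/dpm/in2p3.fr/home/cms/trivcat/store/temp",
        "//dpm/in2p3.fr/home/cms/",
    ),
}

for rse, (provided, expected) in REPORTED.items():
    print(rse, classify_prefix_mismatch(provided, expected))
```

The "slash-only" cases look like a protocol prefix definition issue on the Rucio side, whereas the "different" ones probably need the site's path mapping to be checked; that split is a guess from the messages, not something confirmed in this ticket.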

nsmith- commented 3 years ago

I came back to this finally and wrote a script to validate uploading and getting back the correct PFN from rucio:

from rucio.client import Client
import gfal2

if __name__ == "__main__":
    client = Client()
    ctx = gfal2.creat_context()
    srcfile = "file:///afs/cern.ch/user/n/ncsmith/nicks.test.file"
    params = ctx.transfer_parameters()
    params.create_parent = True
    params.overwrite = True

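    # "\" is the RSE-expression complement operator: temp RSEs that are
    # not (yet) flagged greedyDeletion=true, i.e. not yet validated.
    rses = sorted(item["rse"] for item in client.list_rses(r"cms_type=temp\greedyDeletion=true"))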
    for rse in rses:
        lfn = "user.ncsmith:/store/temp/user/ncsmith.upload01/nicks.test.file"
        pfn = client.lfns2pfns(rse.replace("_Temp", ""), [lfn])[lfn]
        fileinfo = {
            "scope": "user.ncsmith",
            "name": "/store/user/rucio/ncsmith/test/nicks.test.file.2",
            "bytes": 10485760,
            "adler32": "698c2948",
            "pfn": pfn,
        }
        try:
            basedir = pfn.replace("/ncsmith.upload01/nicks.test.file", "")
            ctx.stat(basedir)
        except gfal2.GError as ex:
            print(f"Missing base dir at {rse}: {basedir}")
            continue
        try:
            ctx.filecopy(params, srcfile, pfn)
        except gfal2.GError as ex:
            print(f"Failed to upload to {rse}\nPFN: {pfn}\nError: {ex.message}")
            continue
        try:
            client.add_replicas(rse, [fileinfo])
        except Exception as ex:
            print(f"Failed to register replica at {rse}\nError: {ex}")
            continue
        rep = next(client.list_replicas([fileinfo], rse_expression=rse))
        if pfn not in rep["pfns"]:
            print(f"For {rse} failed to roundtrip pfn: returned pfns: {rep['pfns']}")
            continue
        print(f"Success at {rse}")

Once an RSE is validated, I set greedyDeletion=True on it (which we want anyway). The remaining temp RSEs can be queried with

$ rucio list-rses --rse "cms_type=temp\greedyDeletion=true"
T3_CH_PSI_Temp
T3_TW_NCU_Temp
T1_US_FNAL_Disk_Temp
T3_US_CMU_Temp
T2_BR_UERJ_Temp

What remains to be checked is that rucio can successfully delete these replicas (i.e. that the production account can delete what the user account creates).

nsmith- commented 3 years ago

For most sites the rucio DN was able to delete the files I made. The following sites do not yet allow it:

T2_BR_UERJ_Temp
T2_UK_London_IC_Temp
T2_IT_Legnaro_Temp
T2_AT_Vienna_Temp
T2_EE_Estonia_Temp
T2_US_Caltech_Temp
T3_IT_Trieste_Temp
T2_DE_DESY_Temp
T2_BE_UCL_Temp
T2_US_Florida_Temp
T2_FI_HIP_Temp

I've opened GGUS tickets for each. For the sites where it works I've set reaper=True for the temp RSEs, which completes their commissioning. So rucio list-rses --rse "cms_type=temp&greedyDeletion=true&reaper=true" should enumerate the functioning sites.
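
For readers less familiar with RSE expressions: `\` is set difference and `&` is intersection over the RSEs matching each `key=value` term. A small sketch of those semantics using plain Python sets, assuming a made-up attribute table (the RSE names and attribute values below are hypothetical and do not reflect the real state of any site):

```python
# Hypothetical attribute table; the real values live in the Rucio server.
RSE_ATTRS = {
    "SITE_A_Temp": {"cms_type": "temp", "greedyDeletion": "true", "reaper": "true"},
    "SITE_B_Temp": {"cms_type": "temp", "greedyDeletion": "true"},
    "SITE_C_Temp": {"cms_type": "temp"},
    "SITE_A": {"cms_type": "real"},
}


def matching(key: str, value: str) -> set:
    """Return the set of RSEs whose attribute `key` equals `value`."""
    return {rse for rse, attrs in RSE_ATTRS.items() if attrs.get(key) == value}


temp = matching("cms_type", "temp")
greedy = matching("greedyDeletion", "true")
reaper = matching("reaper", "true")

# cms_type=temp\greedyDeletion=true  -> temp RSEs still to be validated
not_yet_validated = temp - greedy

# cms_type=temp&greedyDeletion=true&reaper=true  -> fully commissioned
commissioned = temp & greedy & reaper

print(sorted(not_yet_validated))
print(sorted(commissioned))
```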

nsmith- commented 3 years ago

FYI @belforte

belforte commented 3 years ago

Thanks @nsmith-. Of course this is unrelated to the "ASO tries to cleanup" topic which we discussed in MatterMost. But I see that there are quite a few code and operational hints here which we need to import.

nsmith- commented 2 years ago

Another change needed to allow rucio upload to work for users is a permissions patch:

diff --git a/lib/rucio/core/permission/cms.py b/lib/rucio/core/permission/cms.py
index 8f68cf9a9..a2c01051f 100644
--- a/lib/rucio/core/permission/cms.py
+++ b/lib/rucio/core/permission/cms.py
@@ -738,7 +738,11 @@ def perm_update_replicas_states(issuer, kwargs):
     :param kwargs: List of arguments for the action.
     :returns: True if account is allowed, otherwise False
     """
-    return _is_root(issuer) or has_account_attribute(account=issuer, key='admin')
+    is_root = _is_root(issuer)
+    is_temp = str(kwargs.get('rse', '')).endswith('_Temp')
+    is_admin = has_account_attribute(account=issuer, key='admin')
+
+    return is_root or is_temp or is_admin

This is related to #343
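
To make the effect of the patch concrete, here is a standalone sketch of the patched check. `_is_root` and `has_account_attribute` are replaced by simple stand-ins (the real ones query the Rucio database), and the account names are hypothetical:

```python
# Stand-ins for the Rucio helpers; hypothetical account names.
ADMIN_ACCOUNTS = {"some_admin"}


def _is_root(issuer: str) -> bool:
    return issuer == "root"


def has_account_attribute(account: str, key: str) -> bool:
    return key == "admin" and account in ADMIN_ACCOUNTS


def perm_update_replicas_states(issuer: str, kwargs: dict) -> bool:
    """Same logic as the patch: root, admins, or anyone on a *_Temp RSE."""
    is_root = _is_root(issuer)
    is_temp = str(kwargs.get("rse", "")).endswith("_Temp")
    is_admin = has_account_attribute(account=issuer, key="admin")
    return is_root or is_temp or is_admin


# An ordinary user may update replica states on a temp RSE only.
print(perm_update_replicas_states("some_user", {"rse": "T2_CH_CERN_Temp"}))
print(perm_update_replicas_states("some_user", {"rse": "T2_CH_CERN"}))
```

Note that the check keys purely on the RSE name suffix, so any account passes for any RSE whose name ends in `_Temp`.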

belforte commented 2 years ago

I am very glad that this is not a forgotten issue and still gets updates one year after the last one. But it was still dormant for a very long time and it is not assigned to anybody. Is someone at least going to follow up on Nick's code snippet above, make a PR, and merge it? Can the status of the Temp RSEs be verified (again), and the initial point of this ticket be addressed and closed with an "all Temp RSEs work now and ops will follow up on any new problem"?

Sorry if I sound like I am asking busy people to do more work... well, all of us are in that situation, aren't we? But as Lucas Taylor[1] taught me 18 years ago, "if you want something done, ask the busy man" :-)

[1] an Englishman working in a USA institution who did a lot of useful work in CMS Computing back in the early days, like creating the CMS remote control room in Bat 14, and who somehow disappeared at some point, so I am not sure if anybody else who reads this ever knew him.

nsmith- commented 2 years ago

The impasse was that when I went to make the GGUS tickets to the 10 sites where the rucio DN could not delete files created by a user DN in /store/temp/user/... (necessary to be able to clean up after a successful transfer to the destination RSE), some sites said such behavior was either not in policy or not even possible with their storage. Current policy (https://twiki.cern.ch/twiki/bin/viewauth/CMS/DMWMPG_Namespace) states:

| Top level namespace | Description | VOMS role permissions for writing |
|---|---|---|
| /store/temp/user/ | Location for temporary storage of user data at sites that aren't "home" institutes for a user. Files older than 1 week can be deleted automatically (sites are advised to establish a cron'ed clean up) | no special role necessary |

I suppose we could rely on site cleanup and make rucio assume that, if deletion fails for a temp RSE due to a permission error, it does not matter since the sites will clean up week-old files automatically (quite the hack). The other option is to work with sites to implement a policy that allows the rucio DN to delete files in this area. Perhaps @stlammel would have some idea.

If this issue is resolved one way or another, then there still is a need to check the LFN-to-PFN situation at the sites $ rucio list-rses --rse "cms_type=temp\greedyDeletion=true", after which the temp RSEs can be considered ready for use by CRAB or rucio upload.

nsmith- commented 2 years ago

An additional consideration is to make sure users can't hijack the temp RSEs with long-lasting rules.

stlammel commented 2 years ago

If I understand correctly this is a temporary problem, while we write CRAB temporary output via x509/Macaroon, right? (Once we write the output via IAM-issued tokens, the experiment owns the data and Rucio can clean things up just fine.) Relying on the site cleanup, as we currently do in the interim, sounds fine to me.

belforte commented 2 years ago

Yes, it makes sense to wait for the move to tokens to review the who-can-write-delete thing, though I presume that it will be quite some time before all sites' storage servers understand tokens. We certainly need to update policies.

But I see files in /store/temp/user written 5 months ago. It is possible that some sites only clean when needed, and it is possible that the current CRAB attempt to clean up is largely successful (since it is done with the user X509 credential), so we leak very little disk space and some sites may just ignore it. So IMHO we need to worry about cleaning up with the current authorization setup. All in all, the CRAB do-ASO-via-Rucio script currently runs with user X509 authentication, so it could certainly keep issuing a gfal-rm with more or less the same success rate as now. Sounds like a good hack for the next few years. @dciangot what do you think?

belforte commented 2 years ago

I do not understand this part. Do I need to ?

If this issue is resolved one way or another, then there still is a need to check the LFN-to-PFN situation at the sites $ rucio list-rses --rse "cms_type=temp\greedyDeletion=true", after which the temp RSEs can be considered ready for use by CRAB or rucio upload.

belforte commented 2 years ago

An additional consideration is to make sure users can't hijack the temp RSEs with long-lasting rules.

?? Do we make/need/allow rules on TEMP RSE's ? CRAB's ASO-via-Rucio will only create rules targeting the local Rucio RSE at the site where the user has a quota.

stlammel commented 2 years ago

So, 5-month-old files surprise me a bit. We tell sites not to remove files younger than two weeks but not when to clean up. I would assume most sites kept their old unmerged-policy cleanup with a >6 week grace period.

belforte commented 2 years ago

So, 5-month-old files surprise me a bit. We tell sites not to remove files younger than two weeks but not when to clean up. I would assume most sites kept their old unmerged-policy cleanup with a >6 week grace period.

* Stephan

Did I confuse "Rucio says there's a replica of this old file" with "the file is actually there" ? well.. yes, but here's also a real file written on May 2nd and apparently still there.

belforte@lxplus8s05/~> gfal-ls -l  davs://sbgse1.in2p3.fr:443/dpm/in2p3.fr/home/cms/phedex/store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/belforte/prova/GenericTTbar/Stefano-Test-220406/220502_162549/0000/kk_1.root
-rw-rw----   0 3916  103      561301 May  2 18:29 davs://sbgse1.in2p3.fr:443/dpm/in2p3.fr/home/cms/phedex/store/temp/user/belforte.be1f4dc5be8664cbd145bf008f5399adf42b086f/belforte/prova/GenericTTbar/Stefano-Test-220406/220502_162549/0000/kk_1.root    
belforte@lxplus8s05/~> 

I presume that storage consistency checks do not cover temp RSE's, which means that if we add a gfal-rm to CRAB's RUCIO_transfer.py we should also add some call to rucio to remove that replica. I will sort this out with Diego.
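
A rough sketch of that two-step cleanup, with both operations injected as callables so the ordering is explicit. The function and parameter names are invented for illustration; in practice `unlink` might wrap a gfal2 context's `unlink` and `unregister` might wrap the Rucio client's `delete_replicas`, but that wiring is an assumption and is not shown here.

```python
from typing import Callable, Dict, List


def cleanup_temp_file(
    rse: str,
    scope: str,
    name: str,
    pfn: str,
    unlink: Callable[[str], None],
    unregister: Callable[[str, List[Dict[str, str]]], None],
) -> None:
    """Remove the physical file, then drop its replica from the catalogue.

    If the file is already gone the catalogue entry is still removed, so
    Rucio does not keep advertising a replica that no longer exists. Any
    other storage error propagates and the replica stays registered.
    """
    try:
        unlink(pfn)
    except FileNotFoundError:
        pass  # already deleted, e.g. by the site's own cleanup
    unregister(rse, [{"scope": scope, "name": name}])


# Demonstration with fake callables that only record what was asked.
calls = []
cleanup_temp_file(
    "T2_XX_Example_Temp",
    "user.someone",
    "/store/user/rucio/someone/test/file.root",
    "davs://example.invalid/store/temp/user/someone.hash/file.root",
    unlink=lambda pfn: calls.append(("unlink", pfn)),
    unregister=lambda rse, files: calls.append(("unregister", rse, files)),
)
print(calls)
```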

dciangot commented 2 years ago

I agree with you @belforte

belforte commented 2 years ago

push for changing RUCIO policies for user scopes

Yes ! Do we track that somewhere already ? Is Eric convinced ? Martin ?

dciangot commented 2 years ago

This one: https://github.com/dmwm/CMSRucio/issues/297, which at this point does not involve only declaring bad replicas.

ericvaandering commented 2 years ago

I want to start a review of user and site admin permissions. Can you open issues in dmwm/CMSRucio for the replica issue and whatever else you want for user permissions so we can collect everything?

belforte commented 2 years ago

Hi Eric, that'd be https://github.com/dmwm/CMSRucio/issues/297 . But I agree that there's too much discussion in there, even if in the end simple conclusions were reached. Shall we make a new, terse, one ?

ericvaandering commented 2 years ago

Concise comment on that one will do. 😄

dynamic-entropy commented 6 months ago

Hello @dciangot @belforte Is this still relevant?

belforte commented 6 months ago

AFAIK all Temp RSEs work. Let's close this and open ad-hoc issues in case we find specific problems.

dynamic-entropy commented 6 months ago

Thank you Stefano