Closed belforte closed 3 years ago
@ericvaandering stupid question. The Rucio rule ID is ultimately a string, correct? I ask because we'll want to store it in the DB and check it periodically, so we can release the user task when the data is on disk (as was done for Dynamo). But the DB column used for storing the Dynamo requestId takes an integer, so I suspect we can't mix.
It’s a hex-string. But I doubt that helps. e6a8a421c59f455c81369a153e9488cf is one being staged now.
could convert hex to base10 int, but I suspect it will be too large ! will think of something.. at worst, we have the dataset name in the DB and can query by that. Or maybe our friendly DBA can remove old data from the column and change the data type to string.
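To put a number on the "too large" suspicion, here is a minimal sketch. It assumes the existing column is an Oracle NUMBER(38), as stated later in this thread, i.e. at most 38 decimal digits; the helper name is made up for illustration.

```python
# Assumption: the DB column is Oracle NUMBER(38), i.e. at most 38 decimal digits.
MAX_NUMBER_38 = 10**38 - 1


def fits_in_number38(rule_id):
    """Return True if the hex rule ID, read as an integer, fits in NUMBER(38)."""
    return int(rule_id, 16) <= MAX_NUMBER_38


# A 32-character hex string can encode values up to 16**32 - 1.
print(len(str(16**32 - 1)))  # number of decimal digits of the largest value
print(fits_in_number38("e6a8a421c59f455c81369a153e9488cf"))
```

Some rule IDs would fit and others would not, so a numeric column is not a safe home for them.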
@ericvaandering @nsmith- I am starting to work on automatic tape recall via Rucio. following example in https://github.com/dmwm/CMSRucio/blob/master/DMOps/StageDatasetForUser.py
If you see rule requests from belforte or from crab_server (I will stick to ask_approval=True), ignore them. We will worry about policy once the code is working. I do not expect to be able to authenticate as user crab_tape_recall atm.
If I find that I can't progress w/o actual requests being approved, I'll let you know.
Thing is we don’t approve rules, site managers do. So maybe use a friendly site?
I hadn't noticed that a site is needed. No problem for testing, I hope I have a few friends, but when doing for real I do not think CRAB is in a good position to pick destination site(s).
Well, when doing it for real there will be no approval. The current thing does a scatter across all good sites and I suggest keeping that.
I assumed you were using approvals to make sure no data was recalled. In fact, a rule with approval requires just one RSE as otherwise it makes no sense. One person could approve a rule which wrote to another person’s site.
thanks @ericvaandering, indeed I figured that out when using your RSE_EXPRESSION and getting:
Error: Provided replication rule is considered invalid. Details: Ask approval is not allowed for rules with multiple RSEs
so I tried RSE_EXPRESSION = 'T3_IT_Trieste' and several variations inspired by https://rucio.readthedocs.io/en/latest/rse_expressions.html but none worked. What shall I put there?
Well... I can probably have fun writing rules targeted at T3_IT_Trieste, but... what's the point? Yes, I need some way to check all steps without actually triggering tape retrieval. I need to see what one realistic rule looks like in order to figure out what to report to the user and how to track it in the automatic machinery, and what exceptions look like. We need to define whether all requests are accepted or there should be limits (in Rucio? in CRAB?). Maybe use a Rucio test instance to post fake requests? First decision point: we used to pass a list of blocks to Dynamo. Do you still want that, or a full dataset name?
Maybe I need some special quota when testing with Trieste ? I have some disk quota there, yet:
Error: There are not enough target RSEs to fulfil the request at this time.
Details: Target RSE set not sufficient for number of copies. (1 copies requested, RSE set size 0)
OK, lots of small issues.
The account which we are using to script this does not need quota and that should be the account you use in the future.
Approval will not be needed in the end. Maybe you want to test with it now
RSE expression will be as in the code above, spreading the data across all Tier2s equitably.
You may need to include rse= in your expression:
-bash-4.2$ rucio list-rses --expression rse=T3_IT_Trieste
T3_IT_Trieste
thanks Eric, I sorted out the RSE expression thing, but I still fail to make the rule with T3_IT_Trieste, I don't know why. As you said... many small things. In the end I do not have a clear way to test and learn by trying, and not a detailed enough specification to commit code to production and turn it on. I do not know how to progress...
OK, this kind of works. I had to change some things from your example, e.g. set weight=None.
I use Alan's WMCore wrapping, sort of out of politeness, but since all it does is convert a list of names into a list of DIDs... basically it saves one line. I am quite tempted to stop using the WMCore wrapper everywhere in CRAB, but that's a different topic.
# createReplicationRule(self, names, rseExpression, scope='cms', copies=1, **kwargs):
rules = rucioClient.createReplicationRule(blocks, 'T3_IT_Trieste', scope='cms', copies=1,
                                          weight=None, lifetime=DAYS, account='belforte',
                                          activity='Analysis Input',
                                          comment='Staged from tape for %s' % username,
                                          ask_approval=True, asynchronous=True,
                                          )
Which created 4 rules, one per block! I can see the rationale for 4 rules instead of one, so the question for @ericvaandering and @nsmith- is
Side note.
I will store the rule in the CRAB DB so that we can automatically check completion and then release submissions. Given that it is a string like 1a31bb9828a34657a34d72258c6e5173, I will store a VARCHAR, not an INTEGER, but I still need to set a maximum length. The one above is 33 chars. What more should I be ready to accept? Can the rule be 400 chars?
I think its ok to create a rule per block when the whole dataset is not desired. If the whole dataset is to be recalled, maybe an optimization is to make just one rule on the container DID. The rule id is a 32 char hex string always, including the one you pasted:
>>> len("1a31bb9828a34657a34d72258c6e5173")
32
thanks @nsmith- I will try to make a container with those blocks, so there's only one rule to track. thanks for fixing my counting a LF as part of the string !
echo 1a31bb9828a34657a34d72258c6e5173|wc
1 1 33
vs.
echo -n "1a31bb9828a34657a34d72258c6e5173"|wc
0 1 32
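The same check can be done in Python before writing a rule ID to the DB. This is a minimal sketch that assumes rule IDs are always 32 lowercase hex characters, as stated above; the function name is hypothetical.

```python
import re

# Assumption: rule IDs are always exactly 32 lowercase hex characters.
RULE_ID_RE = re.compile(r"[0-9a-f]{32}")


def is_rule_id(text):
    """Return True if text is exactly 32 lowercase hex characters."""
    # fullmatch (unlike match with '$') does not tolerate a trailing newline.
    return RULE_ID_RE.fullmatch(text) is not None


rule_id = "1a31bb9828a34657a34d72258c6e5173"
print(is_rule_id(rule_id))         # the bare ID
print(is_rule_id(rule_id + "\n"))  # the same ID with a trailing line feed
```

Using `fullmatch` rejects the 33-character variant that `echo` without `-n` produces.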
hm I would prefer not to create new container DIDs but just making multiple rules. This is how WMCore does it, is that possible?
creating is possible. I have to find some way to track them in association to a given task. e.g. CRAB could print:
dear user, a data recall request was created for you, monitor it via
rucio list-rule 1a31bb9828a34657a34d72258c6e5173 (or whatever)
and submit again once data are on disk
but if I print a list of 40 hex strings.. few people will be happy !
Is one rule per block really an advantage ? Is that so that WMA can check as single blocks are available and start processing them ? That's too much to ask of CRAB.
I guess I can turn the existing DB column from NUMBER(38) to CLOB, rather than VARCHAR(32), and then store the list of rules. As Oracle says:
A CLOB (character large object) value can be up to 2,147,483,647 characters long.
But I am not going to try to manage them individually.
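For illustration only, a list of rule IDs could be packed into one text value for such a column as sketched below. JSON is an assumption here, not something CRAB is stated to use, and both function names are hypothetical.

```python
import json


def pack_rule_ids(rule_ids):
    """Serialize a list of rule IDs into one string suitable for a text column."""
    return json.dumps(list(rule_ids))


def unpack_rule_ids(text):
    """Inverse of pack_rule_ids: recover the list of rule IDs."""
    return json.loads(text)


ids = ["1a31bb9828a34657a34d72258c6e5173", "e6a8a421c59f455c81369a153e9488cf"]
stored = pack_rule_ids(ids)
print(len(stored))
print(unpack_rule_ids(stored) == ids)
```

Each ID costs 32 characters plus a few for quoting and separators, so even 40 rules stay far below the CLOB limit quoted above.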
My concern is that making a DID (under the user scope? definitely not cms scope) each crab job may get taxing, and the DID is thrown away afterwards. It is unique even after deletion so you have to make up a new name each time.
It is one per task, not per job. Names are cheap, as FKW used to say, and maybe I make them in the crab_server account scope, so I do not need to create another Rucio client with user credentials and it gets easy to track how much we stage. What else could be a problem? Overlapping requests from multiple users? Conflicts with existing rules? Rule duplication? (I see Alan has code to deal with duplicated rules... dunno why.) What could go wrong? Single rule, multiple rules... OK, a few more lines of code, but I can do everything.
Ok, under the crab account scope seems reasonable. Then we leave the container DID around? I think there is no harm in that, plus I think we need to carefully see what happens if we did delete the container DID (does it cascade? that would be very bad in this context!). What would the container DID name be? It could be nice to embed the task name. Then a user could even do rucio list-rules crab_server:/Input/task_name/USER or whatever, instead of keeping track of the hex rule ID.
we need to start exploring with containers anyhow... may even be fun. Are there naming constraints ? If we need names which are "DBS compatible" I have to check what CRAB and DBS allow now. And likely need to replace ':' in current task names. If this DID will never make it to DBS (and it shouldn't) your name is great.
Rucio constraint is https://github.com/ericvaandering/rucio/blob/cms_nano2/lib/rucio/common/schema/cms.py#L67
copied here: r'/[a-zA-Z0-9\-_]{1,99}/[a-zA-Z0-9\.\-_]{1,199}/[A-Z\-]{1,50}'
I don't think this should ever make it to DBS.
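A minimal sketch of embedding the task name in a container name and checking it against the pattern copied above. The `/TapeRecall/.../USER` layout and the replacement of ':' by '.' follow the name used later in this thread; the input task-name format with a colon, the function name, and the use of a full-string match are assumptions.

```python
import re

# Pattern copied from the Rucio CMS schema quoted above.
CONTAINER_RE = re.compile(
    r'/[a-zA-Z0-9\-_]{1,99}/[a-zA-Z0-9\.\-_]{1,199}/[A-Z\-]{1,50}')


def container_name_for_task(task_name):
    """Build a container name embedding the task name (hypothetical helper)."""
    safe = task_name.replace(':', '.')  # ':' is not allowed by the pattern
    name = '/TapeRecall/%s/USER' % safe
    # Assumption: require the whole name to match, which is at least as strict
    # as whatever matching mode the schema validation applies.
    if CONTAINER_RE.fullmatch(name) is None:
        raise ValueError('invalid container name: %s' % name)
    return name


print(container_name_for_task('201120_131722:belforte_crab_20201120_141717'))
```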
looks suspiciously similar to WMCore/Lexicon.py :-)
PRIMARY_DS = {'re': '^[a-zA-Z][a-zA-Z0-9\-_]*$', 'maxLength': 99}
PROCESSED_DS = {'re': '[a-zA-Z0-9\.\-_]+', 'maxLength': 199}
TIER = {'re': '[A-Z\-_]+', 'maxLength': 99}
other than allowing DIDs to start with entertaining strings like /_____-----__/ (an oversight?)
And no colon ':' allowed (no idea if there's a technical reason, breaks SQL ? or it was just a whim, I wasn't part of that).
well.. two dots..one dot... anything goes. Thanks.
Container names follow the lexicon for CMS datasets. No accident there. Colon is probably not possible since it’s used in rucio as a delimiter between scope and name.
And we will have to make a scope for CRAB. I’d suggest “crab”. If you encode the task name in the container, that’d be helpful, probably.
Deleting a container, should we ever need to, should have no impact.
Sounds like a plan. Onward!
can't test with scope 'crab' due to https://github.com/ericvaandering/rucio/blob/0e7df0d1f489302fe011dcef28120db83ff2b2ad/lib/rucio/common/schema/cms.py#L59
Details: Problem validating did : u'crab' does not match '^(cms)|(user\\.[a-z0-9-_]{1,20})$'
But I think that I can use scope='user.crab_server', which is the account used by CRAB TaskWorkers.
So far I am sticking to user.belforte and rseexpression=T3_IT_Trieste as playground.
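The scope check can be reproduced locally with the pattern from the error message (the doubled backslash there is just repr() escaping). This sketch assumes the schema validation behaves like `re.search`, as a JSON Schema "pattern" does; the function name is hypothetical.

```python
import re

# Pattern taken from the error message above, with the repr() escaping undone.
SCOPE_PATTERN = r'^(cms)|(user\.[a-z0-9-_]{1,20})$'


def scope_is_valid(scope):
    """Return True if the scope passes the pattern (assumed re.search semantics)."""
    return re.search(SCOPE_PATTERN, scope) is not None


for scope in ('crab', 'user.crab_server', 'user.belforte'):
    print(scope, scope_is_valid(scope))
```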
hmm Eric, Nick, what's the python equivalent of
rucio attach user.belforte:/TapeRecall/201120_131722.belforte_crab_20201120_141717/USER cms:/MuonEG/Run2016B-v1/RAW#86bc5e3e-1519-11e6-a3f4-001e67ac06a0
? because that works (as per the twiki) and if I try to attach it again I obtain a sensible error
2020-12-07 15:54:34,195 ERROR Data identifier already added to the destination content.
Details: [u'(cx_Oracle.IntegrityError) ORA-00001: unique constraint (CMS_RUCIO_PROD.CONTENTS_PK) violated']
But when I try from python (and I tried attach_dids, add_datasets_to_container, add_containers_to_container... all of them call attach_dids eventually [1]) I always get (both with the existing block/did or with a new one):
RucioException: An unknown exception occurred.
Details: [u'(cx_Oracle.IntegrityError) ORA-02290: check constraint (CMS_RUCIO_PROD.CONTENTS_CHILD_TYPE_NN) violated']
and could not find a way to replicate the CLI success. I can list the container status just fine from python, just to show that I have some idea of what I am doing.
[1] example
Existing examples from Alan's WMCore seems to indicate that attach_dids should work.
Your last example won’t work because you are not attaching a container to a container but a block (rucio dataset) to a container.
I would expect that this would work just fine with the same set of parameters:
https://rucio.readthedocs.io/en/latest/api/did.html#rucio.client.didclient.DIDClient.attach_dids
I’d always suggest naming the parameters on the call for clarity and to get away from ordering issues.
Also, in the 2nd example, scope is ‘cms’ not ‘cms:’
Eric
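One way to avoid passing 'cms:' as a scope is to split the `scope:name` string before building the arguments. A minimal sketch: `split_did` is a hypothetical helper, and the `attach_dids` call is shown only as a comment because it needs a configured Rucio client.

```python
def split_did(did):
    """Split 'scope:name' into the dict form used in Rucio DID lists."""
    scope, sep, name = did.partition(':')
    if not sep or not scope or not name:
        raise ValueError('expected scope:name, got %r' % did)
    return {'scope': scope, 'name': name}


container = split_did(
    'user.belforte:/TapeRecall/201120_131722.belforte_crab_20201120_141717/USER')
block = split_did(
    'cms:/MuonEG/Run2016B-v1/RAW#86bc5e3e-1519-11e6-a3f4-001e67ac06a0')

# With a rucio.client.Client instance named client (not created here),
# the call with named parameters would then be:
# client.attach_dids(scope=container['scope'], name=container['name'], dids=[block])
print(container['scope'], block['scope'])
```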
that was it Eric !!!! Thanks !!!! Well spotted !!!!
Also, in the 2nd example, scope is ‘cms’ not ‘cms:’
Everything else you wrote does not apply, including that AFAICT the native Rucio API has names only for some of the parameters, maybe to indicate that those must be present, so it requires proper ordering... Alan followed the same practice in the WMCore wrappers. But I am not sure that, at least in the limited CRAB use cases, we gain anything by using the WMCore wrapping, other than obfuscation of the original API. A few utility functions to massage output or prepare input according to your needs are fine, but a wrapping layer that exposes some, but not all, of the inner functionality... hmmm...
P.S. I do not find methods to delete containers.
This is the relevant CLI command: https://rucio.readthedocs.io/en/latest/man/rucio.html?highlight=erase#erase Looking at how it is implemented, it is with a rather cryptic metadata change:
client.set_metadata(scope=scope, name=name, key='lifetime', value=86400)
basically setting the DID lifetime to 1 day.
thanks Nick. Now.. what would be a good policy here ? is the container did only useful until the rule requesting a disk copy is created ? (i.e. the rule becomes a rule for the individual datasets(aka block) or files?) Or do we need to keep the container around until ready to let disk replicas be removed from disk ?
indeed a container has a lifetime. But the concept is not elaborated upon in the documentation. A lifetime for a rule is sort of clear. Now. if a rule affects multiple containers with multiple life times.. the actual behavior is not defined.
I suppose setting the DID lifetime would have to imply a lifetime is set on any rules applied to it? Did we confirm that deleting a custom container DID does not cascade delete? For CRAB tasks that plan to analyze the whole CMS dataset, there's no need to create a new containers right? Will you just use the existing DID?
Hi @nsmith-, for the scope of this I expect that we can ignore DID lifetimes and live with the defaults. If by "summer" we have reasons to want to remove old containers created by CRAB (let's see how many and how harmful), we can dwell on the details. So far I see this mostly as a learning topic. What exactly does it mean that a DID has a finite lifetime? I was not planning to have things like "if full dataset .. else.." and would simply have uniform code which starts from a list of blocks; that fits better with the code that was written for Dynamo. But of course everything is possible, simply the more I change, the higher the chances of introducing bugs. I am looking for a way out of CRAB maintenance rather than job security :-) Since at times a new container is needed, let's make sure we know how to deal with it.
the recall request submission was introduced with https://github.com/dmwm/CRABServer/pull/6322 and now being tested in https://github.com/dmwm/CRABServer/releases/tag/v3.210108 but still with preliminary, test values for Rucio account, scope, destination RSE. Tasks will be put in SUBMITFAILED and users will have to monitor the rule progress and submit again when OK.
Once finalized and we'll have rule ids in CRAB TASKDB to track progress of, I'll work on automatic task resubmission.
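The periodic check could look roughly like the sketch below. It assumes that the rule dictionary returned by Rucio carries a 'state' key whose value is 'OK' once all replicas are in place; `recall_is_complete` and the stand-in `fake_get_rule` are hypothetical names, and a real caller would pass a Rucio client's `get_replication_rule` method instead.

```python
def recall_is_complete(get_rule, rule_id):
    """Return True once the rule reports state 'OK'.

    get_rule is any callable returning a rule dict for a rule ID
    (assumption: the dict has a 'state' key, 'OK' meaning done).
    """
    rule = get_rule(rule_id)
    return rule.get('state') == 'OK'


# Stand-in for a real client call, for illustration only.
def fake_get_rule(rule_id):
    return {'id': rule_id, 'state': 'REPLICATING'}


print(recall_is_complete(fake_get_rule, '1a31bb9828a34657a34d72258c6e5173'))
```

Passing the lookup as a callable keeps the check testable without a Rucio server.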
I think that all work on the TW side has been done, besides the automatic resubmission, and is now in https://github.com/dmwm/CRABServer/releases/tag/v3.210108p2, deployed on my VM and on the DEV instance of CrabServer. But while it works on stefanovm, it does not work in the server yet due to authentication issues in Rucio. Now tracked in https://github.com/dmwm/CRABServer/issues/6332
We can proceed to put CRABServer REST v3.210108p2 in PreProd and Production. That does not depend on all the Rucio things; it is simply to store the ruleId as a 32-char string instead of a number (as was the case for Dynamo).
The request submission part is done with tag v3.210110. For the automatic task release I will open a new issue, since this one has got too long.
will want to pick code examples from https://github.com/dmwm/CMSRucio/blob/master/DMOps/StageDatasetForUser.py as per following mail from Eric