Per-RSE output datasets for workflows/stages

Andrew-McNab-UK commented 4 months ago

Request

This is similar to #137, which covers logs tgz files. Currently, if an output pattern for a stage within a workflow gives a Rucio dataset as the output location, and that dataset does not exist when the workflow is promoted from submitted to running, the justin-finder agent creates the dataset with the default rule of transferring the files to DUNE_US_FNAL_DISK_STAGE, irrespective of which nearby RSE the wrapper job uploaded them to. Originally we attempted to create general rules to leave the files on the initial RSE by default. To reinstate this feature, the proposal is to create per-RSE datasets, each with a rule to leave files on that RSE.

Implementation

The plan is to leave things unchanged for (a) pre-existing datasets and (b) specified datasets which do not exist, in which case the default RSE expression (i.e. just DUNE_US_FNAL_DISK_STAGE) is used. Instead, a new feature will be added to the --output-pattern option. Currently, the value of this option is PATTERN:DESTINATION, where DESTINATION is a direct https URL prefix or a Rucio dataset. This will be extended so that if :DESTINATION is omitted, per-RSE datasets will be created, with names based on the workflow and stage IDs.
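
A minimal sketch of how the extended option value could be split, assuming PATTERN itself never contains a colon. The function and type names are hypothetical and are not taken from the justIN code; rejecting a trailing colon with nothing after it is also an assumption.

```python
from typing import NamedTuple, Optional


class OutputPattern(NamedTuple):
    pattern: str
    destination: Optional[str]  # None means ":DESTINATION" was omitted


def parse_output_pattern(value: str) -> OutputPattern:
    """Split a PATTERN[:DESTINATION] value (hypothetical helper).

    Only the first colon is treated as the separator, so an https URL
    prefix or a scope:name style dataset survives intact as DESTINATION.
    """
    pattern, sep, destination = value.partition(":")
    if not pattern:
        raise ValueError("PATTERN must not be empty")
    if sep and not destination:
        # Assumption: a trailing colon with no destination is an error.
        raise ValueError("DESTINATION must not be empty when ':' is given")
    return OutputPattern(pattern, destination if sep else None)
```

With this sketch, a value with no colon yields `destination=None`, which is the case that would trigger per-RSE dataset creation.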

[ ] Each output pattern of a stage will be assigned an ID number, starting from 1, based on the order in which the patterns are declared.
[ ] If no DESTINATION is given, then when the workflow goes from submitted to running, per-RSE datasets will be created in MetaCat and Rucio with names of the form justin-w1000s1p2-RSENAME, where 1000 is the workflow ID, 1 is the stage ID, 2 is the pattern ID, and RSENAME is the RSE name.
[ ] To make it easier to set rules at the dataset level, datasets will be created in MetaCat and Rucio containing the per-RSE datasets, with names of the form justin-w1000s1p2, justin-w1000s1, and justin-w1000.
[ ] All these datasets will have relevant metadata about how they were created, e.g. justin-w1000s1p2 will have a key recording which pattern was used to find its files on disk in the jobs.
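
The naming scheme in the list above can be sketched as follows. This is only an illustration of the name formats given in this issue; the function names are hypothetical and nothing here talks to MetaCat or Rucio.

```python
from typing import Dict, List


def number_patterns(patterns: List[str]) -> Dict[int, str]:
    """Assign pattern IDs starting from 1, in declaration order."""
    return {pattern_id: p for pattern_id, p in enumerate(patterns, start=1)}


def per_rse_dataset_name(workflow_id: int, stage_id: int,
                         pattern_id: int, rse_name: str) -> str:
    """Name of the per-RSE dataset, e.g. justin-w1000s1p2-RSENAME."""
    return f"justin-w{workflow_id}s{stage_id}p{pattern_id}-{rse_name}"


def containing_dataset_names(workflow_id: int, stage_id: int,
                             pattern_id: int) -> List[str]:
    """Names of the higher level datasets, from pattern up to workflow."""
    return [
        f"justin-w{workflow_id}s{stage_id}p{pattern_id}",
        f"justin-w{workflow_id}s{stage_id}",
        f"justin-w{workflow_id}",
    ]


print(per_rse_dataset_name(1000, 1, 2, "DUNE_US_FNAL_DISK_STAGE"))
# prints "justin-w1000s1p2-DUNE_US_FNAL_DISK_STAGE"
```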

StevenCTimm commented 3 months ago

Comment 1: Are we guaranteed that justIN workflow counters won't reset in future? I.e. will there only ever be one justIN workflow #1000? (Or that multiple justIN instances might not have clashing workflow numbers?)

Comment 2: It will take a bit of work on the part of those who are watching the data coming back to its final destination to figure out which type of file is in justin-w1000s1p2: is it histograms, root files, etc.?

Comment 3: "(b) specified datasets which do not exist, in which case the default RSE expression (ie just DUNE_US_FNAL_DISK_STAGE) is used." It is not clear if the default RSE expression to which you refer applies to (b) or to both (a) and (b). Also, it appears that if per-RSE datasets are made then, according to this description, (1) there will be no rule pinning them to their initial destination, leaving them vulnerable to deletion, and (2) the rule will have to be made manually on the aggregate dataset of all RSE-specific datasets. Do I understand this correctly?

Andrew-McNab-UK commented 3 months ago

Yes, for the production instance, we won’t recycle the workflow ID numbers. They are consistent back to autumn 2022. For the development and now integration instances, they reuse a gap between 500 and 999 in the sequence. I’ll make sure we don’t get dev vs int collisions and have a procedure for resetting things at reinstallation time.
If the dataset is specified and already exists, then justIN does not change it or create a rule. It is the responsibility of the person who created it to do that. Only if the dataset is specified but does not exist, does justIN create a rule, using the default. The bit about having per-RSE rules for the new per-RSE datasets is right at the end of the Request section. justIN won’t create any rules for the higher level datasets, which does mean rules can be added by the user once they are happy with the data, and before their chosen expiration time applied via the per-RSE datasets is reached.

StevenCTimm commented 3 months ago

Re (2): there are in principle two defaults, the default RSE location and the default lifetime of the rule.
In all of this, the priority has to be to take all steps to avoid inadvertent loss of data. What we have found is that if any step in the sequence of getting the data back is manual, someone will forget to do it at some point.

Andrew-McNab-UK commented 3 months ago

There's no default lifetime. If justIN is going to create the dataset, the user is forced to pick the lifetime themselves. In the current scenario, justIN checks at stage creation time whether the dataset exists and, if it does not, returns an error if no lifetime is given, i.e. the justin command displays the error to the user immediately. In the new third scenario, where even the dataset name is not given, justIN will again return an error if no lifetime is given.
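
The three scenarios and the lifetime check can be summarised in a rough sketch. This is not the justIN implementation: the names are hypothetical, it only covers Rucio dataset destinations (not https URL prefixes), and the type of the lifetime value is left open because it is not specified in this thread.

```python
from enum import Enum
from typing import Optional


class Scenario(Enum):
    EXISTING_DATASET = "dataset specified and already exists"
    NEW_NAMED_DATASET = "dataset specified but does not exist"
    PER_RSE_DATASETS = "no dataset specified"


def classify_output(dataset: Optional[str], dataset_exists: bool,
                    lifetime: Optional[int]) -> Scenario:
    """Decide which scenario applies at stage creation time (sketch)."""
    if dataset is not None and dataset_exists:
        # justIN does not change the dataset or create a rule.
        return Scenario.EXISTING_DATASET
    # In both remaining scenarios justIN creates datasets, so the user
    # must supply a lifetime; there is no default.
    if lifetime is None:
        raise ValueError("a lifetime must be given when justIN creates the dataset")
    if dataset is None:
        # Per-RSE datasets, each with a rule leaving files on that RSE.
        return Scenario.PER_RSE_DATASETS
    # Named dataset created with the default RSE expression.
    return Scenario.NEW_NAMED_DATASET
```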

StevenCTimm commented 2 months ago

We would like to see this coded up because it could help considerably with the issues we are seeing with orphaned replicas at remote sites. The issue of how and where the rules are made remains a potential stopping point. The last 2-3 months show that if the user (even an expert production user) can do the wrong thing, they will. We need to be sure that files are protected by a rule at all stages of their lifetime.

Andrew-McNab-UK commented 3 weeks ago

This is included in 01.01

DUNE / dist-comp

Per-RSE output datasets for workflows/stages #138
