Closed Andrew-McNab-UK closed 3 weeks ago
Comment1--are we guaranteed that Justin workflow counters won't reset in future? I.e. will there only ever be one Justin workflow #1000? (Or that multiple Justin instances won't have clashing workflow numbers?)
Comment2--it will take a bit of work on the part of those who are watching the data coming back to its final destination to figure out which type of file is in Justin-w1000s1p2: is it histograms, root files, etc.?
Comment3--"(b) specified datasets which do not exist, in which case the default RSE expression (ie just DUNE_US_FNAL_DISK_STAGE) is used. " It is not clear whether the default RSE expression to which you refer applies to (b) only or to both (a) and (b). Also, it appears that if per-RSE datasets are made then, according to this description, (1) there will be no rule pinning them to their initial destination, leaving them vulnerable to deletion, and (2) the rule will have to be made manually on the aggregate dataset of all RSE-specific datasets. Do I understand this correctly?
re. (2) there are in principle two defaults, the default RSE location and the default lifetime of the rule.
In all of this the priority has to be to take all steps to avoid inadvertent loss of data. What we have found is that if any step in the sequence of getting the data back is manual, someone will forget to do it at some point.
There's no default lifetime. If justIN is going to create the dataset, the user is forced to pick the lifetime themselves. In the current scenario, justIN checks at stage creation time whether the dataset exists and, if it does not, returns an error if no lifetime is given, i.e. the justin command displays the error immediately to the user. In the new third scenario, where even the dataset name is not given, justIN will again return an error if no lifetime is given.
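The check described above can be sketched roughly as follows. This is only an illustration of the rule as stated in this comment, not justIN's actual code: the function name `check_lifetime` and its parameters are hypothetical.

```python
from typing import Optional


def check_lifetime(dataset_given: bool,
                   dataset_exists: bool,
                   lifetime_days: Optional[int]) -> None:
    """Raise ValueError when a dataset would be created without a lifetime.

    Hypothetical helper: the name and signature are assumptions made
    for illustration only.
    """
    # A dataset has to be created either when no dataset name was given
    # (the new third scenario) or when the named dataset does not exist.
    must_create = (not dataset_given) or (not dataset_exists)

    # There is no default lifetime, so a missing value is an error that
    # would be reported to the user at stage creation time.
    if must_create and lifetime_days is None:
        raise ValueError("a lifetime must be given when the dataset "
                         "has to be created")
```

With this rule a pre-existing dataset passes without a lifetime, while the other two cases fail immediately rather than at some later point in the workflow.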
We would like to see this coded up because it could help considerably with the issues we are seeing with orphaned replicas at remote sites. The issue of how and where the rules are made remains a potential stopping point. The last 2-3 months show that if the user (even an expert production user) can do the wrong thing, they will. We need to be sure that files are protected by a rule at all stages of their lifetime.
This is included in 01.01
Request
This is similar to #137, which is for logs tgz files. Currently, if an output pattern for a stage within a workflow gives a Rucio dataset as the output location, and that dataset does not exist when the workflow is promoted from submitted to running, then the justin-finder agent creates the dataset with the default rule of transferring the files to DUNE_US_FNAL_DISK_STAGE, irrespective of which nearby RSE they are uploaded to by the wrapper job. Originally we attempted to create general rules to leave the files on the initial RSE by default. To reinstate this feature, the proposal is to create per-RSE datasets, each with a rule to leave files on that RSE.
Implementation
The plan is to leave things unchanged for (a) pre-existing datasets and (b) specified datasets which do not exist, in which case the default RSE expression (ie just DUNE_US_FNAL_DISK_STAGE) is used. Instead, a new feature will be added to the --output-pattern option. Currently, the value of this option is PATTERN:DESTINATION where DESTINATION is a direct https URL prefix or a Rucio dataset. This will be extended so that if :DESTINATION is omitted then per-RSE datasets will be created, with names based on the workflow and stage ID.
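As a rough sketch of how the extended option value might be interpreted (not the actual justIN implementation): the helper names `parse_output_pattern` and `per_rse_dataset_name`, and the dataset name format, are hypothetical. The sketch also assumes that PATTERN itself contains no colon and that an empty destination after the colon counts as omitted.

```python
from typing import NamedTuple, Optional


class OutputPattern(NamedTuple):
    pattern: str
    # None means no destination was given, so per-RSE datasets are wanted.
    destination: Optional[str]


def parse_output_pattern(value: str) -> OutputPattern:
    """Split PATTERN[:DESTINATION] into its two parts."""
    # Split on the first colon only: an https URL prefix and a Rucio
    # dataset (scope:name) both contain colons of their own.
    pattern, _, destination = value.partition(":")
    if not pattern:
        raise ValueError("output pattern must not be empty")
    return OutputPattern(pattern, destination or None)


def per_rse_dataset_name(workflow_id: int, stage_id: int, rse: str) -> str:
    """Build a per-RSE dataset name from the workflow and stage IDs.

    The format used here is an assumption for illustration; the issue
    only says names are based on the workflow and stage ID.
    """
    return f"w{workflow_id}s{stage_id}-{rse}"
```

For example, `parse_output_pattern("*.root")` gives a destination of `None`, which would trigger per-RSE dataset creation, while `parse_output_pattern("*.root:https://example.org/out/")` keeps the whole URL prefix as the destination.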