Closed hufnagel closed 12 years ago
metson: Milestone T0 2_0_0 deleted
hufnagel: Include the job splitting parameters in the WMSpec, including the jobNamePrefix, which should be "Repack-Run
hufnagel: To get the ball rolling on this, there are open questions about a missing Config.DP method that gives you a repack config and also in general about how to configure the multiple outputs from repacking.
For the first version of this, you could just use a merge configuration for both the repack and the repackmerge part. The 'repack' job would get a merge configuration from Config.DP with a single output (single dataset), but otherwise it would be configured like a processing job, with normal unmerged output and support for direct-to-merge. We can also use this for developing the support for the error datasets (two output modules in the spec, one for normal, one for error dataset and selecting one of them at runtime based on information passed from the job splitter).
hufnagel: Another thing we could also already cover with this 'fake' repack spec is #3124.
evansde: I have commit rights in Conf/DP, let's just write a repacking method in there and go with that.
    def repack(self, whatArgsGoHere, *streamers):
        process = cms.Process("Repackappottamus")
        ...
        return process
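A runnable approximation of that sketch, with a plain dict standing in for cms.Process so it runs outside CMSSW. The argument names and the use of a streamer-file source are assumptions for illustration, not the eventual Conf/DP interface:

```python
# Hypothetical sketch of the proposed Configuration.DataProcessing repack
# method. A plain dict stands in for cms.Process here; the real method
# would build and return an actual CMSSW Process object.

def repack(global_tag, *streamer_files):
    """Return a minimal repacking configuration (illustrative only)."""
    process = {
        "name": "Repackappottamus",
        "globalTag": global_tag,
        "source": {
            "type": "NewEventStreamFileReader",  # reads streamer files
            "fileNames": list(streamer_files),
        },
        "outputModules": {},  # to be filled in, one module per dataset
    }
    return process

config = repack("GLOBALTAG::All", "file:stream1.dat", "file:stream2.dat")
```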
sfoulkes: Working on this. There are a couple of things I'm unsure about that we need to firm up:
I'll have to do two patches: one to bring the Repack spec in the T0 repo up to date, and one to remove the assumption in WMBase that all workflows only produce a single primary dataset.
sfoulkes: We should also fix #2949 and verify that the dataset naming in StdBase is OK.
sfoulkes: I'm also unsure of how error datasets work. I set the merge task up to have two output modules: the regular one and a "MergedError" output module. The primary dataset in the MergedError module has "Error" appended to it. One of these will be turned off at runtime depending on the size of the input data.
sfoulkes: The WMSpec stuff doesn't like complex types as values, so my SelectEvents data structure isn't going to fly. We'll have to make the repack method in Conf/DP something like: def repack(self, globalTag, **selectEvents)
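A sketch of what that flattened interface could look like: since the WMSpec reportedly only stores simple types, the dataset-to-trigger-path mapping is passed as keyword arguments of plain strings. All names here (module names, key names) are hypothetical:

```python
# Hypothetical flattened repack interface: each keyword argument maps an
# output module name to its SelectEvents string, avoiding complex types
# in the WMSpec.

def repack(global_tag, **select_events):
    """Build one output module per keyword, each with its SelectEvents."""
    modules = {}
    for module_name, paths in select_events.items():
        modules[module_name] = {
            "SelectEvents": [p.strip() for p in paths.split(",")],
        }
    return {"globalTag": global_tag, "outputModules": modules}

cfg = repack("GLOBALTAG::All",
             write_PhysicsA="HLT:path1, HLT:path2",
             write_PhysicsB="HLT:path3")
```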
hufnagel: There is already support in ConfigBuilder to pass in dictionaries to select output modules. I haven't gotten around to testing it yet though. Would look something like this
    outputs = [ { 'dataTier' : 'RECO', 'selectEvents' : 'HLT:path1, HLT:path2' },
                { 'dataTier' : 'RECO', 'selectEvents' : 'HLT:path3, HLT:path4' } ]
However we package this, the content needs to be passed to Config.DP. That is, a list of output modules and, for each output module, the tier and the selectEvents string.
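A sketch of expanding that list of dicts into named output modules, one per entry, each carrying its tier and its SelectEvents paths. The module naming scheme is an assumption for illustration:

```python
# Hypothetical expansion of the proposed outputs list into output module
# definitions. Module names are derived from tier and position here; the
# real naming convention would come from Config.DP.

outputs = [{"dataTier": "RECO", "selectEvents": "HLT:path1, HLT:path2"},
           {"dataTier": "RECO", "selectEvents": "HLT:path3, HLT:path4"}]

output_modules = {}
for index, entry in enumerate(outputs):
    name = "write_%s_%d" % (entry["dataTier"], index)
    output_modules[name] = {
        "dataTier": entry["dataTier"],
        "SelectEvents": [p.strip() for p in entry["selectEvents"].split(",")],
    }
```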
hufnagel: Btw, the determination of when to use the errorDataset output isn't primarily based on size. Output only goes to the error dataset if we have to break up a lumi, and the decision whether to break up a lumi is made at job scheduling time. When that decision is made, the information somehow needs to be stored with the job and passed to the runtime environment.
sfoulkes: Replying to [comment:19 hufnagel]:
Btw, the determination of when to use the errorDataset output isn't primarily based on size. Output only goes to the error dataset if we have to break up a lumi, and the decision whether to break up a lumi is made at job scheduling time. When that decision is made, the information somehow needs to be stored with the job and passed to the runtime environment.
This decision happens in the RepackMerge splitting algo, right? The spec creates the RepackMerge jobs with normal output modules and error output modules. Then one of the modules is turned off at runtime. We don't have to do anything with error datasets in the Repack tasks themselves, right?
hufnagel: Yes, the decision is made in the repackmerge splitting algo, because only it has the relevant Tier0 configuration parameters (whether to split at all and at what thresholds) and the information about which files belong to which lumi and their sizes.
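A sketch of the break-up decision that splitting algo would make per lumi: if the lumi's total input size fits under the merge threshold, it merges as one job; otherwise its files are packed into several jobs and the output is flagged for the error dataset. The threshold parameter and greedy packing are assumptions, not the actual WMCore algorithm:

```python
# Hypothetical per-lumi break-up decision for the repackmerge splitter.
# Returns whether the lumi was broken up (error dataset case) and the
# file-index groups forming each merge job.

def split_lumi(file_sizes, max_merge_size):
    """Return (breakup, job_groups) for one lumi's input file sizes."""
    if sum(file_sizes) <= max_merge_size:
        # Whole lumi fits into a single normal merge job.
        return False, [list(range(len(file_sizes)))]
    # Break the lumi up: greedily pack files into jobs under the threshold.
    jobs, current, current_size = [], [], 0
    for i, size in enumerate(file_sizes):
        if current and current_size + size > max_merge_size:
            jobs.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    jobs.append(current)
    return True, jobs
```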
The actual repackmerge job itself and its CMSSW configuration have one output module; the only thing we need to change at runtime is whether the resulting file is accounted to the normal dataset or the error dataset.
Not sure what the best way to do this is. I thought (from our previous discussions) that we could overload the single CMSSW output with two output definitions in the spec, one for the normal and one for the error dataset, and then remove the one we do not need at runtime (or at the very least only use the one we need).
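That runtime selection could be sketched like this: the spec carries both definitions for the single CMSSW output module, and the job keeps only the one matching the splitter's decision. The definition names and keys are illustrative, not the actual spec schema:

```python
# Hypothetical runtime selection between the two output definitions the
# spec would carry for the single repackmerge output module.

def select_output(output_defs, use_error_dataset):
    """Keep only the normal or error dataset definition for the output."""
    wanted = "MergedError" if use_error_dataset else "Merged"
    return {name: d for name, d in output_defs.items() if name == wanted}

defs = {"Merged":      {"primaryDataset": "MinimumBias"},
        "MergedError": {"primaryDataset": "MinimumBiasError"}}
```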
sfoulkes: Initial Repack Spec is attached. There's not a lot going on, almost everything is already handled by the setupProcessingTask() and setupMergeTask() methods in StdBase. Questions for Dirk:
sfoulkes: Second patch contains changes to WMCore. It's about 50% cleanup, 25% better support for more involved Config.DP configs and 25% support for the more elaborate T0 merging. Dave, could you review this?
sfoulkes: (In 15201) Modify StdBase so that it doesn't assume that all workflows have only run over a single primary dataset. Modify the addMergeTask() method to support error datasets. Minor cleanup in the other specs. Fixes #1796.
From: Steve Foulkes sfoulkes@fnal.gov
hufnagel: Still need to look at the changes in the T0 code.
hufnagel: De-scoped this a bit to get something working more quickly. The version attached is fully featured repacking, but instead of the repack merge with the active split-lumi protections and the error dataset support, I am using a standard merge for now.
hufnagel: Please review (both #1796 and #3578) by checking that the RunConfig and Tier0Feeder unit tests work
mnorman: Tested in conjunction with #3578
hufnagel: (In eae1a75ab7113c5181c44939fb4f781be62f9863) Create Repack WMSpec, fixes #1796
Signed-off-by: Dirk Hufnagel Dirk.Hufnagel@cern.ch
We need a WMSpec that will run the repacking. It needs two tasks, first the actual repacking and then a merge step.
Each Repack WMSpec is stream specific. Embedded in the WMSpec is the dataset to trigger path mapping for the given stream. This is passed at runtime to Configuration.DataProcessing and returns a valid repacking configuration. This system is not commissioned yet, so for early testing we can also make up a repacking configuration, store it in the ConfigCache and embed the id in the WMSpec.
The post-repacking merge step cannot be implemented as a standard WMCore merge step. This is because of error datasets. If repacker size protections kick in, we need to decide at merge time whether the output goes to the normal dataset or an error dataset. The way we'll implement this is with a custom repack merge job splitting algorithm that passes the normal/error dataset decision to the job. At runtime the job then evaluates this flag and configures one of the two normal/error dataset output modules. Both output module to fileset mappings need to be defined in the WMSpec though.
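The flag handoff described above can be sketched as follows: the splitter stores its decision as a job parameter at splitting time, and the runtime reads it to pick the target dataset. The field names are illustrative, not the actual WMCore job schema:

```python
# Hypothetical handoff of the splitter's normal/error decision: stored
# with the job when it is created, read back in the runtime environment.

def make_merge_job(input_files, breakup_needed):
    """Create a merge job carrying the splitter's error-dataset flag."""
    return {"inputFiles": input_files,
            "useErrorDataset": breakup_needed}

def dataset_for_job(job, primary_dataset):
    """At runtime, route output to the normal or the error dataset."""
    return primary_dataset + ("Error" if job["useErrorDataset"] else "")
```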
Requires #2481 and #3096