Open benkrikler opened 10 years ago
I'll work on it. The third point means that a production run isn't guaranteed to use the same config file throughout.
The return code isn't checked because it's a submitted job; some extra hoops are required to check the exit status, but it can be done. Any errored-out job is marked as complete at this stage, and needs to be reset manually.
Point three makes it easier to change the modules file, but we can already get around this by editing the production.cfg file or making it a symlink.
If we think this is a big concern, could we implement it such that, for true production runs, one of us preloads a modules file for each run into the table, and then the script forces its use by failing if something else is specified at the command line?
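A rough sketch of that enforcement idea, with entirely hypothetical table and column names (`runs`, `modules`) since the real production DB schema may differ: the script looks up the preloaded modules file for the run and refuses any conflicting command-line value.

```python
import sqlite3

def resolve_modules_file(db_path, run, cli_modules=None):
    """Return the modules file to use for a production run.

    If a modules file was preloaded into the (hypothetical) `runs`
    table, any conflicting --modules value from the command line is
    an error; otherwise the command-line value is used as-is.
    """
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT modules FROM runs WHERE run = ?", (run,)
    ).fetchone()
    conn.close()
    preloaded = row[0] if row else None
    if preloaded and cli_modules and cli_modules != preloaded:
        raise ValueError(
            f"run {run} is locked to {preloaded}; refusing {cli_modules}"
        )
    return preloaded or cli_modules
```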
There have been some changes implementing these ideas. The changes have been bug-checked, though not exhaustively so. Specifically:

- Output file paths are stored in the database, so the output can be symlinked into a single directory with, e.g.:

  ```
  for outfile in $(sqlite3 ~/data/production.db "SELECT out FROM rootana_v2"); do ln -s $outfile ~/data/out/$(basename $outfile); done
  ```

- `--modules`, `--dataset`, and `--database` arguments are now accepted. The modules argument is ignored for alcapana production because of the way we use it (it's a compilation list hardcoded into the Makefile, so it can't be changed without jumping through other hoops first).
- Datasets are selected with `--dataset` flags; if absent, all are assumed.
- Exit status is checked with the `qacct -j <jobnumber>` command. However, once a job finishes it takes a few seconds to update whatever file `qacct` checks, so sometimes this crashes the program. I've mitigated this as best I can, but it may still pop up as an `AlCapException` at some point in the future.
- If a job has a nonzero exit code (which happens most often from uncaught exceptions in rootana such as `bad_alloc`), it's registered as "unclaimed" in the database and a rerun attempt is made. There are no limits on the number of rerun attempts.

I'm still working on something like a reset-run or reset-table tool.
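The `qacct` race described above (the accounting file lags job completion by a few seconds) can be handled with a bounded retry instead of crashing on the first miss. A minimal sketch, assuming Grid Engine's `qacct -j` output contains an `exit_status` line, with the parsing split out so it can be tested without a batch system:

```python
import re
import subprocess
import time

def parse_exit_status(qacct_output):
    """Pull the exit_status field out of `qacct -j` output, or None
    if the accounting record isn't there yet."""
    match = re.search(r"^exit_status\s+(\d+)", qacct_output, re.MULTILINE)
    return int(match.group(1)) if match else None

def qacct_exit_status(job_number, retries=5, delay=5):
    """Poll `qacct -j <jobnumber>` until the accounting file has been
    updated, then return the job's exit status."""
    for _ in range(retries):
        result = subprocess.run(
            ["qacct", "-j", str(job_number)],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            status = parse_exit_status(result.stdout)
            if status is not None:
                return status
        time.sleep(delay)  # accounting file not updated yet; try again
    raise RuntimeError(f"qacct never reported job {job_number}")
```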
Thanks John, this sounds awesome!
Could you explain how the `--database` argument works?
It sounds like we need to sort out rootana's exit code as well now, right?
The `--database` argument (if it works; I tested it a bit, but you know how these things go) takes the path of a database. The database must have the tables you're going to be using. So if you're starting a new alcapana production, it must have the datasets table. If you're starting a new rootana production, it must have the datasets table and the alcapana_v? table you're basing the rootana production off of. If you're continuing a production, that production table must be in there. At least, that was the goal.
This can all be accomplished by copying the main production DB and using the copy.
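The copy-then-use workflow above could be wrapped with a sanity check that the copy actually has the tables the chosen production mode needs (table names as described in the comment; the exact `alcapana_v?` name would depend on the production). A sketch:

```python
import shutil
import sqlite3

def prepare_database(main_db, work_db, required_tables):
    """Copy the main production DB and verify the copy contains the
    tables this production mode needs (e.g. 'datasets' for a new
    alcapana production; 'datasets' plus the alcapana table that a
    rootana production is based on)."""
    shutil.copyfile(main_db, work_db)
    conn = sqlite3.connect(work_db)
    have = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    conn.close()
    missing = set(required_tables) - have
    if missing:
        raise RuntimeError(f"{work_db} is missing tables: {sorted(missing)}")
    return work_db
```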
And the exit code should be set here, that's a good point. The only thing I check for is nonzero status.
Is this done?
I have checked for non-zero status, at which point the job is just resubmitted ad infinitum. The only thing not done is:
- Add an option / tool to reset a given run / range of runs / dataset. Perhaps you have this already and I didn't find it though.
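The core of such a reset tool would likely just flip rows in the production table back to an unclaimed state. A minimal sketch, with the `run` and `status` column names being assumptions about the production DB schema (the thread only says failed jobs are "registered as unclaimed"):

```python
import sqlite3

def reset_runs(db_path, table, first_run, last_run):
    """Mark a range of runs as 'unclaimed' so the production script
    will pick them up again. Returns the number of runs reset.

    Column names (`run`, `status`) are assumed, not taken from the
    actual schema.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        f"UPDATE {table} SET status = 'unclaimed' "
        "WHERE run BETWEEN ? AND ?",
        (first_run, last_run),
    )
    conn.commit()
    n = cur.rowcount
    conn.close()
    return n
```

A whole-dataset reset would be the same statement with a `WHERE run IN (SELECT ...)` against the datasets table instead of a run range.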
That's why I left this open.
@jrquirk, I raised some requests for changes to the run script tool before and thank you very much for implementing them. I had some other requests though, so I thought I'd write them up. Feel free to tell me they're ridiculous / not possible:
- Change the default configuration file to `rootana/configurations/production.cfg`. The current file (`rootana/production.cfg`) no longer exists in the repository.

Also, am I correct that if a run fails, the script tries to relaunch it? If so, how does it check for failure? If it's using the exit code of rootana, we need to look at changing rootana to make sure a non-zero code is given when we don't finish successfully (in particular when we `bad_alloc` due to large plots or otherwise, though this may already give us a non-zero exit code).
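Whatever rootana ends up returning, the relaunch decision on the script side reduces to checking the child process's return code. A minimal sketch of that side (a bounded attempt count is one way to avoid resubmitting a hopeless job forever; it only works if rootana actually exits non-zero on failure, e.g. from an uncaught `bad_alloc`):

```python
import subprocess

def run_job(cmd, max_attempts=3):
    """Run a command, retrying on nonzero exit status.

    Returns the attempt number that succeeded; raises after
    max_attempts failures rather than retrying forever.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"{cmd[0]} failed {max_attempts} times")
```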