Open benkrikler opened 10 years ago
I'll work on it. The third point means that a production run isn't guaranteed to use the same config file throughout.
The return code isn't checked because it's a submitted job; some extra hoops are required to check the exit status, but it can be done. Any errored-out job is marked as complete at this stage, and needs to be reset manually.
Point three makes it easier to change the modules file, but we can already get around this by editing the production.cfg file or making it a symlink.
If we think this is a big concern, could we implement it such that, for true production runs, one of us preloads a modules file for each run into the table, and then the script forces its use by failing if something else is specified at the command line?
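A rough sketch of that enforcement idea, with entirely hypothetical table and column names (`runs`, `modules`) since the real production DB schema may differ: the script looks up the preloaded modules file for the run and refuses any conflicting command-line value.

```python
import sqlite3

def resolve_modules_file(db_path, run, cli_modules=None):
    """Return the modules file to use for a production run.

    If a modules file was preloaded into the (hypothetical) `runs`
    table, any conflicting --modules value from the command line is
    an error; otherwise the command-line value is used as-is.
    """
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT modules FROM runs WHERE run = ?", (run,)
    ).fetchone()
    conn.close()
    preloaded = row[0] if row else None
    if preloaded and cli_modules and cli_modules != preloaded:
        raise ValueError(
            f"run {run} is locked to {preloaded}; refusing {cli_modules}"
        )
    return preloaded or cli_modules
```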
There have been some changes implementing these ideas. The changes have been bug-checked, though not exhaustively so. Specifically:

- Output file paths are stored in the database, so the output can be symlinked into a single directory with, e.g.:

  ```
  for outfile in $(sqlite3 ~/data/production.db "SELECT out FROM rootana_v2"); do ln -s $outfile ~/data/out/$(basename $outfile); done
  ```

- `--modules`, `--dataset`, and `--database` arguments are now accepted. The modules argument is ignored for alcapana production because of the way we use it (it's a compilation list hardcoded into the Makefile, so it can't be changed without jumping through other hoops first).
- Datasets are selected with `--dataset` flags; if absent, all are assumed.
- Exit status is checked with the `qacct -j <jobnumber>` command. However, once a job finishes it takes a few seconds to update whatever file `qacct` checks, so sometimes this crashes the program. I've mitigated this as best I can, but it may still pop up as an `AlCapException` at some point in the future.
- If a job has a nonzero exit code (which happens most often from uncaught exceptions in rootana such as `bad_alloc`), it's registered as "unclaimed" in the database and a rerun attempt is made. There are no limits on the number of rerun attempts.

I'm still working on something like a reset-run or reset-table tool.
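The `qacct` race described above (the accounting file lags job completion by a few seconds) can be handled with a bounded retry instead of crashing on the first miss. A minimal sketch, assuming Grid Engine's `qacct -j` output contains an `exit_status` line, with the parsing split out so it can be tested without a batch system:

```python
import re
import subprocess
import time

def parse_exit_status(qacct_output):
    """Pull the exit_status field out of `qacct -j` output, or None
    if the accounting record isn't there yet."""
    match = re.search(r"^exit_status\s+(\d+)", qacct_output, re.MULTILINE)
    return int(match.group(1)) if match else None

def qacct_exit_status(job_number, retries=5, delay=5):
    """Poll `qacct -j <jobnumber>` until the accounting file has been
    updated, then return the job's exit status."""
    for _ in range(retries):
        result = subprocess.run(
            ["qacct", "-j", str(job_number)],
            capture_output=True, text=True,
        )
        if result.returncode == 0:
            status = parse_exit_status(result.stdout)
            if status is not None:
                return status
        time.sleep(delay)  # accounting file not updated yet; try again
    raise RuntimeError(f"qacct never reported job {job_number}")
```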
Thanks John, this sounds awesome!
Could you explain how the `--database` argument works?
It sounds like we need to sort out rootana's exit code as well now, right?
The `--database` argument (if it works; I tested it a bit, but you know how these things go) takes the path of a database. The database must have the tables you're going to be using. So if you're starting a new alcapana production, it must have the datasets table. If you're starting a new rootana production, it must have the datasets table and the alcapana_v? table you're basing the rootana production off of. If you're continuing a production, that production table must be in there. At least, that was the goal.
This can all be accomplished by copying the main production DB and using the copy.
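The copy-then-use workflow above could be wrapped with a sanity check that the copy actually has the tables the chosen production mode needs (table names as described in the comment; the exact `alcapana_v?` name would depend on the production). A sketch:

```python
import shutil
import sqlite3

def prepare_database(main_db, work_db, required_tables):
    """Copy the main production DB and verify the copy contains the
    tables this production mode needs (e.g. 'datasets' for a new
    alcapana production; 'datasets' plus the alcapana table that a
    rootana production is based on)."""
    shutil.copyfile(main_db, work_db)
    conn = sqlite3.connect(work_db)
    have = {row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    conn.close()
    missing = set(required_tables) - have
    if missing:
        raise RuntimeError(f"{work_db} is missing tables: {sorted(missing)}")
    return work_db
```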
And the exit code should be set here, that's a good point. The only thing I check for is nonzero status.
Is this done?
I have checked for non-zero status, at which point the job is just resubmitted ad infinitum. The only thing not done is:
- Add an option / tool to reset a given run / range of runs / dataset. Perhaps you have this already and I didn't find it though.
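The core of such a reset tool would likely just flip rows in the production table back to an unclaimed state. A minimal sketch, with the `run` and `status` column names being assumptions about the production DB schema (the thread only says failed jobs are "registered as unclaimed"):

```python
import sqlite3

def reset_runs(db_path, table, first_run, last_run):
    """Mark a range of runs as 'unclaimed' so the production script
    will pick them up again. Returns the number of runs reset.

    Column names (`run`, `status`) are assumed, not taken from the
    actual schema.
    """
    conn = sqlite3.connect(db_path)
    cur = conn.execute(
        f"UPDATE {table} SET status = 'unclaimed' "
        "WHERE run BETWEEN ? AND ?",
        (first_run, last_run),
    )
    conn.commit()
    n = cur.rowcount
    conn.close()
    return n
```

A whole-dataset reset would be the same statement with a `WHERE run IN (SELECT ...)` against the datasets table instead of a run range.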
That's why I left this open.
@jrquirk, I raised some requests for changes to the run script tool before and thank you very much for implementing them. I had some other requests though, so I thought I'd write them up. Feel free to tell me they're ridiculous / not possible:
- Change the default configuration file to `rootana/configurations/production.cfg`. The current file (`rootana/production.cfg`) no longer exists in the repository.

Also, am I correct that if a run fails, the script tries to relaunch it? If so, how does it check for failure? If it's using the exit code of rootana, we need to look at changing rootana to make sure a non-zero code is given when we don't finish successfully (in particular when we `bad_alloc` due to large plots or otherwise, though this may already give us a non-zero exit code).
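Whatever rootana ends up returning, the relaunch decision on the script side reduces to checking the child process's return code. A minimal sketch of that side (a bounded attempt count is one way to avoid resubmitting a hopeless job forever; it only works if rootana actually exits non-zero on failure, e.g. from an uncaught `bad_alloc`):

```python
import subprocess

def run_job(cmd, max_attempts=3):
    """Run a command, retrying on nonzero exit status.

    Returns the attempt number that succeeded; raises after
    max_attempts failures rather than retrying forever.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return attempt
    raise RuntimeError(f"{cmd[0]} failed {max_attempts} times")
```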