alcap-org / AlcapDAQ

Rootana production problems #148

Closed jrquirk closed 10 years ago

jrquirk commented 10 years ago

Here I think we can walk through this first production together. The problem I'm fixing now:

Submitting run: 3450
3450
ERROR: There was an uncaught exception. Aborting...
Cancelling job 1030825
Cancelling job 1030826
Cancelling job 1030827
Cancelling job 1030828
Cancelling job 1030829
Cancelling job 1030830
Cancelling job 1030831
Cancelling job 1030844
Cancelling job 1030833
Aborting runs: [3450, 3286, 3287, 3288, 3289, 3290, 3291, 3292, 3326, 3449]
Traceback (most recent call last):
  File "./run_production.py", line 172, in <module>
    sub = runman.SubmitOne()
  File "/gpfs/home/quirk_j/AlcapDAQ/analyzer/batch/jobscripts/RunManager.py", line 150, in SubmitOne
    job = mu.submit_job(run, self.prod, infile)
  File "/gpfs/home/quirk_j/AlcapDAQ/analyzer/batch/jobscripts/merlin_utils.py", line 131, in submit_job
    con = subprocess.Popen(cmd, stdout=subprocess.PIPE)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
TypeError: execv() arg 2 must contain only strings

It seems this tree file doesn't exist. So I'm marking them as having an error status (E) in the database.

Edit: I think an error like this when a file doesn't exist is okay, because it should've existed. Runs 3450-3454 are now marked as 'E'.

Edit: Runs 3771 and 3772 are missing as well.

Edit: And 3770
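
The TypeError in the traceback above is what subprocess.Popen raises when the argument list contains anything that is not a string, which is exactly what happens if a failed tree-file lookup leaves a None in the command. A minimal sketch of a louder check (the helper name and qsub command are illustrative, not the actual merlin_utils code):

import subprocess

def submit_job_safely(cmd):
    # Popen hands the argument list straight to execv(); any non-string
    # element (None, an int run number, ...) produces the opaque
    # "execv() arg 2 must contain only strings" TypeError seen above.
    bad = [a for a in cmd if not isinstance(a, str)]
    if bad:
        raise ValueError("non-string arguments in command: %r" % (bad,))
    return subprocess.Popen(cmd, stdout=subprocess.PIPE)

# A missing tree file can leave the input path as None, for example:
infile = None
try:
    submit_job_safely(["qsub", "-o", "run03450.log", infile])
except ValueError as err:
    print("refusing to submit: %s" % err)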

benkrikler commented 10 years ago

Has production started then, or are you just testing? I haven't had a chance to sort out the plotting modules. I was about to do it now, and hopefully it won't take me long, so if we can hold off actually starting for a few hours that would be good.

jrquirk commented 10 years ago

I had actually started. But we can start again whenever. It's not a big time commitment to start it. All of a minute.

benkrikler commented 10 years ago

Ok. I'll be as quick as I can.

benkrikler commented 10 years ago

Here's a problem for us:

gAnalysedPulseMap, which we've been using to store the TAPs, is still a StringAnalPulseMap. Somehow it was never changed to a SourceAnalPulseMap, meaning that the keys are just strings. Moreover, those strings are currently filled with just the detector name, so when we run with multiple generators each one overwrites the TAPs of the previous generator that looked at that channel. That was the whole point of the IDs::source class, yet we never actually switched to using them.

I'm going to change this over, but I'll have to go through all code that accesses gAnalysedPulseMap and reflect this change. To be continued no doubt....
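
The overwriting problem is easy to see in a stripped-down sketch; Python dictionaries here stand in for the C++ gAnalysedPulseMap, so this is only an illustration of the keying issue, not the rootana code itself:

# Keyed by detector name only: the second generator clobbers the first.
taps_by_string = {}
taps_by_string["SiR1-4-S"] = ["TAPs from MaxBin"]
taps_by_string["SiR1-4-S"] = ["TAPs from CFTime"]   # overwrites the MaxBin TAPs
assert len(taps_by_string) == 1

# Keyed by a (detector, generator) source, as IDs::source is meant to provide:
taps_by_source = {}
taps_by_source[("SiR1-4-S", "MaxBin")] = ["TAPs from MaxBin"]
taps_by_source[("SiR1-4-S", "CFTime")] = ["TAPs from CFTime"]
assert len(taps_by_source) == 2   # both generators' pulses survive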

benkrikler commented 10 years ago

Facepalm moment coming:

We have no mechanism to write the TDetectorPulses or TMuonEvents to file right now, so these won't appear in the output files. I can probably copy over the TAnalysedPulseMapWrapper code quite quickly if we want. It's lightweight stuff so it shouldn't be too difficult.

benkrikler commented 10 years ago

I've gone through all code and swapped gAnalysedPulseMap to be of SourceAnalPulseMap type. Everything is compiling ok and seems to run fine, though it's hard to do a thorough check. I guess that's the point of the run.

The PulseViewer, however, is now having trouble because we need to specify a source string, but since we're not using this module for the production I'll leave it for now. It shouldn't be too hard to fix anyway.

I'll finish off the PlotMuScTDiff module, allow MakeAnalysedPulses to run with multiple default generators, quickly test it and then push everything.

AndrewEdmonds11 commented 10 years ago

Edit: I think an error like this when a file doesn't exist is okay, because it should've existed. Runs 3450-3454 are now marked as 'E'.

We do have trees 3450 - 3454...

Edit: Runs 3771 and 3772 are missing as well.

...and 3771 (not 3772)...

Edit: And 3770

...and 3770

What's happening here?

benkrikler commented 10 years ago

I've done the best I can, but I've found a series of bugs, each one leading to another.

I've published the feature, but not yet merged it. Here's a summary of the changes I've had to make:

I think this is now in a state to run, but unfortunately I've not been able to fix an infinite loop problem in the MaxTimeDiffDPgenerator algorithm. That means we can't run with MakeDetectorPulses, and therefore not with MakeMuonEvents, this time. I'll open another ticket for this problem to be solved in the future. It was previously not detected because it seems MakeDetectorPulses was not obtaining the pulse lists properly. EDIT: And as I said above, there's no way of writing TDPs or TMEs to file yet, so there's not much point in running with them unless it's to test timing / performance.

All of these problems would be easier if we had a functioning event navigator, hint Phill hint.

benkrikler commented 10 years ago

I'm going to merge all this on to develop if no one objects. Then we can resume production?

AndrewEdmonds11 commented 10 years ago

Well, then we'll have to merge it onto master and tag it (v1.1?). Then, looking at John's instructions, it's just a case of changing this:

./run_production.py --production=rootana --version=1

to this:

./run_production.py --production=rootana --version=1.1

AndrewEdmonds11 commented 10 years ago

And also working out what's wrong with the missing files...

benkrikler commented 10 years ago

I'll merge this into develop then, but should I hold off tagging another production version while we sort out the missing files?

AndrewEdmonds11 commented 10 years ago

I think so. I don't entirely understand what the problem is, so I think we should wait for John.

jrquirk commented 10 years ago

A note on how the production looks for the files. The tree locations are saved in the production database (/gpfs/home/$USER/data/tree/$VERSION/$FILE) except this time where the VERSION just isn't a thing.
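
In other words, the scripts reconstruct the expected path and only then look on disk. A rough sketch of that lookup, assuming the layout quoted above; the version-directory naming and the tree file name are guesses:

import os

def tree_path(user, version, filename):
    # Layout from the note above: /gpfs/home/$USER/data/tree/$VERSION/$FILE.
    # The "v%d" directory naming and the file name are assumptions here.
    return os.path.join("/gpfs/home", user, "data", "tree", "v%d" % version, filename)

path = tree_path("quirk_j", 1, "tree03450.root")
if not os.path.isfile(path):
    print("missing tree file, run gets status 'E': %s" % path)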

We do have trees 3450 - 3454...

I thought they'd be in Joe's directory and didn't bother checking any others. I found them now.

Well, then we'll have to merge it onto master and tag it (v1.1?)

The production scripts only allow integer version numbers, so this would be version 2 in the production database. If a new version is started it will be version 2 (it's automatically incremented). We could change it, but I don't think it matters. To start a new production, a single person should type

./run_production.py --production=rootana --new=git_software_tag --version=3

This says to start a new version of the rootana production, making a note of which software tag it uses, and to use alcapana version 3 files. It will print out something like "Starting new version ##". Then, if ## is 2 (as it should be), people can run

./run_production.py --production=rootana --version=2

And this says to continue with rootana production version 2. Note that the version argument has a different meaning when the --new argument is passed (which can also be changed if people want). Anyway, people can also just run

./run_production.py --production=rootana

and it will just look for the most recent version and run that. I didn't mention that before because I kind of regret putting that functionality in. The more problems we run into, the more I realize @litchfld's view that failing loudly is almost always preferable, which here means not having ambiguous defaults.
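
A toy sketch of the two behaviours being contrasted, with hypothetical names rather than the real RunManager / production-table API:

def resolve_version(requested=None, latest_in_db=2):
    # 'requested' stands in for the --version argument and 'latest_in_db'
    # for a query of the production table; both names are hypothetical.
    if requested is None:
        # Current behaviour: silently fall back to the most recent version.
        # The fail-loudly alternative would be:
        #   raise ValueError("no --version given; refusing to guess")
        return latest_in_db
    return int(requested)

print(resolve_version())      # 2, the ambiguous default
print(resolve_version("1"))   # 1, explicitly requested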

AndrewEdmonds11 commented 10 years ago

Thanks, John. Let us know when we can restart production.

benkrikler commented 10 years ago

I've just merged my feature branch in and quickly tested it. It looks reasonable to me.

I also updated the production.cfg file to remove the MakeDetectorPulses and MakeMuonEvents modules, add in the plot modules and use only one MakeAnalysedPulses module. @jrquirk maybe check that it looks ok to you first?

jrquirk commented 10 years ago

I'll give the file a quick eyeball check. Stay tuned.

Also, of the files I thought were missing, Andy pointed out one actually is. I'm marking it as 'E' and we'll deal with it.

I'm also getting rid of the last production's data table. So this next version, once started, will be version 1 in the production table but will be tagged with the v1.1 software version.

benkrikler commented 10 years ago

Could we add an integer sub-version column to the table? We had a couple of false starts with alcapana as well, so it might make life easier.

jrquirk commented 10 years ago

Yes, but I'm going to wait on that. I know I don't want version to be a float, but dot-separated ints like that are okay. This will take a few lines of code and a few tests.

jrquirk commented 10 years ago

Okay, it's ready. Run

$ git fetch; git checkout rootana_v1.1
$ ./run_production.py --production=rootana --version=1 --display

The --display flag adds a curses display; you can also run without it by omitting the flag.

There will be errors, as always, so let's keep this open.

benkrikler commented 10 years ago

Don't forget to recompile rootana!

AndrewEdmonds11 commented 10 years ago

Which version of alcapana do we need to compile? The one in the data_quality directory? Or the $USER directory?

jrquirk commented 10 years ago

The one in your user directory. Source

benkrikler commented 10 years ago

As long as they all provide the same version of TSetupData, TGlobalData and TPulseIsland I don't think it matters. EDIT: What John says sounds safer.

AndrewEdmonds11 commented 10 years ago

Ok, I've managed to get it working. I'll let you know if it crashes.

jrquirk commented 10 years ago

When I try to run the display under Jim's account, I get some out-of-bounds exception. But it still works without the display flag.

benkrikler commented 10 years ago

I think I'm seeing the same error John described for Jim's account:

$ ./run_production.py --production=rootana --version=1 --display
                                    File "./run_production.py", line 146, in <module>
                                                                                         screenman.SetProgram(production)  
                                                                                                                           File "/gpfs/home/krikler_b/AlcapDAQ/analyzer/batch/jobscripts/ScreenManager.py", line 77, in SetProgram
                                                                                        self.Draw()  
                                                                                                     File "/gpfs/home/krikler_b/AlcapDAQ/analyzer/batch/jobscripts/ScreenManager.py", line 188, in Draw
                                                             self.screen.addstr("%-80s" % self.messages[self.print_msg])
                                                                                                                        IndexError: list index out of range

The spacing is as it appeared for me, so the funny line wrapping suggests it's upsetting the columns. It also breaks readline, so I need to run reset to get control of the terminal back.
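
The IndexError itself looks like Draw() assuming there is always a message at self.print_msg. A guess at the kind of bounds check that would keep the display from taking the whole script down; only the attribute names are taken from the traceback, the rest is speculation:

def draw_messages(screen, messages, print_msg):
    # Only draw when the requested message exists; an out-of-range index
    # otherwise kills the whole production script, as in the trace above.
    if 0 <= print_msg < len(messages):
        screen.addstr("%-80s" % messages[print_msg])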

jrquirk commented 10 years ago

When you use ncurses, it's common to turn off "echo" in the shell. Upon successful exiting echoing is reenabled. Upon failed exiting echo is usually not. Hence why we always need to type "reset".
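
For illustration, the standard way to get that clean-up even when the program dies is to let curses do the save/restore itself; this is generic curses usage, not the actual ScreenManager code:

import curses

def run_display(stdscr):
    # curses.wrapper() has already turned echo off (noecho) and entered
    # cbreak mode; if this function raises, wrapper restores sane terminal
    # settings before re-raising, so no manual "reset" is needed.
    stdscr.addstr(0, 0, "rootana production display")
    stdscr.refresh()
    stdscr.getch()

if __name__ == "__main__":
    curses.wrapper(run_display)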

benkrikler commented 10 years ago

When you use ncurses, it's common to turn off "echo" in the shell. Upon successful exiting echoing is reenabled. Upon failed exiting echo is usually not. Hence why we always need to type "reset".

I learn something new every day on Alcap....

jrquirk commented 10 years ago

We're getting std::bad_allocs from, we think, the AmpvTDiff histos, so the offending module has been removed from production.cfg and version 1.2 of rootana has been tagged. I need to reset the production database now, though, and then we can restart. The display problem is also fixed.

benkrikler commented 10 years ago

Do we need to do any particular cleanup? Delete the old output files etc?

jrquirk commented 10 years ago

Oh yeah. Remove the logs and output files in log/ and out/.

jrquirk commented 10 years ago

I'm still getting a couple of bad_allocs, and jobs are sitting in the queue for over an hour and a half. I'm stopping for now. I think we should profile this a bit.

AndrewEdmonds11 commented 10 years ago

Ok. I've restarted.

Nice display BTW

AndrewEdmonds11 commented 10 years ago

I've now run out of space but I did manage to do 230 runs.

litchfld commented 10 years ago

Which version of alcapana do we need to compile? The one in the data_quality directory? Or the $USER directory?

The one in your user directory.

As long as they all provide the same version of TSetupData, TGlobalData and TPulseIsland I don't think it matters. EDIT: What John says sounds safer.

Now that we've stopped active development of alcapana, perhaps it would make sense to put this stuff in a standard location rather than .../$USER/...

jrquirk commented 10 years ago

@AndrewEdmonds11 how many of your runs failed due to bad_alloc?

AndrewEdmonds11 commented 10 years ago

The only run with an error log with something in it is run 3257. Wasn't the bad_alloc error from v1.1?

jrquirk commented 10 years ago

Yes, but even so, 2 of the six or eight runs that actually made it through the queue still failed.

AndrewEdmonds11 commented 10 years ago

Ok. Here's what I had in the error log:

 *** Break *** segmentation violation

===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0  0x00000037fe2ac8be in waitpid () from /lib64/libc.so.6
#1  0x00000037fe23e909 in do_system () from /lib64/libc.so.6
#2  0x00002b3472fab998 in TUnixSystem::StackTrace (this=0x1e4ba50) at /gpfs/home/edmonds_a/AlcapDAQ/root/core/unix/src/TUnixSystem.cxx:2403
#3  0x00002b3472faa813 in TUnixSystem::DispatchSignals (this=0x1e4ba50, sig=kSigSegmentationViolation) at /gpfs/home/edmonds_a/AlcapDAQ/root/core/unix/src/TUnixSystem.cxx:1279
#4  <signal handler called>
#5  0x000000000046448b in IslandAmplitude::AfterLastEntry (this=0x2661b00, gData=<value optimized out>, setup=<value optimized out>) at src/data_quality/IslandAmplitude.cpp:157
#6  0x000000000041c7b5 in Main_event_loop (dataTree=<value optimized out>, arguments=<value optimized out>) at src/main.cpp:243
#7  0x000000000041ce0b in main (argc=<value optimized out>, argv=<value optimized out>) at src/main.cpp:153
===========================================================

The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5  0x000000000046448b in IslandAmplitude::AfterLastEntry (this=0x2661b00, gData=<value optimized out>, setup=<value optimized out>) at src/data_quality/IslandAmplitude.cpp:157
#6  0x000000000041c7b5 in Main_event_loop (dataTree=<value optimized out>, arguments=<value optimized out>) at src/main.cpp:243
#7  0x000000000041ce0b in main (argc=<value optimized out>, argv=<value optimized out>) at src/main.cpp:153

and the last few lines of stdout:

SiR1-3-F: using generator: CFTime with 1 option(s).
SiR1-3-F: using generator: SimpInt with 1 option(s).
ScGe: using generator: MaxBin with 1 option(s).
ScGe: using generator: CFTime with 1 option(s).
ScGe: using generator: SimpInt with 1 option(s).
SiR1-4-S: using generator: MaxBin with 1 option(s).
SiR1-4-S: using generator: CFTime with 1 option(s).
SiR1-4-S: using generator: SimpInt with 1 option(s).
SiR1-4-F: using generator: MaxBin with 1 option(s).
SiR1-4-F: using generator: CFTime with 1 option(s).
SiR1-4-F: using generator: SimpInt with 1 option(s).
-----IslandAmplitude::AfterLastEntry(): I'm debugging!

benkrikler commented 10 years ago

2 out of my 30 files bad_alloced. One made it to at least event 1221, but the other didn't get much further than event 131. I wonder what our memory limits are? Can we use ulimit -v to expand them, perhaps?
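
For reference, the same limits can be read (and the soft ones raised) from inside a Python job script with the standard resource module; this is a sketch of what could be done, not something the batch scripts currently do:

import resource

# ulimit -v corresponds to RLIMIT_AS (address space / virtual memory).
soft, hard = resource.getrlimit(resource.RLIMIT_AS)
print("virtual memory limit: soft=%s hard=%s" % (soft, hard))

# Raise the soft limit to the hard limit where they differ; RLIM_INFINITY
# means "unlimited", as on the batch nodes here.
if soft != hard:
    resource.setrlimit(resource.RLIMIT_AS, (hard, hard))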

litchfld commented 10 years ago

If we want to reset the production status of particular files, should we dive into the DB and do it ourselves, leave it to John, or is there a script? The affected runs are:

3288|F|litchfield_p|2014-07-11 15:11:48.404558|2014-07-11 15:15:02.260805
3289|F|litchfield_p|2014-07-11 15:11:48.466297|2014-07-11 15:14:54.326178
3290|F|litchfield_p|2014-07-11 15:11:48.512073|2014-07-11 15:14:56.360853
3291|F|litchfield_p|2014-07-11 15:11:48.553588|2014-07-11 15:15:09.431687
3292|F|litchfield_p|2014-07-11 15:11:48.599418|2014-07-11 15:15:02.582147
3293|F|litchfield_p|2014-07-11 15:11:48.646198|2014-07-11 15:15:02.668266
3294|F|litchfield_p|2014-07-11 15:11:48.683335|2014-07-11 15:14:56.510932
3295|F|litchfield_p|2014-07-11 15:11:48.720602|2014-07-11 15:13:47.235565
3322|F|litchfield_p|2014-07-11 15:11:48.757681|2014-07-11 15:14:53.417083
3323|F|litchfield_p|2014-07-11 15:11:48.924734|2014-07-11 15:15:02.117345
3324|F|litchfield_p|2014-07-11 15:13:54.709924|2014-07-11 15:14:25.453708
3326|F|litchfield_p|2014-07-11 15:14:32.505885|2014-07-11 15:16:42.501150
3327|F|litchfield_p|2014-07-11 15:15:01.927565|2014-07-11 15:17:10.692254

I discovered that git checkout rootana_v1.2 does not do what you expect if you don't start from the HEAD of master (see elog entry). I'm a bit suspicious about what the hell state the code was in when I started.

jrquirk commented 10 years ago

@benkrikler

On the grid you can request certain resources if you know you'll need them. At this point we don't know, and I think we should profile it on one of the bad runs. I have bad runs 2164 and 2166, so I'll try those. I don't have enough experience to know what ulimit does, but it might be worth a try.

@litchfld I'll make a script that will mark unfound files as Not complete.
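
Something along these lines would do the reset, assuming the production database is the SQLite file that the pipe-separated dump above suggests; the table and column names are guesses, not the real schema:

import sqlite3

def mark_not_complete(dbfile, runs, table="rootana_v1"):
    # Flip the listed runs back to 'N' (not complete) so the production
    # scripts pick them up again.  Table and column names are guesses.
    con = sqlite3.connect(dbfile)
    with con:   # commits on success, rolls back on error
        con.executemany("UPDATE %s SET status='N' WHERE run=?" % table,
                        [(r,) for r in runs])
    con.close()

# e.g. mark_not_complete("production.db", [3288, 3289, 3290, 3291, 3292])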

AndrewEdmonds11 commented 10 years ago

So, is it just coincidence that I didn't get any bad_allocs in my runs?

jrquirk commented 10 years ago

We don't know.

benkrikler commented 10 years ago

ulimit handles memory and CPU time limits, I think. I used it for a simulation which was throwing bad_alloc because we had enormous magnetic field maps being read in. I was able to get around it by increasing my virtual memory with ulimit -v. Unfortunately, both the batch system and our interactive sessions seem to have this set to maximum already. ulimit -a on the batch system gives:

core file size          (blocks, -c) unlimited
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191992
max locked memory       (kbytes, -l) unlimited
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) 86400
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

Also, I have a number of core files in my home directory with time stamps that look like they came from crashes last night:

$ ls -l ~/core.*
-rw------- 1 krikler_b unx-alcap 3038412800 Jul 10 18:32 core.11416
-rw------- 1 krikler_b unx-alcap 3073249280 Jul 10 18:32 core.11417
-rw------- 1 krikler_b unx-alcap 1776484352 Jul 10 20:20 core.23121
-rw------- 1 krikler_b unx-alcap 1642180608 Jul 10 20:24 core.23165
-rw------- 1 krikler_b unx-alcap 1899106304 Jul 10 20:27 core.23320

$ file core.11416 
core.11416: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/gpfs/home/krikler_b/AlcapDAQ/analyzer/rootana/rootana -i /gpfs/home/edmonds_a/'

Anyone know what these could be?

My crashed runs were 2153 and 2155. Is there an easy way to just run with them?

litchfld commented 10 years ago

@jrquirk

@litchfld I'll make a script that will mark unfound files as Not complete.

I'm not sure what the relevance of this is to anything I was talking about; are we talking at cross purposes?

jrquirk commented 10 years ago

If we want to reset the production status of particular files should we dive into the DB and do it, leave it to John or is there a script?

litchfld commented 10 years ago

Ah I see. No, these files were found and ran fine.

But I'm suspicious about the state of my code base when I ran them. The "--display" flag of run_production.py was not available, so the tag checkout can't have been 100% successful.

litchfld commented 10 years ago

After making sure I really do have tag rootana_v1.2, I get a weird error when I restart processing. After running for a short time (I think until the first job completes) the script dies with this trace:

Traceback (most recent call last):
  File "./run_production.py", line 198, in <module>
    runman.Finish(run)
  File "/gpfs/home/litchfield_p/alcap/AlcapDAQ/analyzer/batch/jobscripts/RunManager.py", line 198, in Finish
    os.rename(old_paths[ftype], new_paths[ftype])
OSError: [Errno 2] No such file or directory

Adding some debug statements, I found this happens when it is trying to move the file $data_dir/out/out03328.root => $data_dir/out/v1/out03328.root.

There's no record of any problems: the err log file is empty, and the out log file is missing, possibly because the script copies in the order [err_log, out, out_log] The input tree file for 3328 exists (processed by nam) and looks normal. I shall investigate further, but if anyone has any ideas...