Closed jrquirk closed 10 years ago
Has production started then or are you just testing? I haven't had a chance to sort out the plotting modules. I was about to do it now and hopefully it won't take me long so if we can hold off actually starting for a few hours that would be good.
I had actually started. But we can start again whenever. It's not a big time commitment to start it. All of a minute.
Ok. I'll be as quick as I can.
Here's a problem for us:
gAnalysedPulseMap
which we've been using to store the TAPs is still StringAnalPulseMap
. Somehow it was never changed to be a SourceAnalPulseMap
, meaning that the keys are just strings. Moreover, those strings are currently being filled with just the detector name so when we run with multiple generators we're overwriting the TAPs of the previous generator to look at a channel. That was the whole point of the IDs::source
class, yet we never actually switched to using them.
I'm going to change this over, but I'll have to go through all code that accesses gAnalysedPulseMap
and reflect this change. To be continued no doubt....
Facepalm moment coming:
We have no mechanism to write the TDetectorPulses or TMuonEvents to file right now, so these won't appear in the output files. I can probably copy over the TAnalysedPulseMapWrapper code quite quickly if we want. It's lightweight stuff so it shouldn't be too difficult.
I've gone through all code and swapped gAnalysedPulseMap
to be of SourceAnalPulseMap
type. Everything is compiling ok and seems to run fine, though it's hard to do a thorough check. I guess that's the point of the run.
The PulseViewer however is now having troubles because we need to specify a source string, but since we're not using this module for the production I'll leave it for now. Shouldn't be too hard to fix anyway.
I'll finish off the PlotMuScTDiff module, allow MakeAnalysedPulses to run with multiple default generators, quickly test it and then push everything.
Edit: I think an error like this when a file doesn't exist is okay, because it should've existed. Runs 3450-3454 are now marked as 'E'.
We do have trees 3450 - 3454...
Edit: Runs 3771 and 3772 are missing as well.
...and 3771 (not 3772)...
Edit: And 3770
...and 3770
What's happening here?
I've done the best I can, but there's been a series of bugs I've found with each one leading to another.
I've published the feature, but not yet merged it. Here's a summary of the changes I've had to make:
SourceAnalPulseMap
so we can really have multiple generators per channel and propagated this change through all code.MakeDetectorPulses
handles TAPs for multiple sources per channel properly and uses ID classes insteadSetupNavigator
can return the bank name for a given Channel IDIDs::channel
has a method to obtain the ID of the corresponding fast / slow channelI think this is now in a state to run, but unfortunately I've not been able to fix an inifite loop problem in the MaxTimeDiffDPgenerator
algorithm. That means we can't run with MakeDetectorPulses
and therefore not MakeMuonEvents
this time. I'll open another ticket for this problem to be solved in the future. It was previously not detected because it seems MakeDetectorPulses was not obtaining the pulse lists properly. EDIT: And as I said above there's no way of writing TDPs or TMEs to file yet so there's not much point in running with them unless it's to test timing / performance.
All of these problems would be easier if we had a functioning event navigator, hint Phill hint.
I'm going to merge all this on to develop if no one objects. Then we can resume production?
Well, then we'll have to merge it onto master and tag it (v1.1?). Then, looking at John's instructions, it's just a case of changing this:
./run_production.py --production=rootana --version=1
to this:
./run_production.py --production=rootana --version=1.1
And also working out what's wrong with the missing files...
I'll merge this into develop then but I should hold from tagging another production version while we sort the missing files?
I think so. I don't entirely understand what the problem is, so I think we should wait for John
A note on how the production looks for the files. The tree locations are saved in the production database (/gpfs/home/$USER/data/tree/$VERSION/$FILE) except this time where the VERSION just isn't a thing.
We do have trees 3450 - 3454...
I thought they'd be in Joe's directory and didn't bother checking any others. I found them now.
Well, then we'll have to merge it onto master and tag it (v1.1?)
The production scripts only allow integer version numbers, so this would be two in the production database. If a new version is started it will be version 2 (it's automatically incremented). We could change it, but I don't think it matters. To start a new production a single person should type
./run_production.py --production=rootana --new=git_software_tag --version=3
This would say start a new version of rootana production, making note of what software tag it uses, and use alcapana version 3 files. It will print out something like "Starting new version ##". Then, if ## is 2 (as it should be) people can run
./run_production.py --production=rootana --version=2
And this says continue with rootana production version 2. Here it's clear the version argument has a different meaning if the --new
argument is passed (which can also be changed if people want). Anyway, people can also just run
./run_production.py --production=rootana
and it will just look for the most recent version and run that. I didn't mention that before because I kind of regret putting that functionality in. The more problems we run into the more I realize @litchfld 's opinion of failing loudly is almost always preferred, and this means maybe not having ambiguous defaults.
Thanks, John. Let us know when we can restart production.
I've just merged my feature branch in and quickly tested it. It looks reasonable to me.
I also updated the production.cfg file to remove the MakeDetectorPulses and MakeMuonEvents modules, add in the plot modules and use only one MakeAnalysedPulses module. @jrquirk maybe check that it looks ok to you first?
I'll give the file a quick eyeball check. Stay tuned.
Also, of the files I thought were missing, Andy pointed out one actually is. I'm marking it as 'E' and we'll deal with it.
I'm also getting rid of the last production data table. So this next version once started will be version 1 in the production table, but will be tagged with v1.1 software version.
Could we add an integer sub-version column to the table? We had a couple of false starts with alcapana as well, so it might make life easier.
Yes, but I'm going to wait on that. I know I don't want version to be a float, but dot-separated ints like that are okay. This will take a few lines of code and a few tests.
Okay, it's ready. Run
$ git fetch; git checkout rootana_v1.1
$ ./run_production --production=rootana --version=1 --display
Display adds a curses display. You can also not use it by omitting the flag.
There will be errors, as always, so let's keep this open.
Don't forget to recompile rootana!
Which version of alcapana do we need to compile? The one in the data_quality directory? Or the $USER directory?
As long as they all provide the same version of TSetupData, TGlobalData and TPulseIsland I don't think it matters. EDIT What John says sounds safer.
Ok, I've managed to get it working. Will let you know if it crashes
When I try to run the display under Jim's account, I get some out-of-bounds exception. But it still works without the display flag.
I think I'm seeing the same error John described for Jim's account:
$ ./run_production.py --production=rootana --version=1 --display
File "./run_production.py", line 146, in <module>
screenman.SetProgram(production)
File "/gpfs/home/krikler_b/AlcapDAQ/analyzer/batch/jobscripts/ScreenManager.py", line 77, in SetProgram
self.Draw()
File "/gpfs/home/krikler_b/AlcapDAQ/analyzer/batch/jobscripts/ScreenManager.py", line 188, in Draw
self.screen.addstr("%-80s" % self.messages[self.print_msg])
IndexError: list index out of range
The spacing is as it appeared for me, so the line wrapping being funny suggests it's upsetting the columns. It also breaks readline so I need to do reset
to get control back of the terminal.
When you use ncurses, it's common to turn off "echo" in the shell. Upon successful exiting echoing is reenabled. Upon failed exiting echo is usually not. Hence why we always need to type "reset".
When you use ncurses, it's common to turn off "echo" in the shell. Upon successful exiting echoing is reenabled. Upon failed exiting echo is usually not. Hence why we always need to type "reset".
I learn something new every day on Alcap....
We're getting std::bad_allocs from we think the AmpvTDiff histos, so the offending module has been removed from production.cfg and version 1.2 of rootana has been tagged. I need to reset the production database now though. And then we can restart. The display problem is also fixed.
Do we need to do any particular cleanup? Delete the old output files etc?
Oh yeah. Remove the logs and output files in log/ and out/.
I'm still getting a couple bad_allocs. And jobs are sitting in the queue for over an hour and a half. I'm stopping for now. I think we should profile this some.
Ok. I've restarted.
Nice display BTW
I've now run out of space but I did manage to do 230 runs.
Which version of alcapana do we need to compile? The one in the data_quality directory? Or the $USER directory?
The one in your user directory.
As long as they all provide the same version of TSetupData, TGlobalData and TPulseIsland I don't think it matters.EDIT What John says sounds safer.
Now that we've stopped doing active development of alcapana perhaps it would make sense to put this stuff in a standard location rather than .../$USER/...
@AndrewEdmonds11 how many of your runs failed due to bad_alloc?
The only run with an error log with something in it is run 3257. Wasn't the bad_alloc error from v1.1?
Yes, but even so I still had 2 of the six or eight runs that actually made it through to the queue fail.
Ok. Here's what I had in the error log:
*** Break *** segmentation violation
===========================================================
There was a crash.
This is the entire stack trace of all threads:
===========================================================
#0 0x00000037fe2ac8be in waitpid () from /lib64/libc.so.6
#1 0x00000037fe23e909 in do_system () from /lib64/libc.so.6
#2 0x00002b3472fab998 in TUnixSystem::StackTrace (this=0x1e4ba50) at /gpfs/home/edmonds_a/AlcapDAQ/root/core/unix/src/TUnixSystem.cxx:2403
#3 0x00002b3472faa813 in TUnixSystem::DispatchSignals (this=0x1e4ba50, sig=kSigSegmentationViolation) at /gpfs/home/edmonds_a/AlcapDAQ/root/core/unix/src/TUnixSystem.cxx:1279
#4 <signal handler called>
#5 0x000000000046448b in IslandAmplitude::AfterLastEntry (this=0x2661b00, gData=<value optimized out>, setup=<value optimized out>) at src/data_quality/IslandAmplitude.cpp:157
#6 0x000000000041c7b5 in Main_event_loop (dataTree=<value optimized out>, arguments=<value optimized out>) at src/main.cpp:243
#7 0x000000000041ce0b in main (argc=<value optimized out>, argv=<value optimized out>) at src/main.cpp:153
===========================================================
The lines below might hint at the cause of the crash.
If they do not help you then please submit a bug report at
http://root.cern.ch/bugs. Please post the ENTIRE stack trace
from above as an attachment in addition to anything else
that might help us fixing this issue.
===========================================================
#5 0x000000000046448b in IslandAmplitude::AfterLastEntry (this=0x2661b00, gData=<value optimized out>, setup=<value optimized out>) at src/data_quality/IslandAmplitude.cpp:157
#6 0x000000000041c7b5 in Main_event_loop (dataTree=<value optimized out>, arguments=<value optimized out>) at src/main.cpp:243
#7 0x000000000041ce0b in main (argc=<value optimized out>, argv=<value optimized out>) at src/main.cpp:153
and the last few lines of stdout:
SiR1-3-F: using generator: CFTime with 1 option(s).
SiR1-3-F: using generator: SimpInt with 1 option(s).
ScGe: using generator: MaxBin with 1 option(s).
ScGe: using generator: CFTime with 1 option(s).
ScGe: using generator: SimpInt with 1 option(s).
SiR1-4-S: using generator: MaxBin with 1 option(s).
SiR1-4-S: using generator: CFTime with 1 option(s).
SiR1-4-S: using generator: SimpInt with 1 option(s).
SiR1-4-F: using generator: MaxBin with 1 option(s).
SiR1-4-F: using generator: CFTime with 1 option(s).
SiR1-4-F: using generator: SimpInt with 1 option(s).
-----IslandAmplitude::AfterLastEntry(): I'm debugging!
2 out of my 30 files bad_alloc
ed. One made it to at least event 1221 but the other didn't get much further than event 131. I wonder what out memory limits are? Can we use ulimit -v
to expand them perhaps?
If we want to reset the production status of particular files should we dive into the DB and do it, leave it to John or is there a script? Runs affected are:
3288|F|litchfield_p|2014-07-11 15:11:48.404558|2014-07-11 15:15:02.260805
3289|F|litchfield_p|2014-07-11 15:11:48.466297|2014-07-11 15:14:54.326178
3290|F|litchfield_p|2014-07-11 15:11:48.512073|2014-07-11 15:14:56.360853
3291|F|litchfield_p|2014-07-11 15:11:48.553588|2014-07-11 15:15:09.431687
3292|F|litchfield_p|2014-07-11 15:11:48.599418|2014-07-11 15:15:02.582147
3293|F|litchfield_p|2014-07-11 15:11:48.646198|2014-07-11 15:15:02.668266
3294|F|litchfield_p|2014-07-11 15:11:48.683335|2014-07-11 15:14:56.510932
3295|F|litchfield_p|2014-07-11 15:11:48.720602|2014-07-11 15:13:47.235565
3322|F|litchfield_p|2014-07-11 15:11:48.757681|2014-07-11 15:14:53.417083
3323|F|litchfield_p|2014-07-11 15:11:48.924734|2014-07-11 15:15:02.117345
3324|F|litchfield_p|2014-07-11 15:13:54.709924|2014-07-11 15:14:25.453708
3326|F|litchfield_p|2014-07-11 15:14:32.505885|2014-07-11 15:16:42.501150
3327|F|litchfield_p|2014-07-11 15:15:01.927565|2014-07-11 15:17:10.692254
I discovered the git checkout rootana_v1.2
does not do what you expect if you don't start from the HEAD of master (see elog entry). I'm a bit suspicious about what the hell state the code was in when i started.
@benkrikler
On the grid you can request certain resources if you know you'll need them. At this point we don't know, and I think we should profile it on one of the bad runs. I have bad runs 2164 and 2166, so I'll try to those. I don't have enough experience to know what ulimit
does, but it might be worth a try.
@litchfld I'll make a script that will mark unfound files as Not complete.
So, is it just coincidence that I didn't get any bad_allocs in my runs?
We don't know.
Ulimit handles memory and CPU timing parameters I think. I used it for a simulation which was throwing bad_alloc
because we had enormous magnetic field maps being read in. I was able to get around it by increasing my virtual memory with ulimit -v
. Unfortunately, both the batch system and our interactive sessions seem to have this set to maximum already. ulimit -a
on the batch system gives:
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 191992
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) unlimited
cpu time (seconds, -t) 86400
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
Also, I have a number of core
files in my home directory with time stamps that look like they came from crashes last night:
$ ls -l ~/core.*
-rw------- 1 krikler_b unx-alcap 3038412800 Jul 10 18:32 core.11416
-rw------- 1 krikler_b unx-alcap 3073249280 Jul 10 18:32 core.11417
-rw------- 1 krikler_b unx-alcap 1776484352 Jul 10 20:20 core.23121
-rw------- 1 krikler_b unx-alcap 1642180608 Jul 10 20:24 core.23165
-rw------- 1 krikler_b unx-alcap 1899106304 Jul 10 20:27 core.23320
$ file core.11416
core.11416: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/gpfs/home/krikler_b/AlcapDAQ/analyzer/rootana/rootana -i /gpfs/home/edmonds_a/'
Anyone know what these could be?
My crashed runs were 2153 and 2155. Is there an easy way to just run with them?
@jrquirk
@litchfld I'll make a script that will mark unfound files as Not complete.
I'm not sure what the relevance of this to anything I was talking about, are we talking at crossed purposes?
If we want to reset the production status of particular files should we dive into the DB and do it, leave it to John or is there a script?
Ah I see. No, these files were found and ran fine.
But i'm suspicious about the state of my code base when I ran them. The "--display" flag of run_production.py was not available so the tag checkout can't have been 100% sucessful
After making sure I really do have tag rootana_v1.2, I have a weird error when I restart processing. After running for a short time (i think until the first job completes) the script dies off with this trace:
Traceback (most recent call last):
File "./run_production.py", line 198, in <module>
runman.Finish(run)
File "/gpfs/home/litchfield_p/alcap/AlcapDAQ/analyzer/batch/jobscripts/RunManager.py", line 198, in Finish
os.rename(old_paths[ftype], new_paths[ftype])
OSError: [Errno 2] No such file or directory
Adding some debug statements I found this is when it is trying to move the file
$data_dir/out/out03328.root => $data_dir/out/v1/out03328.root
There's no record of any problems: the err log file is empty, and the out log file is missing, possibly because the script copies in the order [err_log, out, out_log] The input tree file for 3328 exists (processed by nam) and looks normal. I shall investigate further, but if anyone has any ideas...
Here I think we can walk through this first production together. The problem I'm fixing now:
It seems this tree file doesn't exist. So I'm marking them as having an error status (E) in the database.
Edit: I think an error like this when a file doesn't exist is okay, because it should've existed. Runs 3450-3454 are now marked as 'E'.
Edit: Runs 3771 and 3772 are missing as well.
Edit: And 3770