GRIFFINCollaboration / GRSISort

A lean, mean, sorting machine.
MIT License

Subrun sorting problem #675

Closed MikeBowry closed 8 years ago

MikeBowry commented 8 years ago

Standby for a lot of screenshots...

Recently we were taking background spectra in GRIFFIN at low energies. As the data rate is low (we only use one channel), only one subrun was saved to disk, since the subrun byte limit (2 GB) was not exceeded. This run sorts successfully on my local machine (screen 1).

When reading through the entries in the fragment tree for this run I get several warnings (screen 2), although my script executes and outputs a spectrum (screen 3). However, it looks like the warnings manifest as artefacts in the spectrum - for example, the spike at 20 keV.

So I reduced the subrun byte limit to between 1 and 2 MB, creating more subruns per run number. When I sort these new runs, two things happen: (1) GRSISort hangs (screen 4) or gives an error (screen 5), and this always seems to occur for the first or last subrun; (2) the subruns that sort successfully give me nice spectra without any artefacts.

This behaviour seems kind of inconsistent and basically means that the first/last subruns have to be thrown away, unless it is possible to reject the bad entries in the tree. Searching the issues on GitHub I didn't see anything relating to this specific problem. Has anyone seen this behaviour before?

Thanks in advance, Mike

[screenshots 1-5 attached]

MikeBowry commented 8 years ago

I should mention that I updated my local GRSISort from the repository on Tuesday (the 29th). However, I have replaced several files in both libraries/ and include/ with files from the grifuser workspace - for example, TDataParser (which is the most up-to-date version on grifuser).

My local machine was upgraded to CentOS. We are using ROOT 6.04.12.

VinzenzBildstein commented 8 years ago

I do remember seeing similar issues a while ago (with the fNevBuf warnings); those went away with re-sorting the file - did you try that? It's hard to tell what the issues with the many (small) subruns are. The last screenshot shows an error that comes from opening a root file that was created by a program that crashed. It's also worth noting that in this case you have 0/0 fragments, so something else went wrong during sorting. Could you try to re-sort this subrun without suppressing the errors?

VinzenzBildstein commented 8 years ago

I see the same issues with run 6340 using the new_fragment branch. I will investigate this further once I have some time to do that (maybe next week).

r3dunlop commented 8 years ago

So which version of the data parser are you using?


VinzenzBildstein commented 8 years ago

Sorry, forgot to mention that. It's with the new_fragment branch.

r3dunlop commented 8 years ago

Yeah, I was just curious about Mike’s version too.


r3dunlop commented 8 years ago

Also, can we get the error output without errors suppressed? It might help identify the problem in the Data Parser (if that’s the problem).


MikeBowry commented 8 years ago

@VinzenzBildstein Ok, thanks for checking. For the moment I can at least sort some of the data, so I can wait till next week.

That last subrun (6344_006) is weird. When I try to sort the .mid file, it gives the same output (i.e. "screen 5") whether or not the errors are suppressed. Perhaps the .mid file itself is corrupted somehow.

When sorting run 6340 with the errors unsuppressed, I get the output shown in screens 6-8 (for example), attached below.

@r3dunlop The version of the data parser I am using is the one stored on grifuser@grsmid01. I believe it is the version described by the GRIFFIN Event Format document recently updated in the Technical Documentation section by Jenna, but not yet merged with the main repository.

[screenshots 6-8 attached]

r3dunlop commented 8 years ago

My fear is that, with the CFD changes etc., the parser doesn't know what is coming at it. I don't know how much the newest parser has been tested.


MikeBowry commented 8 years ago

The latest parser has been partially tested, but there are some known issues (e.g. pileup identification) and some things that are still being developed (e.g. filters), so I guess this could be the problem.

Also, for the last set of runs I collected, around half failed to sort properly on the first try. However, subsequent re-sorts have recovered some of these files (as suggested). Right now, the chance of a successful/unsuccessful sort seems quite random.

r3dunlop commented 8 years ago

Sounds like a Q/thread problem.


pcbend commented 8 years ago

A bit late to the party. Other discussion aside, there is something wrong here: your midas file size is way too short. In some cases it only has room for the ODB information - for example, see the attached screenshot for 6344_006; all it contains is the begin-of-run information. From the way the code is set up, I believe that if I find the begin-of-run info at the start of a file, I assume it has at least one event in it - this should be a good assumption.

[screenshot attached]

pcbend commented 8 years ago

Also, what are the run numbers where there is a 50/50 sorting success ratio?

MikeBowry commented 8 years ago

There was a pattern (kind of) to the runs that sorted successfully/unsuccessfully on the first try:

bad

6,7,9,11,12,13,15,17,20,21,23,25,26,28,30,31

good

0-5,8,10,14,16,18,19,22,24,27,29,32,33

That makes a total of 34 subruns (0-33). So it seemed like, in places, every other subrun resulted in a failure.


pcbend commented 8 years ago

OK, so this is a bit tricky.

First, I assumed you are talking about run 6346, as it has 33 subruns. These subruns are roughly 5 MB, so they contain more than just a begin-of-run record. Next, these subruns are not failing randomly, which was the impression I got above - the ones you listed fail systematically, at least when I run things on the aquila computer, which has:

`gcc -v` == 4.8.5, `root-config --version` == 6.04/12

First, to get anything to parse, I had to replace three files on master: `$GRSISYS/include/TDataParser.h`, `$GRSISYS/libraries/TDataParser/TDataParser.cxx`, and `$GRSISYS/libraries/TGRSILoop/TGRSILoop.cxx`. The changes in these files relative to master appear minor at a glance; I did not investigate this too deeply.

With these changes, I can see the subruns listed as bad above hanging the system. I can, however, get them to sort/parse with no issue by making two edits to the code.

First, comment out the FragmentSelector bit in TGRSIint; it appears to be going into this section of code and not returning. Aside from this, I have a slightly different solution to get live histograms out while sorting - if people are interested, I can look at incorporating this into GRSISort.

Second, comment out TGRSIRunInfo::Streamer. My guess here is that there is some bogus data set in the run info that the streamer is unhappy about, and that streamer is a headache to read. Since we have a parser that reads an info file from text to set the run info, why are we not using the streamer to simply store a TString of the info in the root file and read it back through the parser on Read, as we do with TChannel? It keeps the streamer simpler and allows us to update the TGRSIRunInfo info without ever needing to edit the streamer.

I have taken no action to fix either of these issues in the code. The issues may only be relevant to the gcc/root versions on aquila, as I haven't encountered them anywhere else yet. If there is a desire for me to fix them, I can work on it - just let me know.

VinzenzBildstein commented 8 years ago

One thing I noticed is that I only get those errors when I use GetEnergy() or GetCharge() on the fragment tree. If I just plot fPulseHeight, I don't get any errors and the spectrum looks normal (no spikes at powers of 2).

r3dunlop commented 8 years ago

Are the integration numbers ok?


VinzenzBildstein commented 8 years ago

Yes, and plotting fPulseHeight/fIntLength gives a normal spectrum.

r3dunlop commented 8 years ago

Are you sure it's going through that if statement?


VinzenzBildstein commented 8 years ago

This whole thing seems to stem from a broken root file. I've written a short program that only processes fragments read from the tree where TTree::GetEntry() returns a positive number, and I don't get any spikes (even though I still get the warnings from above). So for some reason we sometimes create a broken root file. I've re-sorted the same run (6340) three times: the first time there were 7106 broken entries, the second time none, and on the third try there were 47966 broken entries. The last sort also gave errors during sorting:

Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191856, badread=0, nerrors=1, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191857, badread=0, nerrors=2, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191858, badread=0, nerrors=3, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191859, badread=0, nerrors=4, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191860, badread=0, nerrors=5, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191861, badread=0, nerrors=6, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191862, badread=0, nerrors=7, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191863, badread=0, nerrors=8, basketnumber=12
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191864, badread=0, nerrors=9, basketnumber=12
file probably overwritten: stopping reporting error messages
Error in TBranchElement::GetBasket: File: fragment06340_000.root at byte:0, branch:fMidasTimeStamp, entry:191865, badread=0, nerrors=10, basketnumber=12

r3dunlop commented 8 years ago

Do we have data race issues?


VinzenzBildstein commented 8 years ago

I've tried to use valgrind (with memcheck and helgrind) to determine the cause of this (memory leak or data race), but that was unsuccessful, because I never got any errors sorting this file under valgrind. This made me believe that our threading is the issue, so I tested turning off the thread that fills the scalers. This gave some weird numbers for the fragments that passed the sorting (more than 20000%), but I never got any errors from sorting the run this way (I tried it 3 times). So I put the thread back in, and after the first sort I got the errors again. The short-term solution (until I figure out exactly where the problem lies) is to go into TGRSILoop.cxx and comment out lines 102-103 (the ones with fFillScalerThread and fFillScalerThreadRunning).

VinzenzBildstein commented 8 years ago

Of course that means not having any of the scaler data in your fragment file.

VinzenzBildstein commented 8 years ago

Okay, this should be fixed with pull request #685. What I basically did was add a mutex inside TGRSILoop that locks access to TGRSIRootIO's FillFragmentTree and FillRateScalerTree, so that one cannot be called while the other is in use. I've sorted the run three times and never got any errors, so I hope this has fixed the issue.

We could also move this mutex into TGRSIRootIO itself, as long as no race can happen between the two TGRSIRootIO::Get() functions and the locking.

VinzenzBildstein commented 8 years ago

@MikeBowry, can you please confirm that this works for you?

VinzenzBildstein commented 8 years ago

Since there has been no response on this issue for almost a month, I'm going to assume it's resolved.