GRIFFINCollaboration / detectorSimulations

GEANT4 simulation code for the GRIFFIN array and its suite of ancillary detection systems.

Output Format #90


bkatiemills commented 10 years ago

So - the core topic of today's meeting was what data format(s) our simulations should be outputting. Key points:

Things We Agreed On

Some points were (more or less) unanimous, so we'll take them as final decisions to start working from:

There is still some debate on what information we can and/or should put where:

Debate is still completely open on this topic, but here is one possible scheme we can start perturbing from until we get an optimal solution:

I believe this scheme covers all users without carrying around extra dead weight. @carlu, the step ntuple contains all the information you need for detector development, in a format that should be familiar. @evan012345, I believe the TigFragment tree will ultimately be the most usable for students (and everyone else): this is what comes out of our sort code, so users will have to be able to write analyses that operate on this structure if they ever want to analyze real data.

Also keep in mind that a standard postprocessor that chews up TigFragments and spits out a simpler flat ntuple of meaningful physics parameters is a very viable product we could provide to help users interpret the output of both GRSISpoon and detectorSimulations.
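To make that concrete, here is a minimal sketch of what such a postprocessor could look like, reading a fragment tree and writing a flat ntuple. The tree and branch names (`FragmentTree`, `energy`, `detector`, `timestamp`) are hypothetical placeholders, not the actual GRSISpoon output:

```cpp
// Sketch only: the tree name "FragmentTree" and the branches "energy",
// "detector", and "timestamp" are hypothetical placeholders for whatever
// the sort code actually writes.
#include "TFile.h"
#include "TTree.h"
#include "TNtuple.h"

int main() {
   TFile* in = TFile::Open("simulated_fragments.root", "READ");
   TTree* frags = (TTree*)in->Get("FragmentTree");

   double energy, timestamp;
   int detector;
   frags->SetBranchAddress("energy", &energy);
   frags->SetBranchAddress("detector", &detector);
   frags->SetBranchAddress("timestamp", &timestamp);

   TFile* out = TFile::Open("flat_physics.root", "RECREATE");
   TNtuple nt("physics", "flat physics parameters", "energy:detector:time");

   for (Long64_t i = 0; i < frags->GetEntries(); ++i) {
      frags->GetEntry(i);
      // derived physics quantities (addback, Doppler correction, ...)
      // would be computed here before filling the flat ntuple
      nt.Fill(energy, detector, timestamp);
   }

   out->Write();
   out->Close();
   in->Close();
   return 0;
}
```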

Other auxiliary points that we discussed to include in this first-order approximation are:

This is a high-priority issue that needs to be resolved before we can move much further. I expect the collaboration to be able to reach consensus by the end of April at the very latest, whereupon we will adopt the best plan and move forward with it. Please propose and debate all changes in the comments, so we can keep track of everyone's input.

cc @AdamGarnsworthy @pcbend @damiller @christinaburbadge @moukaddam @evitts @r3dunlop

carlu commented 10 years ago

Hi everyone.

First off, I'm really sorry I missed the meeting. I was gain-matching BGOs and lost track of time. I don't enjoy gain-matching nearly as much as that suggests. Sorry to come in late with this, but I have an issue with one of your agreed points: "triggers should be downstream of the simulation package".

Consider the stable-beam TIP experiment S1232, which is about to begin. We have tons of beam and will be applying a really selective trigger to get the rate down to something the DAQ can handle; probably 2 CsI and 1 HPGe in coincidence will be required to trigger data readout. If we simulate this experiment without modelling the trigger, the output file will be 90+% full of events we would never see in the experiment. This would fill disk space and ultimately waste CPU and disk-access time while we parse the file and strip out all of those events to reveal the simulated events we're interested in. I think the generation of coincidences before the output file is written could be a valuable tool.
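For illustration, a minimal sketch of what such an in-simulation coincidence gate might look like, using the S1232 example condition above. The `Hit` record and detector-type enum are hypothetical stand-ins for whatever the simulation actually stores per event:

```cpp
// Sketch of an in-simulation trigger gate. The Hit record and detector
// types are hypothetical; the multiplicity thresholds follow the S1232
// example (2 CsI + 1 HPGe in coincidence).
#include <vector>

struct Hit { int detectorType; double energy; };  // hypothetical hit record
enum DetectorType { kCsI = 0, kHPGe = 1 };

// Return true if the event satisfies the example trigger:
// at least 2 CsI hits and at least 1 HPGe hit above threshold.
bool PassesTrigger(const std::vector<Hit>& hits,
                   double csiThreshold, double geThreshold) {
   int nCsI = 0, nGe = 0;
   for (const Hit& h : hits) {
      if (h.detectorType == kCsI  && h.energy > csiThreshold) ++nCsI;
      if (h.detectorType == kHPGe && h.energy > geThreshold)  ++nGe;
   }
   return nCsI >= 2 && nGe >= 1;
}
```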

I'm less concerned about adding energy resolution, and I agree that this could be added later without any hit on performance. One point, though: if we are going to want to look at Doppler broadening of gamma lines in TIGRESS runs, the peaks there will have a natural width imposed on them anyway. So even without random broadening there will be some width to the peaks; why not make it the correct width?
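For reference, the kinematic effect in question is just the relativistic Doppler shift; a minimal sketch, assuming `beta` and `theta` come from the simulated reaction kinematics:

```cpp
// Sketch: relativistic Doppler shift of a gamma ray emitted in flight.
// e0 is the rest-frame energy, beta = v/c of the emitting nucleus, and
// theta is the lab emission angle relative to the velocity; all three
// would come from the simulated reaction kinematics.
#include <cmath>

double DopplerShifted(double e0, double beta, double theta) {
   return e0 * std::sqrt(1.0 - beta * beta) / (1.0 - beta * std::cos(theta));
}
// The spread of theta across a detector face (and any spread in beta)
// is what produces the natural Doppler width of the peak.
```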

carlu commented 10 years ago

I agree that the ntuple output described by @BillMills will satisfy every application I can think of. The particular tasks I have in mind for the "true" G4 data are:

bkatiemills commented 10 years ago

@carlu the problem with applying cuts in simulation is that you create systematics that you are by definition blind to. You said the trigger for S1232 was 'probably' going to be xyz, but if those cuts are applied in Geant4 we will never know whether xyz was a good choice of trigger or not. Trigger studies such as this are (should be!) one of the main points of doing simulations for every experiment; other analysis goals can be carried out on reduced data after the trigger is applied, but not examining this major systematic for every experiment seems like a big mistake to me.

Also, while in principle users can just 'be careful' not to throw away data they might want to look at, in practice this just creates a loaded gun that people are going to shoot themselves with over and over again. Filtering data by a per-event trigger is by definition an O(N) process that ROOT is very good at and that is trivial to parallelize (see the sketch below); the high risk of needing to rerun simulations seems like too high a price to pay to avoid it.
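As a concrete example of that downstream filter, applying the trigger as a ROOT selection after the fact can be a one-liner over the tree. The tree name (`FragmentTree`) and counter branches (`nCsI`, `nGe`) are hypothetical:

```cpp
// Sketch of the downstream O(N) filter: apply the trigger as a ROOT
// selection after the simulation instead of inside Geant4. The tree name
// "FragmentTree" and branches "nCsI"/"nGe" are hypothetical placeholders.
#include "TFile.h"
#include "TTree.h"

int main() {
   TFile* in = TFile::Open("simulated_fragments.root", "READ");
   TTree* all = (TTree*)in->Get("FragmentTree");

   TFile* out = TFile::Open("triggered_fragments.root", "RECREATE");
   // keep only events passing the example S1232 condition; changing the
   // trigger means rerunning this filter, not the whole simulation
   TTree* triggered = all->CopyTree("nCsI >= 2 && nGe >= 1");

   triggered->Write();
   out->Close();
   in->Close();
   return 0;
}
```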