LDMX-Software / ldmx-sw

The Light Dark Matter eXperiment simulation and reconstruction framework.
https://ldmx-software.github.io
GNU General Public License v3.0

Break-Up Ecal Veto #1306

Open tomeichlersmith opened 3 years ago

tomeichlersmith commented 3 years ago

Edit: I've given this more thought as I chatted with @tvami on slack.

Currently, the EcalVetoProcessor is very hefty. Moreover, a lot of the variables calculated by the veto processor can be used by other analyses. With this in mind, my proposal is to break up the current veto processor into different processors that create different event bus objects.

  1. EcalHitStatistics - calculate various statistics of the hit collection that don't require other inputs. Things like energy-weighted average. These are relatively fast to calculate and don't require other inputs so could be just attached onto the end of the ECal reconstruction chain.
  2. SingleElectronShowerFeatures - the shower-shape features for the single-electron BDT that require additional inputs. Mainly things that depend on knowledge of the incident electron direction like containment. These are not slow themselves, but would require track reconstruction within the Recoil and thus should be separated so that we can pre-select the data on simpler statistics from (1) before taking the time to do track reconstruction.
  3. EcalMIPTracking - reconstruct straight and linear-regression tracks. These are helpful for other analyses (e.g. EaT), but could also be helpful for multi-electron analyses. In addition, keeping it separate would allow us to avoid running it on already-vetoed tracks (saving compute time) and would make it possible to develop an ACTS solution that can be swapped in and out with nothing more than a Python config change.
  4. SingleElectronTargetBDT - BDT for separating target-based dark brem from ecal PN. This is to separate it from other ML that could be focused on ecal-based dark brem (EaT), another ML method (DNN), or other analysis channels (multi-electron). Instead of taking a single, partially-formed class as input, it would take in several event objects (pretty much all of the objects produced by the above processors) and then create a new "result" object (or set of simpler objects) that only contains the decision and other metrics (like the inference value).
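To make item 1 concrete, here is a minimal sketch of the kind of hit-collection statistic meant by "energy-weighted average". It is plain Python rather than ldmx-sw code: the `Hit` class, the `ecal_hit_statistics` function, and the field names are all hypothetical stand-ins, not the existing event model.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for a reconstructed ECal hit (not the ldmx-sw class)."""
    x: float
    y: float
    layer: int
    energy: float


def ecal_hit_statistics(hits):
    """Compute statistics that need nothing but the hit collection itself."""
    total = sum(h.energy for h in hits)
    if total <= 0.0:
        # No deposited energy: weighted means are undefined, so report zeros.
        return {"n_hits": len(hits), "total_energy": 0.0,
                "mean_x": 0.0, "mean_y": 0.0, "mean_layer": 0.0}
    return {
        "n_hits": len(hits),
        "total_energy": total,
        # Energy-weighted averages: sum(value * energy) / sum(energy)
        "mean_x": sum(h.x * h.energy for h in hits) / total,
        "mean_y": sum(h.y * h.energy for h in hits) / total,
        "mean_layer": sum(h.layer * h.energy for h in hits) / total,
    }


hits = [Hit(0.0, 0.0, 1, 1.0), Hit(4.0, 8.0, 5, 3.0)]
stats = ecal_hit_statistics(hits)
```

Because nothing here depends on the recoil tracker or on MIP tracking, a processor built around a function like this could sit directly at the end of the ECal reconstruction chain, as proposed in item 1.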

Hopefully, breaking up the ecal veto in this way will make it more maintainable. This will also require some additions to the event model, but I can hold off on removing the EcalVetoResult object until we are comfortable with breaking on-disk backwards compatibility.

bryngemark commented 1 year ago

+1 on this and in particular, MIP tracking should be pulled out and not run as part of every event (it's a relatively time consuming algorithm and is intended to be used on a subset consisting of tricky events). If it's a separate processor people are free to use it on any set of events they want.

danyi211 commented 6 months ago

@tomeichlersmith @bryngemark @vdutta The new version of the BDT will include the number of straight tracks from MIP tracking in the feature list, so it might be tricky to separate the MIP tracking out. The only way we could imagine is to train two versions of the BDT, one with the number of straight tracks and one without. But then the usual way of referring to the number of events after the BDT cut will be ambiguous. The time to evaluate two BDTs might also be on the same scale as running MIP tracking in the first place.

tomeichlersmith commented 6 months ago

I'm not saying folks who want to run the BDT should not be required to run MIP tracking; what I'm saying is that there are a lot of folks who just use the shower features (like me for instance).

Perhaps the new BDT would require MIP tracking. In that case, the BDT processor would require the MIP Tracking processor to have been run before it. Perhaps another BDT does not require the MIP tracking, and so that other BDT processor would not require the MIP Tracking processor to run before it. Factorizing the code in this way lets people opt in to certain requirements if they want. Does that make sense?
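A rough sketch of what "opting in to requirements" could look like, assuming each processor declares which event bus objects it reads and which it adds. This is a toy in plain Python, not the ldmx-sw Framework API: the `requires`/`produces` attributes, `validate_sequence`, and all object names (`EcalRecHits`, `EcalMipTracks`, `FastBDT`, `MipBDT`, ...) are made up for illustration.

```python
class EcalHitStatistics:
    requires = ("EcalRecHits",)
    produces = ("EcalHitStatistics",)


class EcalMIPTracking:
    requires = ("EcalRecHits",)
    produces = ("EcalMipTracks",)


class FastBDT:
    # Hypothetical BDT that only uses the simple hit statistics.
    requires = ("EcalHitStatistics",)
    produces = ("FastBDTResult",)


class MipBDT:
    # Hypothetical BDT that also takes MIP track counts as features.
    requires = ("EcalHitStatistics", "EcalMipTracks")
    produces = ("MipBDTResult",)


def validate_sequence(sequence, on_disk=("EcalRecHits",)):
    """Check that every processor's inputs are produced earlier in the sequence."""
    available = set(on_disk)
    for proc in sequence:
        missing = sorted(set(proc.requires) - available)
        if missing:
            raise ValueError(
                f"{proc.__name__} needs {missing}, which no earlier processor produces"
            )
        available.update(proc.produces)
    return available
```

With this layout, `[EcalHitStatistics, FastBDT]` is a valid sequence without any MIP tracking, while `[EcalHitStatistics, MipBDT]` fails until `EcalMIPTracking` is placed before `MipBDT`.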

bryngemark commented 6 months ago

as a side note, how is it that the BDT requires MIP tracking nowadays? when it was developed, MIP tracking rejected the last 10 events that the BDT didn't already reject. is it entirely unthinkable to have a separate MIP-tracking based veto step (perhaps in the context of a BDT if that's needed) that is only applied after a first BDT fails to reject an event?

tomeichlersmith commented 6 months ago

The number of straight and the number of linear-regression MIP tracks are included as feature inputs to the newer BDTs. I think I agree with you that a "fast" BDT would be nice to have since the MIP tracking is time consuming especially on events with a lot of hits that are going to be rejected anyways for other easier reasons.

This is another reason (from my point of view) to factorize: different BDTs and selections can then be more explicit about their requirements.
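The "fast BDT first, MIP tracking only on what survives" idea from the last few comments could be sketched roughly as below. Everything here is an assumption for illustration: the function names, the cut values, and the convention that a score below the cut means "veto" are not taken from ldmx-sw, and the toy scoring functions stand in for real BDT inference.

```python
def staged_veto(event, fast_bdt, mip_tracking, mip_bdt, fast_cut=0.5, mip_cut=0.5):
    """Return True if the event is vetoed.

    Assumed convention: a score below the cut is background-like.
    """
    if fast_bdt(event) < fast_cut:
        return True  # rejected cheaply; MIP tracking never runs for this event
    n_straight = mip_tracking(event)  # expensive step, only for survivors
    return mip_bdt(event, n_straight) < mip_cut


# Toy stand-ins so the control flow can be exercised.
calls = {"mip_tracking": 0}


def toy_fast_bdt(event):
    return event["fast_score"]


def toy_mip_tracking(event):
    calls["mip_tracking"] += 1
    return event["n_straight"]


def toy_mip_bdt(event, n_straight):
    # Toy rule: any straight track makes the event background-like.
    return 0.0 if n_straight > 0 else 1.0


events = [
    {"fast_score": 0.1, "n_straight": 0},  # vetoed by the fast stage
    {"fast_score": 0.9, "n_straight": 2},  # passes fast stage, vetoed after tracking
    {"fast_score": 0.9, "n_straight": 0},  # passes both stages
]
decisions = [staged_veto(e, toy_fast_bdt, toy_mip_tracking, toy_mip_bdt) for e in events]
```

In this toy run the tracking step is called for two of the three events, which is the point of the staging: events that are easy to reject never pay for MIP tracking.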

danyi211 commented 6 months ago

The BDT will use the number of straight tracks from MIP tracking (not linear regression tracks).

We are considering breaking up EcalVeto into three processors:

I understand it could be helpful to have a simple BDT without MIP tracks. We can try to compare the performance with and without the straight tracks + additional MIP track selections.

tvami commented 6 months ago

how is it that the BDT requires MIP tracking nowadays?

that's the "mip" in "segmip" :P

I think our plan is to have:

tvami commented 6 months ago

(sorry Danyi, I think we typed at the same time! [but at least we are saying the same thing :D ])

bryngemark commented 6 months ago

thanks all for your patience, I think I was a little sloppy -- I understand and am aware that it is included, I was mostly wondering why it was considered necessary.

but anyways! it sounds like we have a good path ahead, I think the suggested split sounds really great and will cover the different needs really well.

tvami commented 1 month ago

I ran a version where MIP tracking is set to -1 for the BDT input in both signal and bkg (https://github.com/LDMX-Software/ldmx-sw/actions/runs/11320865028). The results are not terrible, but clearly I need a higher-statistics study.

Signal: EcalVetoResults_EcalVetoResults_bdt_disc.pdf

Kaons: EcalVetoResults_EcalVetoResults_bdt_disc.pdf

EcalPN: EcalVetoResults_EcalVetoResults_bdt_disc.pdf
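For anyone wanting to reproduce this kind of test, masking a BDT input with a sentinel value could look roughly like the sketch below. It is plain Python with hypothetical feature names (`n_straight_tracks`, `n_readout_hits`, `summed_energy`) and a hypothetical `mask_features` helper; it is not the code used in the run linked above.

```python
MASKED = -1  # sentinel value, matching the "-1" used in the test described above


def mask_features(features, names, value=MASKED):
    """Return a copy of the feature dict with the named inputs set to a sentinel."""
    unknown = [n for n in names if n not in features]
    if unknown:
        raise KeyError(f"unknown feature(s): {unknown}")
    masked = dict(features)  # copy, so the original features stay untouched
    for name in names:
        masked[name] = value
    return masked


# Hypothetical feature names and values, for illustration only.
features = {"n_readout_hits": 42, "summed_energy": 3150.0, "n_straight_tracks": 3}
bdt_input = mask_features(features, ["n_straight_tracks"])
```

The same masking has to be applied to signal and background alike, otherwise the sentinel itself becomes a discriminating feature.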