Meaning of signal MC - Githubissues

ktht commented 5 years ago

I stumbled upon this line while working in the master102x branch: https://github.com/HEP-KBFI/hh-multilepton/blob/951ef905a34e5e4597b6a004f4b477dcc58af857/bin/analyze_hh_3l.cc#L254 There is no "signal" category in the HH analysis; it's a relic from the ttH analysis, where we categorized ttH MC as "signal". In the context of this analysis, the HH samples have different category names but they all start with "signal_": https://github.com/HEP-KBFI/hh-multilepton/blob/951ef905a34e5e4597b6a004f4b477dcc58af857/python/samples/hhAnalyzeSamples_2017.py#L34 ttH MC is recategorized as "TTH": https://github.com/HEP-KBFI/hh-multilepton/blob/951ef905a34e5e4597b6a004f4b477dcc58af857/python/samples/hhAnalyzeSamples_2017.py#L47-L48

The thing is, we don't even have HH decay modes available in NanoAOD. Is there any interest to create histograms split by HH decay modes? I could implement something in the post-processing step. Otherwise, I'd vote for removing everything related to single Higgs decay modes.

Btw, this issue applies to bbww analysis as well, but I prefer to keep the discussion in one place.

P.S. Next time, just write:

bool isSignal = process_string == "signal";

acarvalh commented 5 years ago

The HH MC is separated by H decay modes for bbww and 4T, so in this sense, the separation for Higgs BR is naturally covered. For 4V and 2V2T we may need to separate WW/ZZ indeed (we can check if one dominates too much over another, I would leave this point open up to test that).

BTW, not even in ttH we should have a "signal", now that we have 3 signals. For "signal" as for ttH, this change should be coordinated with the config with the list of samples.

On that subject, note that the non-resonant HH should not be separated by "node", but taken as "signal_nonresonant_XXYY", where XXYY is the HH decay mode.

ktht commented 5 years ago

> The HH MC is separated by H decay modes for bbww and 4T, so in this sense, the separation for Higgs BR is naturally covered. For 4V and 2V2T we may need to separate WW/ZZ indeed (we can check if one dominates too much over another, I would leave this point open up to test that).

Ah, yes, you're right. So we can remove isSignal variable and whatever depends on it.

edit: sorry, I read your older text. I think we can easily compute the decay modes here. And instead of checking the entire string process_string, we can check if it begins with signal.
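A minimal sketch of the prefix check suggested above, assuming the category names are plain std::string values. The helper names startsWith and isSignalProcess are hypothetical and do not come from the repository:

```cpp
#include <string>

// Returns true if `text` starts with `prefix`.
// rfind(prefix, 0) can only match at position 0, so this behaves like a
// pre-C++20 replacement for std::string::starts_with.
bool startsWith(const std::string & text, const std::string & prefix)
{
  return text.rfind(prefix, 0) == 0;
}

// Hypothetical helper: any category that begins with "signal" counts as signal MC.
bool isSignalProcess(const std::string & process_string)
{
  return startsWith(process_string, "signal");
}
```

With this, both "signal" and "signal_ggf_nonresonant_hh_tttt" are accepted, while "TTH" is not. Note that the check would also accept any unrelated name that merely begins with "signal"; testing for the prefix "signal_" instead would be stricter, at the cost of no longer matching the bare "signal".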

> BTW, not even in ttH we should have a "signal", now that we have 3 signals. For "signal" as for ttH, this change should be coordinated with the config with the list of samples.

That's not exactly true. All signal samples are categorized as "signal" in ttH, but different analysis modes (default, forBDTtraining and coupling_study) select different signal samples.

> On that subject, note that the non-resonant HH should not be separated by "node", but taken as "signal_nonresonant_XXYY", where XXYY is the HH decay mode.

So, you're saying that e.g.

signal_ggf_nonresonant_node_sm_hh_tttt
signal_ggf_nonresonant_node_box_hh_tttt
signal_ggf_nonresonant_node_2_hh_tttt
...
signal_ggf_nonresonant_node_12_hh_tttt

should all be categorized as signal_ggf_nonresonant_hh_tttt?
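For illustration only, the renaming asked about above could be sketched as follows, assuming the node label is a single alphanumeric token that follows "_node_". The function name mergeNodes is hypothetical:

```cpp
#include <regex>
#include <string>

// Hypothetical helper: drops the "_node_<label>" part of a category name, so that
// all non-resonant nodes of one decay mode end up with the same category name.
std::string mergeNodes(const std::string & category)
{
  // The label is assumed to contain only letters and digits (sm, box, 2, ..., 12),
  // so the match stops at the next underscore.
  static const std::regex node_re("_node_[A-Za-z0-9]+");
  return std::regex_replace(category, node_re, "");
}
```

Category names without a "_node_" part are returned unchanged, so the same function could be applied to resonant samples as well.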

acarvalh commented 5 years ago

> > The HH MC is separated by H decay modes for bbww and 4T, so in this sense, the separation for Higgs BR is naturally covered. For 4V and 2V2T we may need to separate WW/ZZ indeed (we can check if one dominates too much over another, I would leave this point open up to test that).

> Ah, yes, you're right. So we can remove isSignal variable and whatever depends on it.

> edit: sorry, I read your older text. I think we can easily compute the decay modes here. And instead of checking the entire string process_string, we can check if it begins with signal.

Maybe the tagging in HH is a bit more complicated:

We want to keep a tag for HH (that can be named as you wish) to separate:

gen_mhh is calculated only for HH MC (resonant and nonresonant)
reweighting/additional histograms for different weights are only calculated for HH nonresonant MC

And a tag for decay rates, which are only important in the "multilepton" part. But that still depends on the sample:

4V = 4W / 2W2Z / 4Z
2V2T = 2W2T / 2Z2T

> > BTW, not even in ttH we should have a "signal", now that we have 3 signals. For "signal" as for ttH, this change should be coordinated with the config with the list of samples.

> That's not exactly true. All signal samples are categorized as "signal" in ttH, but different analysis modes (default, forBDTtraining and coupling_study) select different signal samples.

We select different samples, that can be named "signal" for tradition, or "ttH" for clarity.

For limits setting, practically, we use: ttH_hxx / tHq_hxx / tHW_hxx all set to signal (= negative in the .txt datacard), independent if a coupling study is being made or not. I am not even sure why we save a histogram with those processes without the decays up to prepareDatacards level, there should be a reason.

For forBDTtraining it really does not matter, as we just use the flat tree.

> > On that subject, note that the non-resonant HH should not be separated by "node", but taken as "signal_nonresonant_XXYY", where XXYY is the HH decay mode.

> So, you're saying that e.g.
> signal_ggf_nonresonant_node_sm_hh_tttt
> signal_ggf_nonresonant_node_box_hh_tttt
> signal_ggf_nonresonant_node_2_hh_tttt
> ...
> signal_ggf_nonresonant_node_12_hh_tttt
> should all be categorized as signal_ggf_nonresonant_hh_tttt?

Yes, you will understand that better when you review the reweighting implementation, whose description I sent by mail. If we need to continue this part of the conversation, a follow-up in that mail thread or on Skype seems a better channel.

ktht commented 5 years ago

> For limits setting, practically, we use: ttH_hxx / tHq_hxx / tHW_hxx all set to signal (= negative in the .txt datacard), independent if a coupling study is being made or not.

I was talking about how samples are categorized in sample dictionaries, not how they're split by decay modes at the analysis level.

> I am not even sure why we save a histogram with those processes without the decays up to prepareDatacards level, there should be a reason.

Probably for historical reasons. We could remove these histograms if it would speed up the workflow.

Anyways, I defined the following decay modes to handle the ambiguities in HH multilepton decay modes:

  { "tttt",       15 }, // H -> 4tau
  { "zzzz",       23 }, // H -> 4Z
  { "wwww",       24 }, // H -> 4W
  { "ttzz", 15000023 }, // H -> 2tau 2Z
  { "ttww", 15000024 }, // H -> 2tau 2W
  { "zzww", 23000024 }, // H -> 2Z 2W
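The three mixed-mode codes above look like two PDG IDs packed into one integer (15000023 = 15 * 1000000 + 23). The thread does not state this rule, so the following is only a sketch under that assumption; the names kPdgShift, encodeDecayMode and decodeDecayMode are hypothetical:

```cpp
#include <cassert>
#include <utility>

// Assumed packing factor, inferred from the codes 15000023, 15000024 and 23000024.
constexpr int kPdgShift = 1000000;

// Packs two distinct PDG IDs into one code, smaller ID first.
int encodeDecayMode(int pdgId1, int pdgId2)
{
  assert(pdgId1 > 0 && pdgId1 < pdgId2 && pdgId2 < kPdgShift);
  return pdgId1 * kPdgShift + pdgId2;
}

// Unpacks a code into (first, second). For the single-ID codes (15, 23, 24),
// which mark both Higgs bosons decaying to the same final state, first is 0.
std::pair<int, int> decodeDecayMode(int code)
{
  return { code / kPdgShift, code % kPdgShift };
}
```

Under this assumption the largest code in the table, 23000024, stays well within the range of a 32-bit int.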

Also, can we merge non-resonant VBF samples, too?

acarvalh commented 5 years ago

Karl Ehatäht notifications@github.com wrote on Sat, 8/06/2019 at 22:23:

> > For limits setting, practically, we use: ttH_hxx / tHq_hxx / tHW_hxx all set to signal (= negative in the .txt datacard), independent if a coupling study is being made or not.

> I was talking about how samples are categorized in sample dictionaries, not how they're split by decay modes at the analysis level.

I know; this is just to illustrate that at the analysis level it is just a tag, and that there is not much deeper meaning in choosing the tag to be "signal" or "ttH". Not important anyway.

> > I am not even sure why we save a histogram with those processes without the decays up to prepareDatacards level, there should be a reason.

> Probably for historical reasons. We could remove these histograms if it would speed up the workflow.

I vote for that; I can remove them the next time I run the ttH analysis channels (Monday). The same logic would apply to HH (for all the relevant samples). We can ask whoever runs the multilepton and bbww channels next to follow the example.

> Anyways, I defined the following decay modes to handle the ambiguities in HH multilepton decay modes:

>   { "tttt",       15 }, // H -> 4tau
>   { "zzzz",       23 }, // H -> 4Z
>   { "wwww",       24 }, // H -> 4W
>   { "ttzz", 15000023 }, // H -> 2tau 2Z
>   { "ttww", 15000024 }, // H -> 2tau 2W
>   { "zzww", 23000024 }, // H -> 2Z 2W

> Also, can we merge non-resonant VBF samples, too?

Perfect, yes. I am not sure we need to book all five types for all samples. We may still want to use tags to book only the necessary ones for each sample.


veelken commented 5 years ago

Hi,

the histograms with the name "signal" were used in the 2016 version of the ttH analysis. I confirm that we don't use it anymore, but I don't see that it would cost us much to keep these histograms.

For the HH analysis, what we need is:

histograms for different HH decay modes need different names. It is fine if we use different "sample_category" values in the hhAnalyzeSample.py file for that purpose, since, as Xandra mentioned, each HH decay mode is contained in a separate MC sample
I believe that histograms for different non-resonant "nodes" (benchmark scenarios for couplings) need to have different names, so that they can still be distinguished in the hadd_stage2.root file
I believe that we need to keep the bool isSignal flag, because gen_mHH is taken from MC truth for signal MC samples, while for background MC samples, the BDT output needs to be computed for all gen_mHH values for which we have signal MC samples
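The role of the isSignal flag described in the last point can be sketched as below. This is an assumption-laden illustration, not the analysis code: the function name massHypotheses and its arguments are hypothetical, and the list of mass points would come from whatever signal MC samples are available.

```cpp
#include <vector>

// Hypothetical helper: returns the gen_mHH values at which the BDT output
// should be computed for one event.
std::vector<double> massHypotheses(bool isSignal,
                                   double gen_mHH_truth,
                                   const std::vector<double> & signal_mass_points)
{
  if(isSignal)
  {
    // Signal MC: use the single gen_mHH value taken from MC truth.
    return { gen_mHH_truth };
  }
  // Background MC: evaluate the BDT at every gen_mHH value for which signal MC exists.
  return signal_mass_points;
}
```

A signal event therefore yields one BDT evaluation, while a background event yields one evaluation per available signal mass point.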

Cheers,

Christian

HEP-KBFI / hh-multilepton

Meaning of signal MC #1