cms-analysis / flashgg

20 stars 158 forks source link

L2 Mass regression integration #1217

Closed panwarlsweet closed 3 years ago

panwarlsweet commented 4 years ago
simonepigazzini commented 4 years ago

@panwarlsweet you should first merge from the dev_legacy_runII branch, fix conflict and then push a PR

simonepigazzini commented 4 years ago

Please use an editor that preserve the indentation (we have automatic indentation setup for both emacs and vim).

The year customization logic is wrong, it should be done through MetaCondition not through categorical variables

panwarlsweet commented 4 years ago

Hello @simonepigazzini

I am working on your feedback. Since I have resolved all conflicts and pulled all changes from central flashgg and tied to test the setup. But it gives some error I used command: python flashgg/Systematics/test/workspaceStd.py metaConditions=/afs/cern.ch/work/l/lata/HHbbgg_analysis/testing_devlegacy/CMSSW_10_6_8/src/flashgg/MetaData/data/MetaConditions/Era2016_RR-17Jul2018_v1.json

and error it gives: Here we account for L1 pre-firing. We will only change the central diphoton weight if it is an appropriate year (2016 or 2017), an appropriate sample (only MC, not data), and the applyToCentral flag is set to true Background MC, so store mgg and central only --- Systematics with independent collections --- ['']

--- Variables to be dumped, including systematic weights --- ['CMS_hgg_mass[160,100,180]:=diPhoton().mass', 'sigmaRV:=diPhotonMVA().sigmarv', 'sigmaMoM_decorr:=diPhotonMVA().decorrSigmarv']

Traceback (most recent call last): File "flashgg/Systematics/test/workspaceStd.py", line 544, in filterHLTrigger(process, customize) File "/afs/cern.ch/work/l/lata/HH_bbyy/CMSSW_10_6_8/python/flashgg/Systematics/SystematicsCustomize.py", line 387, in filterHLTrigger if re.match(regDset, options.datasetName()): File "/cvmfs/cms.cern.ch/slc7_amd64_gcc700/external/python/2.7.14-pafccj/lib/python2.7/re.py", line 141, in match return _compile(pattern, flags).match(string) TypeError: expected string or buffer

I also tried the same with setup from central flashgg and got same error. Could you please help how to fix this?

simonepigazzini commented 4 years ago

@panwarlsweet I think you should at least specify the dataset you want to run on in order for workspaceStd to work properly.

simonepigazzini commented 4 years ago

python Systematics/test/workspaceStd.py metaConditions=MetaData/data/MetaConditions/Era2016_RR-17Jul2018_v1.json dataset=/DoubleEG/spigazzi-Era2016_RR-17Jul2018_v2-legacyRun2FullV1-v0-Run2016B-17Jul2018_ver2-v1-86023db6be00ee64cd62a3172358fb9f/USER campaign=Era2016_RR-17Jul2018_v2

this is the minimal you can run with. Still it is much better to run through fggRunJobs, if you just want to run a local job with python instead of cmsRun you can specify -x python -n 1 and no queue.

panwarlsweet commented 4 years ago

python Systematics/test/workspaceStd.py metaConditions=MetaData/data/MetaConditions/Era2016_RR-17Jul2018_v1.json dataset=/DoubleEG/spigazzi-Era2016_RR-17Jul2018_v2-legacyRun2FullV1-v0-Run2016B-17Jul2018_ver2-v1-86023db6be00ee64cd62a3172358fb9f/USER campaign=Era2016_RR-17Jul2018_v2

this is the minimal you can run with. Still it is much better to run through fggRunJobs, if you just want to run a local job with python instead of cmsRun you can specify -x python -n 1 and no queue.

Thanks for this suggestion @simonepigazzini . Actully first I tried with fggRunjobs but it didn't work because of x-sec json file which you have fixed two days ago. Then for a simple test I used python command.

After pulling the latest changes from flashgg (which fixed that json file) my fggrunjobs.py works fine.

panwarlsweet commented 4 years ago

@simonepigazzini

as suggested during Today's meeting, I have impelented everything. Kindly have a look at the PR and and if you find everything fine then please merge it.

I have already tested code for MC and Data for trees and workspaces both. It works fine. I also showed a validation plot in today'e meeting

https://indico.cern.ch/event/922777/contributions/3920714/attachments/2063685/3462709/L2_Regression_in_flashgg.pdf

simonepigazzini commented 4 years ago

I am away at the moment, I will look into this when I get back next week.

simone

panwarlsweet commented 4 years ago

@simonepigazzini when you review the PR again, could you please have a look at the last two commits (c559f69 and 930fdd1) and varify that removing these additional Mjj cuts is good since I don't find it feasible to apply Mjj cuts 3 times within same event. There is mjj cut we already apply on mjj before selecting btagged pair and that is good enough.

nadya-chernyavskaya commented 4 years ago

@simonepigazzini when you review the PR again, could you please have a look at the last two commits (c559f69 and 930fdd1) and varify that removing these additional Mjj cuts is good since I don't find it feasible to apply Mjj cuts 3 times within same event. There is mjj cut we already apply on mjj before selecting btagged pair and that is good enough.

@panwarlsweet Lata, just please make sure it is done in exactly the same way for VBF DoubleHTag producer and DoubleHTag producer.

panwarlsweet commented 4 years ago

@chernyavskaya yes I have checked properly and tested setup. It is same for both VBFHH and HH Tagger producer

nadya-chernyavskaya commented 4 years ago

@panwarlsweet

Please make sure that in the input variables for the training (both ggHH and VBFHH) are used consistently : e.g. MX, sigmaMjj/Mjj should use corrected Mjj values. In addition, please check that categorization (again for both ggHH and VBFHH) is done on corrected MX. Thanks.

panwarlsweet commented 4 years ago

@chernyavskaya

The main impact of the regression is to improve Mjj resolution in Likelihood fitting. so, it would have no visible effect anywhere else. In that sense this PR is correct and self consistent.

The changes you propose would have a cosmetic effect and require retraining of the MVA's. Please let me know when you perform this retraining and I would be happy to create an other PR for the changes you have mentioned.

nadya-chernyavskaya commented 4 years ago

@chernyavskaya

The main impact of the regression is to improve Mjj resolution in Likelihood fitting. so, it would have no visible effect anywhere else. In that sense this PR is correct and self consistent.

The changes you propose would have a cosmetic effect and require retraining of the MVA's. Please let me know when you perform this retraining and I would be happy to create an other PR for the changes you have mentioned.

@panwarlsweet As HIG conveners said, you should study the effect on the retraining. However, concerning the code, the retraining part is irrelevant. Here I gave the comments that are imperative to have the code self-consistent. And it is needless to say that the code must be self-consistent to be merged in the master branch. The code will not be merged unless the variables are used in a consistent way, e.g as it is done with DNN-regression, and everything else that was developed for flashgg.

The small improvement of Mjj resolution might not have an effect on anything else, but if this will eventually be included in the publication, the regressed Mjj must be used for calculation of the variables depending on it. I believe it should take around 30 min to implement those changes and check everything.

JyothsnaKomaragiri commented 4 years ago

@chernyavskaya The main impact of the regression is to improve Mjj resolution in Likelihood fitting. so, it would have no visible effect anywhere else. In that sense this PR is correct and self consistent. The changes you propose would have a cosmetic effect and require retraining of the MVA's. Please let me know when you perform this retraining and I would be happy to create an other PR for the changes you have mentioned.

@panwarlsweet As HIG conveners said, you should study the effect on the retraining. However, concerning the code, the retraining part is irrelevant. Here I gave the comments that are imperative to have the code self-consistent. And it is needless to say that the code must be self-consistent to be merged in the master branch. The code will not be merged unless the variables are used in a consistent way, e.g as it is done with DNN-regression, and everything else that was developed for flashgg.

The small improvement of Mjj resolution might not have an effect on anything else, but if this will eventually be included in the publication, the regressed Mjj must be used for calculation of the variables depending on it. I believe it should take around 30 min to implement those changes and check everything.

@chernyavskaya this PR is done in a self-consistent way. What you suggest, needs new training and categorization with corrected Mjj and MX.

Note, conveners did not ask us about retraining [1]. Instead, they are willing to keep L2 integration without retraining. It is a totally minor effect on training and categorization using corrected MX and Mjj.

This PR we have made to study the improvement with existing training and categorization. We don't think it is the correct way to have a training file with uncorrected MX and Mjj and fill MVA output and do categorization using corrected MX and Mjj.

When you have new set of training files, we could make the changes you have suggested with another PR.

[1] https://hypernews.cern.ch/HyperNews/CMS/get/HIG-19-018/36/1/1.html

JyothsnaKomaragiri commented 4 years ago

@simonepigazzini thanks for approving the changes.

nadya-chernyavskaya commented 4 years ago

@JyothsnaKomaragiri The HIG conveners explicitly asked : "One important input is what is the gain in performance deploying the regression in the analysis as is (just refitting the signal shapes & yields) wrt also re-training the MVAs and re-optimizing category boundaries. That information is not in the presentations linked in your email: please provide it as soon as possible. "

This information has never been provided by you. However, this is not that point I am talking about. The code must be consistent, currently - it is not. I understand that the improvement from regression is basically negligible and therefore retraining is not necessary, however calling inconsistent use of variables - "cosmetics" is not correct. I do not see how we can describe such use in the paper. Even though it will not change the result, if you want this code to be part of flashgg, it has to be implemented coherently.

simonepigazzini commented 4 years ago

Hello,

to be clear the approval flag was raised for the changes requested previously, this discussion on including Mjj in the MX calculation is not related to that. I also agree that in case the Mjj regression is used and have no sizable impact on the categorization one should apply it consistently to avoid having to describe two different variables in the paper. This again is a point to be decided on the analysis side that will then need to be reflected in the code pushed by the this PR.

simone

@malcles @edjtscott

edjtscott commented 4 years ago

hi all - I think we need to have a consistent usage of one single mjj variable. I believe there are other cases in the analysis (photon IDMVA) where the BDT is trained before the variables are corrected, it's not a problem. As already stated the changes are expected to be very small so we should ensure consistency.

mgouzevi commented 4 years ago

hi all - I think we need to have a consistent usage of one single mjj variable. I believe there are other cases in the analysis (photon IDMVA) where the BDT is trained before the variables are corrected, it's not a problem. As already stated the changes are expected to be very small so we should ensure consistency.

Hi Ed,

I share your opinion. All this is negligible.

The point I xould like to make is that this PR is fully scientifically consistent and maximally optimal given the training we have now:

1) Mjj*C_L2 is used in the likelihood where it improves the result. 2) Mjj is used in the MVA output calculation consistently with the training.

This can be trivially explained in the text: "For the likelihood calculation Mjj*C_L2 was used." But since as you said it is a negligible effect this sentence can be also skipped.

============

What is proposed is to add on top of the wording "inconsistency" a second scientific inconsistency:

1) MjjC_L2 is used in the likelihood where it improves the result. 2) MjjC_L2 is used in the MVA output calculation while the training was done with Mjj.

We would have to write then in the paper: "The variable Mjj was used for MVA training, and Mjj*C_L2 for MVA output calculation". That may puzzle the reader.

Max

panwarlsweet commented 4 years ago

Hello @simonepigazzini and @chernyavskaya

Could you please tell me if I need to make more changes in the PR after discussion in the last WG meeting for the results with L2 regression? Sorry I couldn't attend last meeting.

simonepigazzini commented 4 years ago

I think the PR is ok. It has been integrated by Nadya, Soumya and Fabio to produce the unblinded results shown last week. I think we will have a new PR soon to clear up and commit the final version of the analysis. This PR will probably include this one too.

panwarlsweet commented 4 years ago

Thanks @simonepigazzini and sorry for my late respond.

Does this also means that I can go ahead for processing flat trees for resonant analysis using this PR or I should wait for final PR from Nadya/Fabio/Soumya? I want to include the L2 regression for resonant results as well.