Dear Raphael, Rene, Alex,
first of all, thanks for this effort to enable the group to work on the new data samples. I fully support a merge into the master as soon as possible.
I have a few comments and questions about this very big pull request. Some of them I have already brought up in private discussions, but I think we should all consider them together.
I saw that the file samples_run2_2016.py has been derived from a copy of samples_run2.py, and I am not so sure that this is a good idea. The samples_*.py files are mainly a result of Fabio's and my developments during the phase in which we rebuilt the official 8 TeV analysis, and they were planned in close contact with Andrew, who shared with us his Run I experience with this topic. The result is these fairly complicated files, which I consider mainly as configuration files; they solve the complicated issue of background estimation. (A simple solution for a complicated problem is quite difficult, and I have not yet heard of better ideas.) With two basically identical configurations, one for the 2015 and one for the 2016 samples, we face the problem that any change to the background estimation has to be applied in both files. This would require a lot of copy & paste, and given the structure of these files I am sure it is basically impossible to do that without copy & paste mistakes (which might even be hard to discover). If there are only minor differences, I would propose to handle them within the same file instead of splitting them into different files, along the lines of the sketch below. What do you think?
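Just to illustrate what I mean, here is a minimal sketch; the class and method names are invented for illustration and do not reflect the actual structure of samples_run2.py:

```python
# Hypothetical sketch: one configuration module covering both years.
# Only the year-dependent pieces branch; the background-estimation logic
# is written down exactly once and therefore changed only once.

class SamplesRun2(object):
    def __init__(self, year=2016):
        self.year = year

    def files_ztt(self):
        # Year-dependent part: only the dataset era differs (era strings are examples).
        era = "Fall15" if self.year == 2015 else "Spring16"
        return "DYJetsToLL_M-50_*_{era}_*.root".format(era=era)

    def ztt(self, config):
        # Year-independent part: the (complicated) background estimation
        # lives in a single place for both years.
        config["files"] = self.files_ztt()
        config["weights"] = "eventWeight"
        return config
```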
My original idea (based on my experience with our old Run I code) was to have nicknames that basically never change for a given sample. Since we now have a semi-automatic nickname generation in Kappa, we have to solve this issue at the analysis/plotting level. I am thinking of people who are not as experienced as you, Raphael, Rene and Alex, for whom I would like to have code and configurations whose defaults already give meaningful results. Therefore my suggestion would be to put at least the generator name into the file names in samples_run2.py, instead of the wildcard we have now (see the example below). What do you think about it? Do you think we can find a solution that does not need frequent changes due to changing nicknames? (I know you will argue that the nicknames do not change very often. But the fact that we now have the file samples_run2_2016.py shows that there are differences.)
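To make this concrete, compare a fully wildcarded entry with one that at least pins down the generator. The patterns here are only invented examples, not actual skim names:

```python
# Hypothetical file patterns, invented for illustration.
# Current situation: the generator is hidden behind the wildcard, so the entry
# silently matches whatever skim happens to be on disk.
files_dy_now = "DY*JetsToLL_M-50_*_13TeV_*/*.root"

# Suggestion: keep the generator (and thereby the physics meaning) explicit,
# so the default configuration already selects a well-defined sample.
files_dy_suggested = "DY*JetsToLL_M-50_madgraphMLM-pythia8_*_13TeV_*/*.root"
```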
Then a question related to Rene's work on the HLTs in the new samples: how simple would it be to go back to the version we have for 2015, in case we get the trigger information back in the new samples?
My last comment is about the size of the pull request. I understand that this work required more changes than usual developments. However, I see that not all commits are related to the new samples (e.g. changes in Python scripts), and I noticed that the changes have been pushed in very large chunks. I hope that I am not the only one trying to read and understand all commits, because this helps to gain experience and to find mistakes (I already found some, which have also been solved). This time I found it quite difficult to get an overview of all changes (GitHub cannot even display them). Therefore I would like to kindly appeal to keep changes on development branches as small as possible (maybe use more branches, like here for example for the HLT-related changes) and to push rather sooner than later, to give colleagues a chance (and the motivation) to digest the changes in smaller portions.
Cheers, Thomas
Hi Thomas,
just one small comment for now. To select whether or not to use the trigger information, two configuration tags are used (NoHltFiltering and DiTauPairNoHLT). Hence it is easily possible to change this behavior for future samples; a rough illustration is given below.
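As a rough illustration, assuming a simple dictionary-style configuration (only the two tag names are real, everything else is made up):

```python
# Minimal sketch of how the two tags could be switched per sample.
sample_has_trigger_info = False  # e.g. the new 2016 skims (illustrative flag)

config = {}
if sample_has_trigger_info:
    # 2015-style samples: keep the HLT matching as before.
    config["NoHltFiltering"] = False
    config["DiTauPairNoHLT"] = False
else:
    # New samples without trigger information: skip the HLT requirements.
    config["NoHltFiltering"] = True
    config["DiTauPairNoHLT"] = True
```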
Cheers, Rene
Hi Thomas, all,
I agree that we should find a better solution than having a samples_*.py file for every year of LHC running, since I also believe that this will at least lead to copy & paste mistakes. It would be nice if we could have nicknames that are static except for the era (Fall15, Run2016B, etc.), for example along the lines of the sketch below.
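Something along these lines, with invented nickname patterns just for illustration:

```python
# Hypothetical sketch of era-parameterised nicknames; the patterns are invented.
NICKNAME_TEMPLATES = {
    "DYJetsToLL_M50": "DYJetsToLLM50_{era}_madgraph-pythia8",
    "SingleMuon":     "SingleMuon_{era}",
}

def nickname(sample, era):
    # era would be e.g. "Fall15" or "Run2016B"; everything else stays static.
    return NICKNAME_TEMPLATES[sample].format(era=era)
```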
I also agree that we should only commit changes to the development branch that are specifically needed there. I realize, of course, that it is often a nuisance to stash local changes, switch to the master, commit there, and then switch back, and I am definitely also guilty of being a bit lazy in this regard at times. However, I am not sure creating many different branches is practical, since in some cases you need all of the changes to perform tests, and then it becomes really annoying if you have multiple branches to take care of; that is, if the developments in some (or all) of the branches are not finished, of course.
Cheers, Alex
Hi, thanks for your comments.
Concerning the samples_*.py files and nicknames, we now have a situation in which a change was needed: several datasets now have "extensions". They need to be separable since they have different numbers of generated events and possibly a different generatorWeight. I am glad our nickname system is flexible and that this was only a minor change.
My suggestion to prevent copy & paste is to do the sample selection the same way createHiggsFilelists.py does it, using database queries (roughly as sketched below). If my jobs keep running this slowly, I can write a preview later today. Given that functionality, one can request the required samples with minimal dependency on changes of nicknames, run periods and so on. One can then also specify a generator, so as not to pick up e.g. two different generators at once.
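Roughly like this; the helper function, field names and entries are only a sketch, not the actual interface of createHiggsFilelists.py:

```python
# Illustrative in-memory "database" of dataset entries (contents invented).
database = [
    {"process": "DYJetsToLL", "generator": "madgraph", "campaign": "RunIISpring16", "nick": "DYJetsToLLM50_Spring16"},
    {"process": "DYJetsToLL", "generator": "amcatnlo", "campaign": "RunIISpring16", "nick": "DYJetsToLLM50amcatnlo_Spring16"},
]

def select_samples(database, process, generator, run_period):
    """Return all dataset entries matching the requested criteria."""
    return [entry for entry in database
            if entry.get("process") == process
            and entry.get("generator") == generator
            and entry.get("campaign", "").startswith(run_period)]

# Requesting exactly one generator avoids mixing two generators by accident.
dy_samples = select_samples(database, "DYJetsToLL", "madgraph", "RunIISpring16")
```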
The size of this PR is dominated by the new sample lists; that is why GitHub cannot handle it. If someone has an idea how to reduce the size of these sample lists, you are welcome to suggest it, but I cannot think of one. To see all changes, run
git diff master CMSSW_8_0_X
I just checked the commits. I found one commit not explicitly targeting the Kappa dict changes, but as far as I can see it is already in the master as well. Rene made some adjustments in the standalone SVFit calculation, but this is related to the changed MVA MET and the SVFit re-finding method; one should not re-run SVFit with the old skims anyway. Some commits by me target the LHE weights and are useless with the old skims. Some commits were necessary for the sync. So, getting back to your criticism: as far as I can see we respected that, and at the moment I cannot see where master-relevant developments come in. Please be more specific about what I have missed.
Dear all, we feel confident that we're ready to merge to the master. We're in sync on the sync sample, the plots are in agreement and the last fix addresses the MVA MET finding problem that occurred in some rare cases.
You can still run in a CMSSW 7 or CMSSW 8 environment. The merge also requires merges in Artus, Kappa and KappaTools, for which I will not create extra PRs because the changes there are minor.
The SVFit caches still have to be produced; we can do this in the next days.
Please speak up now if you do not agree with the merge. I will accept it tomorrow afternoon. Cheers