JonaJJSJ-crypto / Proyecto-de-Tesis

Logbook of my thesis project

Produce ntuples with AcausalPOETAnalyzer #33

Open caredg opened 3 years ago

caredg commented 3 years ago

Ntuplize collision data and MC simulations

caredg commented 3 years ago

I have modified the scripts in the condor folder in the AcausalPOETAnalyzer. I am waiting for @JonaJJSJ-crypto to put a limit on the tracks so they do not blow up the ntuples' size. This is urgent...

JonaJJSJ-crypto commented 3 years ago

I have tested how the pT cut changes the size of the analyzer output for the MC and data .root files. The cut at pt > 5 GeV is already loaded in the analyzer.

| pT cut (GeV) | MC size | DPhoton size | # events |
| -- | -- | -- | -- |
| 0 | 6.1 MB | 4.3 MB | 100 |
| 1 | - | 1.2 MB | 100 |
| 2 | - | 482 kB | 100 |
| 4 | - | 331 kB | 100 |
| 5 | 2.8 MB | 316 kB | 100 |
| 7 | 2.8 MB | 302 kB | 100 |
| 10 | 2.8 MB | 293 kB | 100 |

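For reference, a minimal sketch of where such a cut lives in a POET-style ntuplizer. The helper and buffer names (`fillTrackPt`, `track_pt`, `ptCut`) are illustrative, not the project's actual code; the point is that tracks below the threshold are simply never written to the ntuple branches, which is what produces the sizes in the table above.

```cpp
#include <vector>

#include "DataFormats/TrackReco/interface/Track.h"
#include "DataFormats/TrackReco/interface/TrackFwd.h"

// Illustrative helper: keep only tracks above the pT threshold when filling
// the ntuple buffer; everything below it is never written out.
void fillTrackPt(const reco::TrackCollection& tracks,
                 std::vector<float>& track_pt,
                 double ptCut = 5.0 /* GeV, the cut discussed above */) {
  track_pt.clear();
  for (const auto& trk : tracks) {
    if (trk.pt() < ptCut) continue;  // drop low-pT tracks before writing
    track_pt.push_back(trk.pt());
  }
}
```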
caredg commented 3 years ago

Ok, so if we stick to the 5 GeV cut, compared to 4.3 MB that is a reduction of about 93% in the total size. If we applied this to the diphoton 2012B dataset, which weighs 6.3 TB, the total size of the ntuple for this example dataset would be about 440 GB. This is still too large! We need to get close to tens of GB at most for this to be manageable. What fraction of the weight do tracks take after this cut? Should we cut on a different collection, or should we cut harder on tracks? Is that permissible, or do we really need them to get the secondary vertices?
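The arithmetic behind this estimate, with the surviving fraction from the table rounded to 7%:

```latex
\[
  f = \frac{316\ \text{kB}}{4.3\ \text{MB}} \approx 0.07,
  \qquad
  6.3\ \text{TB} \times 0.07 \approx 440\ \text{GB}.
\]
```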

JonaJJSJ-crypto commented 3 years ago

I didn't check the actual reduction from an AOD file. A few moments ago I ran this test: in my case I reduced a 2.3 GB AOD file to 17 MB, i.e., to 0.7% of its size. This means the output from the 6.3 TB should be about 46 GB. Is this size OK, or should I reduce it further? All the other classes lack cuts, but if needed I would recommend setting a cut at pt > 5 GeV, as in other skims like the one from Stephan. The other possible cuts I would recommend that still produce secondary vertices are 7 GeV and 10 GeV, and no harder, because beyond that the number of secondary vertices decreases drastically.

caredg commented 3 years ago

I made a test running the acausal POET with the 5 GeV cut on tracks over 5 root files (5 jobs) from the diphoton Run2012B dataset. The average output file (job output) weighs ~25 MB. If running over all 1612 files produced the same average output file, the total weight for this dataset would be 1612 × 25 MB ≈ 40 GB. Now, if the average size for diphoton Run2012 is the same, running over the 2719 files that this dataset has would produce ~68 GB. This is probably not the end of the world, but it would be good to decrease it a bit more. I will try making a test with a cut of 10 GeV.

caredg commented 3 years ago

Forgot to mention that the reduction in size agrees with this, i.e., the output is about 0.7% of the input size.

caredg commented 3 years ago

A cut with 10 GeV does not make a lot of difference. Maybe we need to think about restricting events based on other objects, like photon pT, for example. However, this needs to be done as an EDFilter after looking at the final ntuple variables from the signal simulation (so we do not shoot ourselves in the foot). I will send these jobs to prepare signal ntuples ASAP.

JonaJJSJ-crypto commented 3 years ago

I have implemented pT cuts for electrons and photons too, but the fraction only goes down to 0.65%; a cut at pt > 10 for tracks does not make a significant change either. I also checked whether it is possible to reduce the size in other classes, such as vertices, but that does not seem to be a problem.

JonaJJSJ-crypto commented 3 years ago

I have implemented an EDFilter that cuts events lacking energetic electrons (pt > 17), plus a cut in eta that is still under testing (eta < 2.1). This reduced a 2.3 GB data file to a 5.7 MB skimmed file. That corresponds to keeping 0.2% of the size, meaning that 6.3 TB could be reduced to around 15 GB. The downside is that 1 of my 100 simulated events was also cut away, i.e., 1% of true events.
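A minimal sketch of the kind of EDFilter described here, under the assumption of a 2012-era CMSSW setup; the class name `EnergeticElectronFilter` and the `minPt` parameter are hypothetical, and the eta requirement is shown disabled since it was later found to remove signal:

```cpp
// Hypothetical skim filter: keep the event only if it contains at least one
// electron with pT above a configurable threshold (17 GeV in this study).
#include "FWCore/Framework/interface/EDFilter.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "DataFormats/EgammaCandidates/interface/GsfElectron.h"
#include "DataFormats/EgammaCandidates/interface/GsfElectronFwd.h"

class EnergeticElectronFilter : public edm::EDFilter {
public:
  explicit EnergeticElectronFilter(const edm::ParameterSet& cfg)
      : minPt_(cfg.getParameter<double>("minPt")) {}

private:
  bool filter(edm::Event& event, const edm::EventSetup&) override {
    edm::Handle<reco::GsfElectronCollection> electrons;
    event.getByLabel(edm::InputTag("gsfElectrons"), electrons);
    for (const auto& ele : *electrons) {
      // The |eta| < 2.1 requirement tested above would go here, e.g.:
      //   if (std::abs(ele.eta()) >= 2.1) continue;
      if (ele.pt() > minPt_) return true;  // event has an energetic electron
    }
    return false;  // no energetic electron: event is skimmed away
  }

  double minPt_;  // e.g. 17.0 (GeV), set in the python configuration
};

DEFINE_FWK_MODULE(EnergeticElectronFilter);
```

In the python configuration this would be wired in before the ntuplizer sequence, with something like `minPt = cms.double(17.0)` as its only parameter.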

caredg commented 3 years ago

Ok, that is not good. These cuts should not reduce the signal at all... we should really look at the signal distributions first... They are coming...

JonaJJSJ-crypto commented 3 years ago

It seems the problem was the cut in eta. I have removed that cut, and it fixes the problem with the simulated data. However, the reduction in real data is about 2.5%, which means that 6.3 TB should be reduced to around 20.5 GB.

caredg commented 3 years ago

I am testing the latest AcausalPoet ntuplizer with 5 jobs over the TTbar sample. All the test and official output will live under this, with the same password as usual. If successful, I will launch the full jobs.

Test seemed successful. Submitting full jobs for the datasets mentioned....

caredg commented 3 years ago

Ntuples are ready here under the directory signalStudy_round1. There, one can find the merged root files:

caredg commented 3 years ago

@JonaJJSJ-crypto the new LWSM200DnR.root merged ntuple root file is in signalStudy_round4 in the same place

caredg commented 3 years ago

@JonaJJSJ-crypto, we need to prepare the close-to-final POET ntuplizer. We need to:

caredg commented 2 years ago

@JonaJJSJ-crypto, at the usual place and in the cajuela disk, I am copying the newly produced root files in the analysis_round1 directory. These ntuples were produced with the simple electron/jet filter and with a cut of 5GeV for writing out tracks, nothing more. So far we have:

caredg commented 2 years ago

@JonaJJSJ-crypto the backgrounds are almost complete. Only TTbarZ would be missing, but its cross section seems negligible, no?

JonaJJSJ-crypto commented 2 years ago

I think that compared to the big ones it would indeed be negligible.

caredg commented 2 years ago

@JonaJJSJ-crypto After this, I think I am ready to reprocess everything. I will start by running the signal simulation again, but make sure you have what you need in the analyzers that make up the ntuples. Let me know.

JonaJJSJ-crypto commented 2 years ago

@caredg I would like to finish testing all the things I had planned in order to solve our trigger object problem #52

caredg commented 2 years ago

@JonaJJSJ-crypto, the new ntuples (except for signal 400 and 500, which are still being produced) can be found at the same place under the analysis_round3 directory. A similar location can be found in the cajuela repository.

caredg commented 2 years ago

@JonaJJSJ-crypto

JonaJJSJ-crypto commented 2 years ago

@JonaJJSJ-crypto

* [ ]  Pending: decide whether or not to apply the trigger filter (I think we do not need this unless we run into size problems again)

* [x]  Introduce [these](https://github.com/JonaJJSJ-crypto/Proyecto-de-Tesis/issues/43#issuecomment-964330463) cuts for the electron and jet pT.

* [ ]  Add the secondary-vertex pre-selection algorithm

* [x]  Change the primary-vertex source to the one that takes the beam spot into account? [Check if we suffer from beam spot issues #39](https://github.com/JonaJJSJ-crypto/Proyecto-de-Tesis/issues/39)

@caredg so far the electron and jet pT cuts have been added, and the beam-spot primary vertices were added, although the old primary vertices were kept as well. As for the secondary-vertex merging algorithm, a new method is being implemented to discriminate on the distance between vertices at merge time.
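A rough sketch of the distance criterion being described; the helper names and the threshold value are purely illustrative, and the actual implementation may well normalize by the vertex position uncertainties instead of using a fixed distance:

```cpp
#include <cmath>

#include "DataFormats/VertexReco/interface/Vertex.h"

// Illustrative merge criterion for secondary-vertex candidates:
// compute the 3D separation between two reconstructed vertices...
double vertexDistance3D(const reco::Vertex& a, const reco::Vertex& b) {
  const double dx = a.x() - b.x();
  const double dy = a.y() - b.y();
  const double dz = a.z() - b.z();
  return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// ...and merge them only if they are closer than a threshold.
bool shouldMerge(const reco::Vertex& a, const reco::Vertex& b,
                 double maxDist = 0.05 /* cm, illustrative value */) {
  return vertexDistance3D(a, b) < maxDist;
}
```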

caredg commented 2 years ago

@JonaJJSJ-crypto, I have produced ntuples for LWSM200DnR and DYJetsToLL. The code now needs a wall time of 4 hours for all the jobs to complete, especially for the LW sample. Although the LW sample weighs less than before, the DY one weighs more. The files can be found in the usual place, in the analysis_round4 directory. I cannot connect to the minicluster to copy them there; I will see what is going on tomorrow.

JonaJJSJ-crypto commented 2 years ago

@caredg I have the same problem with the minicluster. It seems the LW file is not in any folder yet.

JonaJJSJ-crypto commented 2 years ago

@caredg I have finished checking that applying the trigger filter is viable. The study is as follows: we selected those events whose most energetic electron exceeds 40 GeV in pT and whose second most energetic electron exceeds 25 GeV in pT. Among these events, we count those that passed and did not pass the trigger, and these are the results I obtained.

caredg commented 2 years ago

@JonaJJSJ-crypto, I don't quite understand. How many events, out of the original 150K, pass the 40 and 25 GeV selection? It is 120079, i.e., 9147 + 110932, right? And of those 120079, 110932 pass the trigger, right? That is, 92.38% pass (I suppose that gives 7.6% that do not pass). OK, if that is the case, it matches the "efficiency" reported here. That is, it does look like the efficiency...
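Spelling the check out with the counts quoted above:

```latex
\[
  \varepsilon_{\text{trig}}
  = \frac{110932}{9147 + 110932}
  = \frac{110932}{120079}
  \approx 92.38\%,
  \qquad
  1 - \varepsilon_{\text{trig}} \approx 7.62\%.
\]
```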

JonaJJSJ-crypto commented 2 years ago

@caredg Yes, it is exactly as you say. And yes, it does seem that is the efficiency. Is it necessary to run this analysis on the data?

caredg commented 2 years ago

@JonaJJSJ-crypto, no, not on data. On data we would have to do the whole tag-and-probe procedure to estimate this, but there is time there is no time, so we will take the numbers from the literature.

caredg commented 2 years ago

@JonaJJSJ-crypto, there is no time, I meant to say.

caredg commented 2 years ago

I am removing all the signalStudy_round? directories from our repository to free some space.

caredg commented 2 years ago

@JonaJJSJ-crypto I am producing the ntuples that include:

caredg commented 2 years ago

I am going to delete analysis_round1 and analysis_round2 from the main repository to free up space.

caredg commented 2 years ago

The ntuples incorporating what was discussed here are in the analysis_round6 folder in the usual repositories.

caredg commented 2 years ago

@JonaJJSJ-crypto, the ntuples for LW200 and DY requested here can be found in analysis_round7 in the usual repositories.

caredg commented 2 years ago

@JonaJJSJ-crypto, new ntuples for LW200 and DY requested here can be found in analysis_round8 in the usual repositories.

caredg commented 2 years ago

@JonaJJSJ-crypto, the new ntuples are ready in analysis_round11 in the usual repositories.

JonaJJSJ-crypto commented 2 years ago

Perfect, I will get to work generating the new plots.