Closed katilp closed 5 years ago
Hi @katilp ,
Here's the information needed:
- Dataset name from @emanueleusai
Samples with high-momentum jets for tracking, ML, and top quark tagging studies
- Authors are Emanuele Usai, Michael Andrews, Bjorn Burkle, Sergei Gleyzer, Meenakshi Narain, orcid?
Emanuele Usai https://orcid.org/0000-0001-9323-2107
Michael Andrews https://orcid.org/0000-0001-5537-4518
Bjorn Burkle https://orcid.org/0000-0003-1645-822X
Sergei Gleyzer https://orcid.org/0000-0002-6222-8102
Meenakshi Narain https://orcid.org/0000-0002-7857-7403
- @emanueleusai will provide a description (for "abstract")
The dataset consists of hits from the tracking detector, reconstructed tracks, simulated tracks, generated particles, and jets clustered from the generated particles. The various objects are matched in order to reconstruct the provenance of the various hits.
Samples of events containing light jets (QCD) in various energy ranges have been produced. Additionally, a sample containing all-hadronic high transverse momentum decays of top quarks has been produced.
The dataset consists of events extracted from simulated proton-proton collision events at a center-of-mass energy of 8 TeV generated with Pythia 6 (QCD) or MadGraph2.6 and Pythia 6 (top-antitop pair sample). The particles emerging from the collisions traverse a simulation of the CMS detector.
Samples labeled as "step2" are in the standard CMS format "AOD" plus a series of low-level tracker-related collections that allow the extraction of the tracker hits. Samples labeled as "step3" are in a custom ROOT ntuple format and contain the position of the hits and information from the generator-level objects associated to the tracker hits.
The samples can be used to study top quark identification algorithms that use low-level detector information such as tracker hits. Machine learning algorithms are suitable for this classification task.
- Dataset characteristics (@emanueleusai can provide N events/entries - @tiborsimko will check n files)
Data set name | Description | Number of events | Number of files |
---|---|---|---|
QCD300to600 | QCD, flat pT hat spectrum, 300 < pT hat < 600 GeV | 1497600 | 2496 |
QCD400to600 | QCD, flat pT hat spectrum, 400 < pT hat < 600 GeV | 1989000 | 3315 |
QCD600to3000 | QCD, flat pT hat spectrum, 600 < pT hat < 3000 GeV | 2974800 | 4959 |
ttbar | ttbar, fully hadronic decays, pT of the top/antitop greater than 400 GeV | 2969109 | 4055 |
- Dataset semantics table from @emanueleusai
Data variable | Type | Description |
---|---|---|
hit_global_x | std::vector<float> | global x position of the RecHit |
hit_global_y | std::vector<float> | global y position of the RecHit |
hit_global_z | std::vector<float> | global z position of the RecHit |
hit_local_x | std::vector<float> | x pos. of the hit in the local sensor coordinate |
hit_local_y | std::vector<float> | y pos. of the hit in the local sensor coordinate |
hit_local_x_error | std::vector<float> | x error in the local sensor coordinate |
hit_local_y_error | std::vector<float> | y error in the local sensor coordinate |
hit_sub_det | std::vector<unsigned int> | subdetector generating the hit [1] |
hit_layer | std::vector<unsigned int> | layer/disk of the subdetector generating the hit |
hit_type | std::vector<unsigned int> | type of sistrip hit [2] |
hit_simtrack_id | std::vector<int> | ID number of the sim track matched to the hit |
hit_simtrack_index | std::vector<unsigned int> | index of the sim track matched to the hit |
hit_simtrack_match | std::vector<bool> | is the hit matched to a sim track? |
hit_genparticle_id | std::vector<unsigned int> | index of the gen particle matched to the hit |
hit_pdgid | std::vector<int> | PDG ID of the gen particle matched to the hit |
hit_recotrack_id | std::vector<unsigned int> | index of the reco track matched to the hit |
hit_recotrack_match | std::vector<bool> | is the hit matched to a reco track? |
hit_genparticle_match | std::vector<bool> | is the hit matched to a gen particle? |
hit_genjet_id | std::vector<unsigned int> | index of the gen jet matched to the hit |
hit_genjet_match | std::vector<bool> | is the hit matched to a gen jet? |
simtrack_id | std::vector<unsigned int> | ID number of the sim track |
simtrack_pdgid | std::vector<int> | PDG ID of the sim track |
simtrack_charge | std::vector<int> | charge of the sim track |
simtrack_px | std::vector<float> | momentum x component of the sim track |
simtrack_py | std::vector<float> | momentum y component of the sim track |
simtrack_pz | std::vector<float> | momentum z component of the sim track |
simtrack_energy | std::vector<float> | energy of the sim track |
simtrack_vtxid | std::vector<unsigned int> | ID number of the sim vertex of the sim track |
simtrack_genid | std::vector<unsigned int> | index of the gen particle associated to the track |
simtrack_evtid | std::vector<uint32_t> | event ID of the sim track |
genpart_collid | std::vector<int> | collision ID of the gen particle |
genpart_pdgid | std::vector<int> | PDG ID of the gen particle |
genpart_charge | std::vector<int> | charge of the gen particle |
genpart_px | std::vector<float> | momentum x component of the gen particle |
genpart_py | std::vector<float> | momentum y component of the gen particle |
genpart_pz | std::vector<float> | momentum z component of the gen particle |
genpart_energy | std::vector<float> | energy of the gen particle |
genpart_status | std::vector<int> | PDG status of the gen particle |
genjet_px | std::vector<float> | momentum x component of the gen jet |
genjet_py | std::vector<float> | momentum y component of the gen jet |
genjet_pz | std::vector<float> | momentum z component of the gen jet |
genjet_energy | std::vector<float> | energy of the gen jet |
genjet_emEnergy | std::vector<float> | electromagnetic energy of the gen jet |
genjet_hadEnergy | std::vector<float> | hadronic energy of the gen jet |
genjet_invisibleEnergy | std::vector<float> | invisible energy of the gen jet |
genjet_auxiliaryEnergy | std::vector<float> | auxiliary energy of the gen jet |
genjet_const_collid | std::vector<std::vector<int> > | collision ID of the constituent of the gen jet |
genjet_const_pdgid | std::vector<std::vector<int> > | PDG ID of the constituent of the gen jet |
genjet_const_charge | std::vector<std::vector<int> > | charge of the constituent of the gen jet |
genjet_const_px | std::vector<std::vector<float> > | momentum x component of the constituent of the gen jet |
genjet_const_py | std::vector<std::vector<float> > | momentum y component of the constituent of the gen jet |
genjet_const_pz | std::vector<std::vector<float> > | momentum z component of the constituent of the gen jet |
genjet_const_energy | std::vector<std::vector<float> > | energy of the constituent of the gen jet |
track_chi2 | std::vector<float> | chi2 of the reco track fit |
track_ndof | std::vector<float> | ndof of the reco track fit |
track_chi2ndof | std::vector<float> | reduced chi2 of the reco track fit |
track_charge | std::vector<float> | charge of the reco track |
track_momentum | std::vector<float> | momentum of the reco track |
track_pt | std::vector<float> | transverse momentum of the reco track |
track_pterr | std::vector<float> | error on the transverse momentum of the reco track |
track_hitsvalid | std::vector<unsigned int> | number of valid hits in the reco track |
track_hitslost | std::vector<unsigned int> | number of lost hits in the reco track |
track_theta | std::vector<float> | theta angle of the reco track |
track_thetaerr | std::vector<float> | error on theta of the reco track |
track_phi | std::vector<float> | phi angle of the reco track |
track_phierr | std::vector<float> | error on phi of the reco track |
track_eta | std::vector<float> | pseudorapidity of the reco track |
track_etaerr | std::vector<float> | error on pseudorapidity of the reco track |
track_dxy | std::vector<float> | transverse impact parameter of the reco track |
track_dxyerr | std::vector<float> | error on the transverse impact parameter of the reco track |
track_dsz | std::vector<float> | longitudinal impact parameter of the reco track |
track_dszerr | std::vector<float> | error on the longitudinal impact parameter of the reco track |
track_qoverp | std::vector<float> | charge over momentum of the reco track |
track_qoverperr | std::vector<float> | error on charge over momentum of the track |
track_vx | std::vector<float> | x position of the vertex of the reco track |
track_vy | std::vector<float> | y position of the vertex of the reco track |
track_vz | std::vector<float> | z position of the vertex of the reco track |
track_algo | std::vector<Int_t> | algorithm type of the reco track |
track_hit_global_x | std::vector<std::vector<float> > | global x position of the RecHit associated to the reco track |
track_hit_global_y | std::vector<std::vector<float> > | global y position of the RecHit associated to the reco track |
track_hit_global_z | std::vector<std::vector<float> > | global z position of the RecHit associated to the reco track |
track_hit_local_x | std::vector<std::vector<float> > | local x position of the RecHit associated to the reco track |
track_hit_local_y | std::vector<std::vector<float> > | local y position of the RecHit associated to the reco track |
track_hit_local_x_error | std::vector<std::vector<float> > | error on local x position of the RecHit associated to the reco track |
track_hit_local_y_error | std::vector<std::vector<float> > | error on local y position of the RecHit associated to the reco track |
track_hit_sub_det | std::vector<std::vector<unsigned int> > | subdetector generating the hit [1] associated to the reco track |
track_hit_layer | std::vector<std::vector<unsigned int> > | layer/disk of the subdetector generating the hit associated to the reco track |
[1] 1 PixelBarrel, 2 PixelEndcap, 3 TIB, 4 TID, 5 TOB, 6 TEC
[2] 0 Pixel hit, 1 rphiRecHit, 2 stereoRecHit, 3 rphiRecHitUnmatched, 4 stereoRecHitUnmatched
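The integer codes in footnotes [1] and [2] can be turned into readable labels with a small lookup. The following is only a minimal sketch, assuming the per-event values of hit_sub_det and hit_type have already been loaded into plain Python lists; the helper name `label_hits` is hypothetical and not part of the ntuple tooling.

```python
# Code-to-name maps copied from footnotes [1] and [2].
SUB_DET = {1: "PixelBarrel", 2: "PixelEndcap", 3: "TIB", 4: "TID", 5: "TOB", 6: "TEC"}
HIT_TYPE = {
    0: "Pixel hit",
    1: "rphiRecHit",
    2: "stereoRecHit",
    3: "rphiRecHitUnmatched",
    4: "stereoRecHitUnmatched",
}


def label_hits(hit_sub_det, hit_type):
    """Pair each hit's subdetector name with its hit-type name (hypothetical helper)."""
    return [
        (SUB_DET.get(s, "unknown"), HIT_TYPE.get(t, "unknown"))
        for s, t in zip(hit_sub_det, hit_type)
    ]


# Made-up code values for illustration only, not taken from the dataset.
print(label_hits([1, 3, 6], [0, 1, 4]))
```

Codes outside the documented ranges are mapped to "unknown" rather than raising, which is a design choice of this sketch.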
- Related datasets from #2526
See: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/lists
- "How were these data generated?" (or produced) with a link to SW record to be created from https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool
The data were generated in different steps:
For QCD MC: Step 0 (GEN-SIM) --> Step 1 (GEN-SIM-RAW) --> Step 2 (AOD) --> Step 3 (Ntuples)
For ttbar MC: LHE --> Step 0 (GEN-SIM) --> Step 1 (GEN-SIM-RAW) --> Step 2 (AOD) --> Step 3 (Ntuples)
The configuration files used can be found here: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/configs and the MadGraph cards for the ttbar sample: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/cards
Detailed instructions on how to reproduce the samples can be found here: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/blob/master/README.md
- How can you use these data? @emanueleusai can you give a brief text and possibly a link to an example code?
An example is provided here https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/example and instructions on how to run it are provided here: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/blob/master/README.md
The code reads the ntuples and produces a scatter plot of the rechits from three events.
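For orientation, the coordinate preparation behind such a scatter plot might look like the sketch below. It is not the code from the linked example: it assumes the hit_global_x, hit_global_y, hit_global_z and hit_genjet_match values of one event are already available as Python lists, and the function name `rz_points` is hypothetical.

```python
import math


def rz_points(hit_global_x, hit_global_y, hit_global_z, hit_genjet_match):
    """Return (z, r) pairs for the hits of one event that are matched to a gen jet."""
    points = []
    for x, y, z, matched in zip(hit_global_x, hit_global_y, hit_global_z, hit_genjet_match):
        if matched:
            # r is the transverse distance of the hit from the z axis.
            points.append((z, math.hypot(x, y)))
    return points


# Made-up coordinates for illustration only, not taken from the dataset.
pts = rz_points([3.0, 0.0, 6.0], [4.0, 1.0, 8.0], [10.0, -2.0, 25.0], [True, False, True])
print(pts)
```

The resulting pairs can be passed to any plotting library; selecting on hit_genjet_match is just one possible choice of filter.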
Let me know if additional information is needed. Cheers, Emanuele
More precisely, here are the numbers of files and events for step2 and step3:
Number of files:
4996 step2_QCD300to600_OD
3315 step2_QCD400to600_OD
4994 step2_QCD_600to3000_01_OD
4925 step2_QCD_600to3000_02_OD
3299 step2_ttbarOD_01_OD
3095 step2_ttbarOD_02_OD
3325 step2_ttbarOD_03_OD
3273 step2_ttbarOD_04_OD
3225 step2_ttbarOD_05_OD
2496 step3_QCD300to600_OD
3315 step3_QCD400to600_OD
4959 step3_QCD600to3000_OD
4055 step3_ttbarOD
4055 step3_ttbarOD_OD
Number of events:
==> step2_QCD300to600_OD_count <== total: 1498800
==> step2_QCD400to600_OD_count <== total: 1989000
==> step2_QCD_600to3000_01_OD_count <== total: 1498000
==> step2_QCD_600to3000_02_OD_count <== total: 1477400
==> step2_ttbarOD_01_OD_count <== total: 604177
==> step2_ttbarOD_02_OD_count <== total: 566482
==> step2_ttbarOD_03_OD_count <== total: 609001
==> step2_ttbarOD_04_OD_count <== total: 598630
==> step2_ttbarOD_05_OD_count <== total: 590819
==> step3_QCD300to600_OD_count <== total: 1497600
==> step3_QCD400to600_OD_count <== total: 1989000
==> step3_QCD600to3000_OD_count <== total: 2974800
==> step3_ttbarOD_OD_count <== total: 2969109
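The step2 and step3 totals listed above can be compared per sample with a few lines of arithmetic. This is a sketch only: grouping the numbered step2 parts (_01, _02, ...) under one sample name is an assumption made here, and the function name `step_differences` is hypothetical.

```python
# Event totals copied from the counts listed above.
step2 = {
    "QCD300to600": [1498800],
    "QCD400to600": [1989000],
    "QCD600to3000": [1498000, 1477400],
    "ttbar": [604177, 566482, 609001, 598630, 590819],
}
step3 = {
    "QCD300to600": 1497600,
    "QCD400to600": 1989000,
    "QCD600to3000": 2974800,
    "ttbar": 2969109,
}


def step_differences(step2_parts, step3_totals):
    """Return, per sample, the summed step2 events minus the step3 events."""
    return {name: sum(parts) - step3_totals[name] for name, parts in step2_parts.items()}


diffs = step_differences(step2, step3)
for name, diff in diffs.items():
    print(f"{name}: step2 - step3 = {diff}")
```

A non-zero difference only flags a sample worth a second look; the listing itself does not say why the totals differ.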
@emanueleusai Perfect, thanks!!
@emanueleusai I just cross-checked the number of files and there seems to be one difference with the numbers you mention:
$ eos ls /eos/opendata/cms/datascience/TrackerRecHitProducerTool/QCD600to3000_RunI_8TeV/step2_QCD_600to3000_01 | grep -c .root$
4983
while you mentioned 4994 in the comment above.
(1) Is this OK or are we missing some QCD600to3000 step2 files?
(2) I noticed that QCD600to3000 uses directory name QCD_600to3000 for step2 and QCD600to3000 for step 3. Note the extra underscore... Is this wanted or do you want me to remove the underscore and harmonise the names as for the other datasets?
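As a side note on the file count: in `grep -c .root$` the unescaped dot matches any character, so a name ending in, say, `_root` would also be counted. A stricter count can be sketched in Python as below, assuming the output of the directory listing has been captured as text; the file names shown are made up and the function name `count_root_files` is hypothetical.

```python
def count_root_files(listing):
    """Count names in a directory listing that end with the literal suffix '.root'."""
    return sum(1 for name in listing.splitlines() if name.strip().endswith(".root"))


# Made-up listing for illustration; "notes_root" would match the regex `.root$`
# but is not counted here because it lacks the literal ".root" suffix.
listing = "step2_1.root\nstep2_2.root\nnotes_root\nlog.txt\n"
print(count_root_files(listing))
```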
@tiborsimko
> @emanueleusai I just cross-checked the number of files and there seems to be one difference with the numbers you mention:
> $ eos ls /eos/opendata/cms/datascience/TrackerRecHitProducerTool/QCD600to3000_RunI_8TeV/step2_QCD_600to3000_01 | grep -c .root$
> 4983
> while you mentioned 4994 in the comment above.
> (1) Is this OK or are we missing some QCD600to3000 step2 files?

It's ok. It's a mistake on my side.

> (2) I noticed that QCD600to3000 uses directory name QCD_600to3000 for step2 and QCD600to3000 for step 3. Note the extra underscore... Is this wanted or do you want me to remove the underscore and harmonise the names as for the other datasets?

Yes, please go on and remove the underscore to harmonize the naming
Cheers, Emanuele
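The renaming agreed on above amounts to dropping the underscore after "QCD" in the step2 names. A minimal sketch of that mapping, operating on name strings only (no files are touched) and using the hypothetical helper name `harmonise`:

```python
def harmonise(name):
    """Drop the underscore after 'QCD' so step2 names match the step3 spelling."""
    return name.replace("QCD_", "QCD", 1)


# Directory names as they appear in this thread; names without the extra
# underscore are returned unchanged.
for old in ["step2_QCD_600to3000_01", "step2_QCD_600to3000_02", "step2_QCD400to600"]:
    print(old, "->", harmonise(old))
```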
From #2444 Prepare the record for the root/h5 output.
This will appear as a separate record (similar to http://opendata-dev.cern.ch/record/12100). The underlying json will be in cms-derived-Run1-datascience.json (to be created in https://github.com/cernopendata/opendata.cern.ch/blob/master/cernopendata/modules/fixtures/data/records/)
If you are comfortable with the json description, you can suggest the contents directly to @ArtemisLav in json format (see in http://opendata-dev.cern.ch/record/12100/export/json), or you can give the content to the different fields here.
Needed in particular: