cernopendata / opendata.cern.ch

Source code for the CERN Open Data portal
http://opendata.cern.ch/
GNU General Public License v2.0

CMS: Record for ML samples (tracker-hit-enriched Run1 AOD) #2575

Closed katilp closed 5 years ago

katilp commented 5 years ago

From #2444: prepare the record for the root/h5 output.

This will appear as a separate record (similar to http://opendata-dev.cern.ch/record/12100). The underlying json will be in cms-derived-Run1-datascience.json (to be created in https://github.com/cernopendata/opendata.cern.ch/blob/master/cernopendata/modules/fixtures/data/records/)

If you are comfortable with the json description, you can suggest the contents directly to @ArtemisLav in json format (see http://opendata-dev.cern.ch/record/12100/export/json), or you can provide the content for the individual fields here.

Needed in particular:

emanueleusai commented 5 years ago

Hi @katilp ,

Here's the information needed:

  • Dataset name from @emanueleusai

Samples with high-momentum jets for tracking, ML, and top quark tagging studies

  • Authors are Emanuele Usai, Michael Andrews, Bjorn Burkle, Sergei Gleyzer, Meenakshi Narain, orcid?

Emanuele Usai https://orcid.org/0000-0001-9323-2107

Michael Andrews https://orcid.org/0000-0001-5537-4518

Bjorn Burkle https://orcid.org/0000-0003-1645-822X

Sergei Gleyzer https://orcid.org/0000-0002-6222-8102

Meenakshi Narain https://orcid.org/0000-0002-7857-7403

  • @emanueleusai will provide a description (for "abstract")

The dataset consists of hits from the tracking detector, reconstructed tracks, simulated tracks, generated particles, and jets clustered from the generated particles. The various objects are matched in order to reconstruct the provenance of the various hits. Samples of events containing light jets (QCD) in various energy ranges have been produced. Additionally, a sample containing all-hadronic high transverse momentum decays of top quarks has been produced.
The dataset consists of events extracted from simulated proton-proton collision events at a center-of-mass energy of 8 TeV generated with Pythia 6 (QCD) or MadGraph2.6 and Pythia6 (top-antitop pair sample). The particles emerging from the collisions traverse a simulation of the CMS detector. Samples labeled as "step2" are in the standard CMS format "AOD" plus a series of low-level tracker-related collections that allow the extraction of the tracker hits. Samples labeled as "step3" are in a custom root ntuple format and contain the position of the hits and information from the generator-level objects associated to the tracker hits. The samples can be used to study top quark identification algorithms that use low-level detector information such as tracker hits. Machine learning algorithms are suitable for this classification task.

  • Dataset characteristics (@emanueleusai can provide N events/entries - @tiborsimko will check n files)

| Data set name | Description | Number of events | Number of files |
|---|---|---|---|
| QCD300to600 | QCD, flat pT hat spectrum, 300 < pT hat < 600 GeV | 1497600 | 2496 |
| QCD400to600 | QCD, flat pT hat spectrum, 400 < pT hat < 600 GeV | 1989000 | 3315 |
| QCD600to3000 | QCD, flat pT hat spectrum, 600 < pT hat < 3000 GeV | 2974800 | 4959 |
| ttbar | ttbar, fully hadronic decays, pT of the top/antitop greater than 400 GeV | 2969109 | 4055 |

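
As a quick sanity check on the table above, the average number of events per file can be computed for each sample. This is only a minimal sketch: the numbers are copied from the table, while the dictionary and function names are illustrative and not part of any tool mentioned in this thread.

```python
# Event and file counts copied from the dataset characteristics table.
SAMPLES = {
    "QCD300to600": (1497600, 2496),
    "QCD400to600": (1989000, 3315),
    "QCD600to3000": (2974800, 4959),
    "ttbar": (2969109, 4055),
}


def events_per_file(samples):
    """Return the average number of events per file for each sample."""
    return {
        name: n_events / n_files
        for name, (n_events, n_files) in samples.items()
    }


averages = events_per_file(SAMPLES)
for name, avg in averages.items():
    print(f"{name}: {avg:.1f} events per file on average")
```

With these numbers the first two QCD samples come out at exactly 600 events per file, which may be a convenient figure when estimating how many files to download for a given number of events.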
  • Dataset semantics table from @emanueleusai

| Data variable | Type | Description |
|---|---|---|
| hit_global_x | `std::vector<float>` | global x position of the RecHit |
| hit_global_y | `std::vector<float>` | global y position of the RecHit |
| hit_global_z | `std::vector<float>` | global z position of the RecHit |
| hit_local_x | `std::vector<float>` | x pos. of the hit in the local sensor coordinate |
| hit_local_y | `std::vector<float>` | y pos. of the hit in the local sensor coordinate |
| hit_local_x_error | `std::vector<float>` | x error in the local sensor coordinate |
| hit_local_y_error | `std::vector<float>` | y error in the local sensor coordinate |
| hit_sub_det | `std::vector<unsigned int>` | subdetector generating the hit [1] |
| hit_layer | `std::vector<unsigned int>` | layer/disk of the subdetector generating the hit |
| hit_type | `std::vector<unsigned int>` | type of sistrip hit [2] |
| hit_simtrack_id | `std::vector<int>` | ID number of the sim track matched to the hit |
| hit_simtrack_index | `std::vector<unsigned int>` | index of the sim track matched to the hit |
| hit_simtrack_match | `std::vector<bool>` | is the hit matched to a sim track? |
| hit_genparticle_id | `std::vector<unsigned int>` | index of the gen particle matched to the hit |
| hit_pdgid | `std::vector<int>` | PDG ID of the gen particle matched to the hit |
| hit_recotrack_id | `std::vector<unsigned int>` | index of the reco track matched to the hit |
| hit_recotrack_match | `std::vector<bool>` | is the hit matched to a reco track? |
| hit_genparticle_match | `std::vector<bool>` | is the hit matched to a gen particle? |
| hit_genjet_id | `std::vector<unsigned int>` | index of the gen jet matched to the hit |
| hit_genjet_match | `std::vector<bool>` | is the hit matched to a gen jet? |
| simtrack_id | `std::vector<unsigned int>` | ID number of the sim track |
| simtrack_pdgid | `std::vector<int>` | PDG ID of the sim track |
| simtrack_charge | `std::vector<int>` | charge of the sim track |
| simtrack_px | `std::vector<float>` | momentum x component of the sim track |
| simtrack_py | `std::vector<float>` | momentum y component of the sim track |
| simtrack_pz | `std::vector<float>` | momentum z component of the sim track |
| simtrack_energy | `std::vector<float>` | energy of the sim track |
| simtrack_vtxid | `std::vector<unsigned int>` | ID number of the sim vertex of the sim track |
| simtrack_genid | `std::vector<unsigned int>` | index of the gen particle associated to the track |
| simtrack_evtid | `std::vector<uint32_t>` | event ID of the sim track |
| genpart_collid | `std::vector<int>` | collision ID of the gen particle |
| genpart_pdgid | `std::vector<int>` | PDG ID of the gen particle |
| genpart_charge | `std::vector<int>` | charge of the gen particle |
| genpart_px | `std::vector<float>` | momentum x component of the gen particle |
| genpart_py | `std::vector<float>` | momentum y component of the gen particle |
| genpart_pz | `std::vector<float>` | momentum z component of the gen particle |
| genpart_energy | `std::vector<float>` | energy of the gen particle |
| genpart_status | `std::vector<int>` | PDG status of the gen particle |
| genjet_px | `std::vector<float>` | momentum x component of the gen jet |
| genjet_py | `std::vector<float>` | momentum y component of the gen jet |
| genjet_pz | `std::vector<float>` | momentum z component of the gen jet |
| genjet_energy | `std::vector<float>` | energy of the gen jet |
| genjet_emEnergy | `std::vector<float>` | electromagnetic energy of the gen jet |
| genjet_hadEnergy | `std::vector<float>` | hadronic energy of the gen jet |
| genjet_invisibleEnergy | `std::vector<float>` | invisible energy of the gen jet |
| genjet_auxiliaryEnergy | `std::vector<float>` | auxiliary energy of the gen jet |
| genjet_const_collid | `std::vector<std::vector<int> >` | collision ID of the constituent of the gen jet |
| genjet_const_pdgid | `std::vector<std::vector<int> >` | PDG ID of the constituent of the gen jet |
| genjet_const_charge | `std::vector<std::vector<int> >` | charge of the constituent of the gen jet |
| genjet_const_px | `std::vector<std::vector<float> >` | momentum x component of the constituent of the gen jet |
| genjet_const_py | `std::vector<std::vector<float> >` | momentum y component of the constituent of the gen jet |
| genjet_const_pz | `std::vector<std::vector<float> >` | momentum z component of the constituent of the gen jet |
| genjet_const_energy | `std::vector<std::vector<float> >` | energy of the constituent of the gen jet |
| track_chi2 | `std::vector<float>` | chi2 of the reco track fit |
| track_ndof | `std::vector<float>` | ndof of the reco track fit |
| track_chi2ndof | `std::vector<float>` | reduced chi2 of the reco track fit |
| track_charge | `std::vector<float>` | charge of the reco track |
| track_momentum | `std::vector<float>` | momentum of the reco track |
| track_pt | `std::vector<float>` | transverse momentum of the reco track |
| track_pterr | `std::vector<float>` | error on the transverse momentum of the reco track |
| track_hitsvalid | `std::vector<unsigned int>` | number of valid hits in the reco track |
| track_hitslost | `std::vector<unsigned int>` | number of lost hits in the reco track |
| track_theta | `std::vector<float>` | theta angle of the reco track |
| track_thetaerr | `std::vector<float>` | error on theta of the reco track |
| track_phi | `std::vector<float>` | phi angle of the reco track |
| track_phierr | `std::vector<float>` | error on phi of the reco track |
| track_eta | `std::vector<float>` | pseudorapidity of the reco track |
| track_etaerr | `std::vector<float>` | error on pseudorapidity of the reco track |
| track_dxy | `std::vector<float>` | transverse impact parameter of the reco track |
| track_dxyerr | `std::vector<float>` | error on the transverse impact parameter of the reco track |
| track_dsz | `std::vector<float>` | longitudinal impact parameter of the reco track |
| track_dszerr | `std::vector<float>` | error on the longitudinal impact parameter of the reco track |
| track_qoverp | `std::vector<float>` | charge over momentum of the reco track |
| track_qoverperr | `std::vector<float>` | error on charge over momentum of the track |
| track_vx | `std::vector<float>` | x position of the vertex of the reco track |
| track_vy | `std::vector<float>` | y position of the vertex of the reco track |
| track_vz | `std::vector<float>` | z position of the vertex of the reco track |
| track_algo | `std::vector<Int_t>` | algorithm type of the reco track |
| track_hit_global_x | `std::vector<std::vector<float> >` | global x position of the RecHit associated to the reco track |
| track_hit_global_y | `std::vector<std::vector<float> >` | global y position of the RecHit associated to the reco track |
| track_hit_global_z | `std::vector<std::vector<float> >` | global z position of the RecHit associated to the reco track |
| track_hit_local_x | `std::vector<std::vector<float> >` | local x position of the RecHit associated to the reco track |
| track_hit_local_y | `std::vector<std::vector<float> >` | local y position of the RecHit associated to the reco track |
| track_hit_local_x_error | `std::vector<std::vector<float> >` | error on local x position of the RecHit associated to the reco track |
| track_hit_local_y_error | `std::vector<std::vector<float> >` | error on local y position of the RecHit associated to the reco track |
| track_hit_sub_det | `std::vector<std::vector<unsigned int> >` | subdetector generating the hit [1] associated to the reco track |
| track_hit_layer | `std::vector<std::vector<unsigned int> >` | layer/disk of the subdetector generating the hit associated to the reco track |

[1] 1 PixelBarrel, 2 PixelEndcap, 3 TIB, 4 TID, 5 TOB, 6 TEC

[2] 0 Pixel hit, 1 rphiRecHit, 2 stereoRecHit, 3 rphiRecHitUnmatched, 4 stereoRecHitUnmatched
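
The integer codes in footnotes [1] and [2] can be turned into readable labels with a simple lookup. The sketch below copies the two code tables from the footnotes; the function name and the example values are made up for illustration, and in practice the input would be the contents of one event's `hit_sub_det` branch.

```python
from collections import Counter

# Code-to-name maps copied from footnotes [1] and [2].
SUB_DET_NAMES = {
    1: "PixelBarrel",
    2: "PixelEndcap",
    3: "TIB",
    4: "TID",
    5: "TOB",
    6: "TEC",
}
HIT_TYPE_NAMES = {
    0: "Pixel hit",
    1: "rphiRecHit",
    2: "stereoRecHit",
    3: "rphiRecHitUnmatched",
    4: "stereoRecHitUnmatched",
}


def count_hits_by_subdetector(hit_sub_det):
    """Count hits per subdetector name; unknown codes are kept visible."""
    return Counter(
        SUB_DET_NAMES.get(code, f"unknown({code})") for code in hit_sub_det
    )


# Synthetic example values, not taken from the dataset.
example_hit_sub_det = [1, 1, 2, 3, 5, 5, 5, 6]
counts = count_hits_by_subdetector(example_hit_sub_det)
print(counts)
```

Codes outside the footnote tables are reported as `unknown(...)` rather than dropped, so unexpected values remain easy to spot.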

  • Related datasets from #2526

See: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/lists

The data were generated in several steps.

For QCD MC: Step 0 (GEN-SIM) --> Step 1 (GEN-SIM-RAW) --> Step 2 (AOD) --> Step 3 (ntuples)

For ttbar MC: LHE --> Step 0 (GEN-SIM) --> Step 1 (GEN-SIM-RAW) --> Step 2 (AOD) --> Step 3 (ntuples)

The configuration files used can be found here: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/configs and the MadGraph cards for the ttbar sample: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/cards

Detailed instructions on how to reproduce the samples can be found here: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/blob/master/README.md

  • How can you use these data? @emanueleusai can you give a brief text and, if possible, a link to an example code?

An example is provided here https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/tree/master/example and instructions on how to run it are provided here: https://github.com/cms-legacydata-analyses/TrackerRecHitProducerTool/blob/master/README.md

The code reads the ntuples and produces a scatter plot of the rechits from three events.
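
For readers who want a feel for the kind of preprocessing such a plot needs, here is a small sketch that turns global hit positions into (z, r) points, where r is the transverse distance from the beam axis. It is not the linked example code: the choice of a z-r projection, the function name, and the input values are all assumptions, and the inputs stand in for one event's `hit_global_x`, `hit_global_y`, and `hit_global_z` branches.

```python
import math


def rz_points(xs, ys, zs):
    """Return (z, r) pairs, with r = sqrt(x^2 + y^2) for each hit."""
    if not (len(xs) == len(ys) == len(zs)):
        raise ValueError("coordinate lists must have the same length")
    return [(z, math.hypot(x, y)) for x, y, z in zip(xs, ys, zs)]


# Synthetic hit positions, not taken from the dataset.
points = rz_points([3.0, 0.0], [4.0, 2.0], [10.0, -5.0])
print(points)
```

The resulting pairs can be passed to any plotting library to draw the scatter plot.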

Let me know if additional information is needed. Cheers, Emanuele

emanueleusai commented 5 years ago

More precisely, here are the numbers of files and events for step2 and step3:

Number of files:

4996 step2_QCD300to600_OD
3315 step2_QCD400to600_OD
4994 step2_QCD_600to3000_01_OD
4925 step2_QCD_600to3000_02_OD
3299 step2_ttbarOD_01_OD
3095 step2_ttbarOD_02_OD
3325 step2_ttbarOD_03_OD
3273 step2_ttbarOD_04_OD
3225 step2_ttbarOD_05_OD

2496 step3_QCD300to600_OD
3315 step3_QCD400to600_OD
4959 step3_QCD600to3000_OD
4055 step3_ttbarOD
4055 step3_ttbarOD_OD

Number of events:

==> step2_QCD300to600_OD_count <== total: 1498800

==> step2_QCD400to600_OD_count <== total: 1989000

==> step2_QCD_600to3000_01_OD_count <== total: 1498000

==> step2_QCD_600to3000_02_OD_count <== total: 1477400

==> step2_ttbarOD_01_OD_count <== total: 604177

==> step2_ttbarOD_02_OD_count <== total: 566482

==> step2_ttbarOD_03_OD_count <== total: 609001

==> step2_ttbarOD_04_OD_count <== total: 598630

==> step2_ttbarOD_05_OD_count <== total: 590819

==> step3_QCD300to600_OD_count <== total: 1497600

==> step3_QCD400to600_OD_count <== total: 1989000

==> step3_QCD600to3000_OD_count <== total: 2974800

==> step3_ttbarOD_OD_count <== total: 2969109
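
The step2 and step3 totals above can be compared per sample with a few lines of Python. This is a sketch only: the counts are copied from this comment, while grouping the `_01` to `_05` step2 directories under a single sample name is an assumption made for the comparison.

```python
# Event totals copied from the counts listed above.
# Assumption: the numbered step2 directories belong to the same sample.
STEP2_EVENTS = {
    "QCD300to600": [1498800],
    "QCD400to600": [1989000],
    "QCD600to3000": [1498000, 1477400],
    "ttbar": [604177, 566482, 609001, 598630, 590819],
}
STEP3_EVENTS = {
    "QCD300to600": 1497600,
    "QCD400to600": 1989000,
    "QCD600to3000": 2974800,
    "ttbar": 2969109,
}


def step_differences(step2, step3):
    """Return step2 total minus step3 total for each sample."""
    return {name: sum(step2[name]) - step3[name] for name in step3}


differences = step_differences(STEP2_EVENTS, STEP3_EVENTS)
for name, diff in differences.items():
    print(f"{name}: step2 has {diff} more events than step3")
```

Under that grouping, the ttbar and QCD400to600 totals agree between the two steps, while QCD300to600 and QCD600to3000 have slightly fewer events in step3 than in step2.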

katilp commented 5 years ago

@emanueleusai Perfect, thanks!!

tiborsimko commented 5 years ago

@emanueleusai I just cross-checked the number of files and there seems to be one difference with the numbers you mention:

$ eos ls /eos/opendata/cms/datascience/TrackerRecHitProducerTool/QCD600to3000_RunI_8TeV/step2_QCD_600to3000_01 | grep -c .root$
4983

while you mentioned 4994 in the comment above.

(1) Is this OK or are we missing some QCD600to3000 step2 files?

(2) I noticed that QCD600to3000 uses directory name QCD_600to3000 for step2 and QCD600to3000 for step 3. Note the extra underscore... Is this wanted or do you want me to remove the underscore and harmonise the names as for the other datasets?
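
Both points can be illustrated with a short Python sketch: counting the entries that end in `.root` in a directory listing, and removing the extra underscore from a name. The helper names and the sample listing are hypothetical; note also that `endswith(".root")` is slightly stricter than the unescaped `grep` pattern used above, where the dot matches any character.

```python
def count_root_files(names):
    """Count entries whose name ends in '.root'."""
    return sum(1 for name in names if name.endswith(".root"))


def harmonise_name(name):
    """Drop the extra underscore so step2 names match the step3 naming."""
    return name.replace("QCD_600to3000", "QCD600to3000")


# Hypothetical directory listing, for illustration only.
listing = ["file_1.root", "file_2.root", "README.txt"]
print(count_root_files(listing))
print(harmonise_name("step2_QCD_600to3000_01"))
```

Names that already follow the step3 convention pass through `harmonise_name` unchanged, so the function can be applied to every directory name without special cases.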

emanueleusai commented 5 years ago

@tiborsimko

> @emanueleusai I just cross-checked the number of files and there seems to be one difference with the numbers you mention:
>
> $ eos ls /eos/opendata/cms/datascience/TrackerRecHitProducerTool/QCD600to3000_RunI_8TeV/step2_QCD_600to3000_01 | grep -c .root$
> 4983
>
> while you mentioned 4994 in the comment above.
>
> (1) Is this OK or are we missing some QCD600to3000 step2 files?

It's OK. It was a mistake on my side.

> (2) I noticed that QCD600to3000 uses directory name QCD_600to3000 for step2 and QCD600to3000 for step 3. Note the extra underscore... Is this wanted or do you want me to remove the underscore and harmonise the names as for the other datasets?

Yes, please go ahead and remove the underscore to harmonize the naming.

Cheers, Emanuele

katilp commented 5 years ago

Close as completed