Additional information to be added for CMS MC datasets

katilp commented 8 years ago

Need the following additional information (important for research use) for MC datasets:

the cross section
the filter efficiency (if a physics filter, e.g. an additional pt or eta cut, was applied)
for some generators, a "matching efficiency"

The three numbers need to be multiplied to obtain the effective cross section, and the effective luminosity can then be calculated from this and the number of events in the dataset.
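As a minimal sketch of the arithmetic described above (the function names are made up for illustration, and picobarn is only an assumed unit; the source of the numbers determines the actual unit):

```python
def effective_cross_section(cross_section, filter_efficiency=1.0, matching_efficiency=1.0):
    """Multiply the three numbers; the result has the unit of cross_section."""
    return cross_section * filter_efficiency * matching_efficiency


def effective_luminosity(n_events, cross_section, filter_efficiency=1.0, matching_efficiency=1.0):
    """Number of events divided by the effective cross section.

    With a cross section in pb (an assumption), the result is in pb^-1.
    """
    sigma_eff = effective_cross_section(cross_section, filter_efficiency, matching_efficiency)
    if sigma_eff <= 0:
        raise ValueError("effective cross section must be positive")
    return n_events / sigma_eff


# Hypothetical numbers, not taken from any real dataset.
sigma_eff = effective_cross_section(100.0, filter_efficiency=0.5, matching_efficiency=0.8)
lumi_eff = effective_luminosity(1_000_000, 100.0, filter_efficiency=0.5, matching_efficiency=0.8)
```

Both efficiencies default to 1 so that a sample without a filter or without matching can be handled by the same call.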

The recipe to extract this from https://cms-pdmv.cern.ch/mcm/ (CMS internal?)

To access the cross section:

click on "Request"
click on "Output Dataset" (the generic one appearing on the open data portal)
enter the dataset name in the box "dataset name as shown in DAS"
click "Search"
click on the dataset name
the MC cross section value is in "Generator parameters"

If "Generator parameters" is not displayed, you can select it in "Select View" and "Save selection".

tiborsimko commented 8 years ago

Indeed, the web site is not accessible to me. @RaoOfPhysics can you please have a look?

katilp commented 8 years ago

@tiborsimko we probably need to extract these numbers from McM with a script. Maybe once I have an example script that gets them for one dataset, you can help me with a script to get them for the whole list?

katilp commented 7 years ago

Update after a discussion with Luca Perrozzi: For 2011 and 2012, these numbers are not necessarily in McM. However, they can be found in "PREP", e.g. for 2011 http://cms.cern.ch/iCMS/prep/requestmanagement?campid=Summer11 (CMS internal), which gives a list of all datasets in the Summer11 production campaign with the necessary numbers:

[screenshot: prep]

I can have this list in html. @tiborsimko would you be able to extract the three numbers for each dataset record and insert them into the records?

(Note: The CMSSW version in the listing is different from that of AODSIM, as it refers to the generator files, which may have been produced with a different version (and in any case it has no influence on the cross-section value))

From 2015 on, the values can be extracted from McM with a script.

Note that only background MC samples have a cross-section value, for signal MC it is set to one.

Alternatively, the user can run a script in the VM over the MC sample https://twiki.cern.ch/twiki/bin/viewauth/CMS/HowToGenXSecAnalyzer which computes the cross section from the file. However, the documentation says that this only exists from CMSSW_5_3_21 on, and the 2011 MC samples have been produced earlier.

In addition, more precise values for the most important background samples are collected in the pages https://twiki.cern.ch/twiki/bin/viewauth/CMS/StandardModelCrossSections (for 7 TeV, 2011) and https://twiki.cern.ch/twiki/bin/viewauth/CMS/StandardModelCrossSectionsat8TeV (for 8 TeV, 2012), and these pages can go public.

katilp commented 7 years ago

The file containing the cross-section values and filter efficiencies (in html) is now in https://cernbox.cern.ch/index.php/s/zalUiS5SoqU7Y1f (link update 16/04/2018)

As it covers the full Summer11 MC production campaign, it has many more entries (3503) than we have on the portal, and therefore it contains entries which we do not have.

However, for the public 2011 MC, I hope that there is a 1-to-1 correspondence between the dataset names and the entries in this table.

The numbers of interest are the first two of the three values before each dataset name in the listing part, i.e.


<td style="font-size: 10px;">2.98E7</td>
<td style="font-size: 10px;">3.188E-4</td>
<td style="white-space: nowrap;">-1</td>
<td style="white-space: nowrap;">BpToPsiMuMu_2MuPEtaFilter_Tight_7TeV-pythia6-evtgen</td>
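A possible way to pull these two numbers out of the html, sketched with the Python standard library only. It assumes that every entry looks like the four cells above (three values followed by the dataset name) and that the list of dataset names on the portal is known in advance; `CellCollector` and `extract_values` are hypothetical names, and the real file may need extra handling.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, in document order."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._in_td = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_td = True
            self._buf = []

    def handle_data(self, data):
        if self._in_td:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self.cells.append("".join(self._buf).strip())
            self._in_td = False


def extract_values(html_text, dataset_names):
    """Map each known dataset name to the first two of the three preceding cells."""
    parser = CellCollector()
    parser.feed(html_text)
    parser.close()
    cells = parser.cells
    result = {}
    for i, text in enumerate(cells):
        if text in dataset_names and i >= 3:
            try:
                cross_section = float(cells[i - 3])
                filter_efficiency = float(cells[i - 2])
            except ValueError:
                continue  # layout differs from the assumed one; skip this entry
            result[text] = {
                "cross_section": cross_section,
                "filter_efficiency": filter_efficiency,
            }
    return result


sample = """
<td style="font-size: 10px;">2.98E7</td>
<td style="font-size: 10px;">3.188E-4</td>
<td style="white-space: nowrap;">-1</td>
<td style="white-space: nowrap;">BpToPsiMuMu_2MuPEtaFilter_Tight_7TeV-pythia6-evtgen</td>
"""
values = extract_values(sample, {"BpToPsiMuMu_2MuPEtaFilter_Tight_7TeV-pythia6-evtgen"})
```

Matching against known dataset names, rather than guessing which cell is a name, keeps the entries that are not on the portal out of the result.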

It is quite large (80MB), and if needed we can have a look together to get it into a more manageable format. @tiborsimko let me know...

RaoOfPhysics commented 7 years ago

@tiborsimko: Sorry I missed this:

Indeed, the web site is not accessible to me. @RaoOfPhysics can you please have a look?

It's indeed internal. Apologies about the delay in confirming.

katilp commented 7 years ago

For the 2012 release, a similar listing (html extract) resulting from http://cms.cern.ch/iCMS/prep/requestmanagement?campid=Summer12_DR53X

is in https://cernbox.cern.ch/index.php/s/hNofuQSnDcmLtWt (link updated 16/04/2018)

katilp commented 7 years ago

A note for the numbers to be extracted:

cross-section
filter efficiency
matching efficiency

are to be multiplied to obtain the correct effective cross section. In PREP (i.e. for samples of 2011 and 2012) the "filter efficiency" is the product of filter efficiency and match efficiency (which appear as separate entries in McM i.e. for samples of 2015 and beyond). I suggest that we should add these three fields for MC records already now (with match efficiency = 1 for the samples from 2011 and 2012). Or display the value as "Filter efficiency * Match efficiency".
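The suggestion could look roughly like this in code: a sketch that stores the same three fields for both sources and sets the match efficiency to 1 when the record comes from PREP. The function name and the dictionary keys are assumptions for illustration, not an agreed schema.

```python
def normalise_generator_parameters(cross_section, filter_efficiency, match_efficiency=None):
    """Return the three fields plus their product.

    match_efficiency=None marks a PREP-style record (2011 and 2012 samples),
    where the quoted filter efficiency already includes the match efficiency,
    so the match efficiency is stored as 1.
    """
    if match_efficiency is None:
        match_efficiency = 1.0
    return {
        "cross_section": cross_section,
        "filter_efficiency": filter_efficiency,
        "match_efficiency": match_efficiency,
        "effective_cross_section": cross_section * filter_efficiency * match_efficiency,
    }


# PREP-style record, using the two values from the html example earlier in this thread.
prep_record = normalise_generator_parameters(2.98e7, 3.188e-4)

# McM-style record with a separate match efficiency (hypothetical numbers).
mcm_record = normalise_generator_parameters(50.0, 0.9, match_efficiency=0.4)
```

With this convention the product of the three stored fields gives the effective cross section regardless of where the numbers came from.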

ArtemisLav commented 7 years ago

For the 2011 MC records, we could just add a new field (for example, 944 is not specified in the MARC21 documentation, so it should be empty), which could be called "effective cross section" and have 3 subfields:

944 $c (cross-section)
944 $f (filter efficiency)
944 $m (matching efficiency)
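For illustration, such a field could be serialised as a MARCXML `datafield` along these lines. This is only a sketch: the tag 944 and the subfield codes follow the proposal above, `build_944_field` is a hypothetical helper, and how the portal actually ingests records is not covered here.

```python
import xml.etree.ElementTree as ET


def build_944_field(cross_section, filter_efficiency, matching_efficiency):
    """Build a MARCXML datafield 944 with subfields $c, $f and $m."""
    field = ET.Element("datafield", {"tag": "944", "ind1": " ", "ind2": " "})
    subfields = (
        ("c", cross_section),
        ("f", filter_efficiency),
        ("m", matching_efficiency),
    )
    for code, value in subfields:
        subfield = ET.SubElement(field, "subfield", {"code": code})
        subfield.text = str(value)
    return field


# Values from the html example earlier in this thread, with matching efficiency set to 1.
field = build_944_field(2.98e7, 3.188e-4, 1)
xml_text = ET.tostring(field, encoding="unicode")
```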

For 2012, we will have the new schemas, so we can adjust it or change it completely.

I'm not sure how we usually extract information to insert it to the records. @tiborsimko , how do you think we could go about doing this?

katilp commented 7 years ago

The pdf file of the 2011 listing (see above for html https://github.com/cernopendata/opendata.cern.ch/issues/1137#issuecomment-249569559) PREP - Request Management 2011 xc.pdf

tiborsimko commented 7 years ago

Regarding efficiencies, there are also corresponding errors, for example:

Dataset: /ttbarZ_8TeV-Madspin_aMCatNLO-herwig/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM
    Parent dataset: /ttbarZ_8TeV-Madspin_aMCatNLO-herwig/Summer12-START53_V7C-v1/GEN-SIM
        Generator parameters:
            Cross section: 0.1746
            Filter efficiency: 1
            Filter efficiency error: 0
            Match efficiency: 1
            Match efficiency error: -1

Do we want to store those separately?
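If the errors were to be kept as separate fields, a listing in the format shown above could be read into a flat dictionary roughly as follows. The function name is hypothetical, and the value -1 is stored as-is because the listing does not say what it stands for.

```python
def parse_generator_parameters(text):
    """Read the 'Generator parameters:' block of a listing into a dict of floats."""
    params = {}
    in_block = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "Generator parameters:":
            in_block = True
            continue
        if in_block and ":" in stripped:
            key, _, value = stripped.partition(":")
            try:
                params[key.strip()] = float(value)
            except ValueError:
                break  # a non-numeric value means the block has ended
    return params


listing = """
Dataset: /ttbarZ_8TeV-Madspin_aMCatNLO-herwig/Summer12_DR53X-PU_S10_START53_V19-v1/AODSIM
    Parent dataset: /ttbarZ_8TeV-Madspin_aMCatNLO-herwig/Summer12-START53_V7C-v1/GEN-SIM
        Generator parameters:
            Cross section: 0.1746
            Filter efficiency: 1
            Filter efficiency error: 0
            Match efficiency: 1
            Match efficiency error: -1
"""
params = parse_generator_parameters(listing)
```

Keeping each error under its own key leaves the decision about display open while preserving the information.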

tiborsimko commented 7 years ago

Here are various values for 2012 MC datasets:

Match efficiency error: 0.000141
Match efficiency error: 0.000173
Match efficiency error: 0.0001
Match efficiency error: 0.001755
Match efficiency error: 0.001
Match efficiency error: 0.0025
Match efficiency error: 0.002
Match efficiency error: 0.005
Match efficiency error: 0.015
Match efficiency error: 0.01
Match efficiency error: 0.02
Match efficiency error: 0.03
Match efficiency error: 0.05
Match efficiency error: 0.1
Match efficiency error: 0
Match efficiency error: -1
Match efficiency error: 1
Match efficiency error: 3.3e-05
Match efficiency error: 3.6e-05
Filter efficiency error: 0.00017
Filter efficiency error: 0.00026
Filter efficiency error: 0.00034
Filter efficiency error: 0.00048
Filter efficiency error: 0.00051
Filter efficiency error: 0.0005
Filter efficiency error: 0.00063
Filter efficiency error: 0.0012
Filter efficiency error: 0.001
Filter efficiency error: 0.003
Filter efficiency error: 0.01
Filter efficiency error: 0.02
Filter efficiency error: 0
Filter efficiency error: 1.41e-05
Filter efficiency error: -1
Filter efficiency error: 1
Filter efficiency error: 2.2e-05
Filter efficiency error: 2e-05
Filter efficiency error: 3.4e-05
Filter efficiency error: 3.8e-05
Filter efficiency error: 3.9e-05
Filter efficiency error: 3e-05
Filter efficiency error: 4.24e-05
Filter efficiency error: 4.4e-05
Filter efficiency error: 4.69e-05
Filter efficiency error: 4.9e-05
Filter efficiency error: 4e-06
Filter efficiency error: 5e-05
Filter efficiency error: 6.48e-05
Filter efficiency error: 8e-06

katilp commented 7 years ago

The cross-sections are now available from CMSDAS, where they can be extracted in a more straightforward way. In any case (from Luca Perrozzi)

these values are usually computed with the generator used to produce events and not the latest and greatest calculation available. In general, these values are good starting point for the analysts, so I suggest to use them especially if they can be retrieved with a script. However, they should be updated whenever possible with the tables provided in twikis like https://twiki.cern.ch/twiki/bin/viewauth/CMS/StandardModelCrossSections https://twiki.cern.ch/twiki/bin/viewauth/CMS/StandardModelCrossSectionsat8TeV

We have an (oral) agreement from the CMS MC group to provide these numbers publicly.

katilp commented 6 years ago

Note information about matching and filter efficiencies in

https://twiki.cern.ch/twiki/bin/view/CMS/PdmVMcMGenContact#Matching_and_filter_efficiency https://twiki.cern.ch/twiki/bin/viewauth/CMS/Moriond18MC#About_filter_and_matching_effici

From https://indico.cern.ch/event/673255/ To be judged whether relevant for us.

katilp commented 5 years ago

Note the recipe in https://twiki.cern.ch/twiki/bin/view/CMS/HowToGenXSecAnalyzer

heitorPB commented 5 years ago

The Cross Section DB Portal: https://cms-gen-dev.cern.ch/xsdb/

and the twiki: https://twiki.cern.ch/twiki/bin/view/CMS/GenXsecTaskForce

tiborsimko commented 5 years ago

See also RFC https://github.com/cernopendata/opendata.cern.ch/issues/2467 about storing cross section information in the data model fields.

katilp commented 5 years ago

Closing as followed up now in #2476

katilp commented 5 years ago

Reopening, as the nice recipe in #2476 will only work for datasets produced with CMSSW higher than 5_3_31

katilp commented 5 years ago

The cross-section values are now extracted by the provenance script and available in cache, but not yet displayed.

katilp commented 11 months ago

Closing as now superseded by #3454

cernopendata / opendata.cern.ch

Additional information to be added for CMS MC datasets #1137