CMS: Run2 QCD MC for data science jettuples

katilp commented 6 years ago

In connection with #2440, this issue follows the jettuples to be produced from run2 AOD samples, to be made available on the portal.

The datasets:

Data science jettuples (contact Kimmo Kallonen HIP):

/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM
- Release: CMSSW_8_0_21 Global Tag: 80X_mcRun2_asymptotic_2016_TrancheIV_v6

To do:

[x] transfer the MINIAODSIM dataset to T3_CH_CERN_OpenData
- transfer request done ok
[x] prepare the data record for MiniAODSIM see #2525
[x] test that the production workflow works on CMS Open Data VM
- the VM with encapsulated slc6 #2426 can be used (it is not yet on the portal record)
- the correspondig GT is needed (#2443), now available (note the changes needed to read the GT on the config file + symbolic links: cfr http://opendata.cern.ch/docs/cms-guide-for-condition-database)
[ ] test that the production workflow works on ReANA (@alintulu)
[x] prepare the CODP record for the production workflow
- have it in a github repo (see with @caredg ): https://github.com/cms-legacydata-analyses/JetNtupleProducerTool
- provide the corrsponding metadata for the CODP record (see #2584 )
[x] produce the jettuple files and transfer to eospublic upload (see with @tiborsimko)
- they will be in Kimmo's public area to be copied over
[x] provide the metadata for the CODP records for thejettuple files (see https://github.com/cernopendata/opendata.cern.ch/issues/2447#issuecomment-477700348 below)
- NB the variable description is in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/blob/2016/README.md
[x] if appropriate, provide an example of usage as a software record

For contributions, see also https://github.com/cernopendata/opendata.cern.ch/wiki/Contributing-content-to-CERN-Open-Data

katilp commented 6 years ago

The path to files can be found through https://eospublichttp01.cern.ch/eos/opendata/cms/MonteCarlo2016/RunIISummer16MiniAODv2 (later, they will be available through the portal record)

NB: (from Kimmo): the VM default architecture is slc6_amd64_gcc472, but CMSSW_8_0_26 needs slc6_amd64_gcc530. arch must be changed when doing cmsrel

kimmokal commented 6 years ago

After trying out the cmsrel CMSSW_8_0_26 command a few times, I got two different results. There was either just a warning: WARNING: Release CMSSW_8_0_26 is not available for architecture slc6_amd64_gcc472. Developer's area is created for available architecture slc6_amd64_gcc530.

Or an error: ERROR: Unable to find release area for "CMSSW" version "CMMSW_8_0_26" for arch slc6_amd64_gcc472. Please make sure you have used the correct name/version.

I couldn't figure out when the result was just a warning, or an actual error. This can be avoided by manually changing the architecture export SCRAM_ARCH=slc6_amd64_gcc530. After that cmsrel CMSSW_8_0_26 works fine, and setting cmsenv will automatically change the arch to gcc530 in later shell instances.

It is worth noting that with the SCRAM arch set to gcc530, the cmsrel command for earlier releases (such as 5_3_32) used for Run I datasets doesn't work without changing the arch again. So I don't know which arch is better as the default.

UPDATE: I now realize that the error above was actually my mistake and was caused by a careless typo... Here is a concise dissection of the situation anyway:

The default SCRAM_ARCH of the VM is slc6_amd64_gcc472
cmsrel CMSSW_8_0_26 prints a warning, but creates the CMSSW area anyway
Setting cmsenv at CMSSW_8_0_26/src/ changes SCRAM_ARCH to slc6_amd64_gcc530
After the SCRAM_ARCH has changed, cmsrel 5_3_32 doesn't work and prints an error
A new shell instance will again have SCRAM_ARCH=slc6_amd64_gcc472

I guess that having slc6_amd64_gcc472 as the default SCRAM architecture is then fine, but there might be some confusion if someone first works with a Run II-friendly CMSSW version and then tries to go back to creating a CMSSW area for Run I datasets using the same shell instance. This is probably a very rare issue, but it's something to be aware of.

katilp commented 5 years ago

@kimmokal Are the tuples ready to be copied over?

kimmokal commented 5 years ago

@katilp The .root files are placed in my EOS space and can be found in the path /eos/user/k/kkallone/JetNTuple_QCD_RunII_13TeV_MC/

The HDF5 conversion has truly been a headache due to data columns with variable length, but I think/hope I have now conquered the major obstacles. I will spend this afternoon processing all the files and validating that they work as they should. If all goes well, they can be copied over also later today.

katilp commented 5 years ago

@kimmokal when ready check the permissions of the directory (for the moment I can't access it)

kimmokal commented 5 years ago

@katilp Can you now access the directory?

I verified that the HDF5 conversion is working as it should. However, turns out that lxplus is so ridiculously slow right now that I will do the converting locally, which admittedly will also take a long time. Hence, the .h5 files will be ready to be copied over tomorrow.

kimmokal commented 5 years ago

@katilp The conversion is now ready. I ended up doing it in a parallel fashion on lxplus. I was perhaps a bit unwise in putting the converted .h5 files in the same folder as the .root files. So if you already started copying the files, you might have ended up copying unfinished .h5 files in the process.

katilp commented 5 years ago

@kimmokal OK, thanks we'll have a look. We did not start copying yet. For the variable description in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016, would you be able to provide a public page with the relevant information which now resides in CMS internal pages

https://twiki.cern.ch/twiki/bin/view/CMS/JetID13TeVRun2016
https://twiki.cern.ch/twiki/bin/viewauth/CMS/QuarkGluonLikelihood ? We are planning to have the variable description on the record in a similar way as for the existing http://opendata-dev.cern.ch/record/328 (for the underlying structure: http://opendata-dev.cern.ch/record/328/export/json)

kimmokal commented 5 years ago

@katilp I actually updated the README for the Master branch earlier, removing all references to the internal twiki pages, but I forgot to update the 2016 branch as well. I'll fix that.

katilp commented 5 years ago

@kimmokal Could you also provide a description text for the purpose of these files (cfr at the beginning of http://opendata-dev.cern.ch/record/328, but does not need to be that long)?

katilp commented 5 years ago

@ArtemisLav Could you kindly build a record (similar to http://opendata-dev.cern.ch/record/328) for these Data science jettuples:

Dataset name from @kimmokal
Author is Kimmo Kallonen
@kimmokal will provide a description (for "abstract")
Dataset characteristics (@kimmokal can provide N events, n files, volume)
Dataset semantics can be taken from the table in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/master
Instead of "How were these data generated?" it would be better to have "How were these data produced?" with a link to https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016 (it will have a corresponding SW record but it has not yet been created)
How were these data validated? can be left out (tbc @kimmokal )
How can you use these data? @kimmokal can give a text and link (maybe not ready yet)

We discussed with @tiborsimko that it could go to a new cms-derived-Run2-datascience.json

It would be good to have this as an example record, so that other similar records (3 more to come) can be based on this. Thanks!

kimmokal commented 5 years ago

@katilp I updated the readme of the github repo and merged the master into the 2016 branch, so it's up-to-date now. Note that there is the line 'git clone https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/', where the 'cms-legacydata-analyses' part needs to be changed in the actual release.

I am now in the process of writing the description for the dataset. I'll send it to you (or to @ArtemisLav ?) during the weekend. I don't think there's a need for the 'How were these data validated?' section.

ArtemisLav commented 5 years ago

I started putting the record together based on the metadata provided here. This is what I have so far. The FIXME's need to be addressed as well as anything else you think is wrong/missing.

Regarding the GitHub repo, for preservation purposes we generally prefer it if repos have a release tag for the specific version - that can also be grabbed as a tarball, e.g. see here.

katilp commented 5 years ago

Excellent, thanks! When ready for the release, the github repo will be placed in https://github.com/cms-opendata-analyses (@kimmokal pls see with @caredg ) and it will also have a release tag.

For the dataset semantics, @tiborsimko wanted to see with you if it could be possible enriched with the "type" field. Do you think so?

katilp commented 5 years ago

@kimmokal Do you have a description text for this ML file record already? It would be useful so that we can build the record and show it as an example to the others

kimmokal commented 5 years ago

@katilp I'm sorry, I've been overwhelmed by deadlines. I'll do my best to get the description done today.

ArtemisLav commented 5 years ago

@katilp semantics are fixed 44907a059b10ed62bd85b6ec320e0f003dcde7a5

kimmokal commented 5 years ago

@katilp @ArtemisLav I wrote up this description:

"The dataset consists of particle jets extracted from simulated proton-proton collision events at a center-of-mass energy of 13 TeV generated with Pythia 8. The particles emerging from the collisions traverse through a simulation of the CMS detector. The particles were reconstructed from the simulated detector signals using the particle-flow (PF) algorithm. The reconstructed particles are also called PF candidates. The jets in this dataset were clustered from the PF candidates of each collision event using the anti-$k_t$ algorithm with distance parameter $R = 0.4$.

From each collision event, only those jets with transverse momentum exceeding 30 GeV were saved to file. The jets were also required to have pseudorapidity of less than 2.5 (this indicates the jet's position in the detector). For each jet, there are variables describing the jet on a high-level, particle-level and generator-level. There are also some variables describing the collision event and the conditions of its simulation. All of the variables are saved on a jet-by-jet basis, which means that one row of data corresponds to one jet.

The origin of a jet is particularly interesting. This so-called flavor of the jet is obtained from the generator-level particles by a jet flavor algorithm, which attempts to match a reconstructed jet to a single initiating particle. As a consequence, the jet flavor definition depends on the chosen algorithm. Here three different flavor definitions are available. The ‘hadron’ definition identifies b- and c-hadrons from the jet’s constituents, so it is only useful for b-tagging studies. The ‘parton’ definition extends this to include the light jet flavors (u, d, s and gluon). Finally there is the ‘physics’ definition, which looks at the quarks and gluons of the initial collision. The ‘parton’ and ‘physics’ definitions both identify all jet flavors, but the former is more biased towards b- and c-quarks. If in doubt, it is recommended to use the ‘physics’ definition."

I can extend it if it's too short or is missing necessary details.

Also, my Orcid-id is 0000-0001-9769-7163. Is there something else required from me for the metadata?

katilp commented 5 years ago

@ArtemisLav Could you also add:

"relations": [
      {
        "doi": "FIXME", 
        "recid": "12021", 
        "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM", 
        "type": "isChildOf"
      }
    ],

This file is on dev but does not have DOI yet

katilp commented 5 years ago

@kimmokal If you have an example notebook or similar it could possibly be entered under "usage". The regular data samples have something like

usage

so in contrast it would maybe be useful to mention that this does not require any CMS experiment specific environment and can be used as in some example (if you have a link)

ArtemisLav commented 5 years ago

Thanks @kimmokal

Is there something else required from me for the metadata?

I just need a title for the record and if possible the distribution information (dataset characteristics): "distribution": { "formats": [ "e.g. root" ], "number_events": 11111, "number_files": 2222, "size": 3333 },

tiborsimko commented 5 years ago

@ArtemisLav I'm copying the files to the final destination, I'll supply all the file information (except number of events).

BTW we'll have 122 files in ROOT and H5 formats, and the data contained in them should be equivalent, so I wonder whether we shall say that this dataset contains 122 or 244 files? I guess the latter, but that cold also confuse some people, e.g. those that only want ROOT and they might wonder why there is only 122 of them... Any DCAT etc standards out there for this "alternative formats" cases?

ArtemisLav commented 5 years ago

@tiborsimko hmm would it be easier if we just add a note in usage perhaps?

tiborsimko commented 5 years ago

Yes, I would list all files the record holds, and usage note could explain ROOT vs H5 formats indeed.

Note that there is is a transfer trouble with three H5 files, but otherwise we are good to create this test record.

ArtemisLav commented 5 years ago

OK, could someone please provide that description?

katilp commented 5 years ago

Could be something like this, I leave @kimmokal to complete: "The use of these files does not require any software specific to the CMS experiment. There are two sets of equivalent files in two different formats: ROOT and H5"

@kimmokal you could add the mention of h5 specific stuff maybe.

@tiborsimko Should we use H5, h5, HD5, hd5, HDF5, hdf5... in the text?

kimmokal commented 5 years ago

@katilp I have been struggling with the notebook and making it practical enough :/ I could provide in the usage part just two short examples of scripts for loading the .root and .h5 files in Python?

@ArtemisLav @tiborsimko How do we deal with the "number_events" here, because the files don't contain full events, only jets? Should it just be the total number of jets then?

Do you have any suggestion for the title of the record?

katilp commented 5 years ago

@kimmokal that would be very good as well. Good point about event numbers, it may well be the same for some other ML samples. Should we have an alternative "number_entries" or similar?

katilp commented 5 years ago

Some suggestions to the current draft:

[x] can we have Data science as a tab in the header part in a similar way as the categories for MC in: i.e. there:
- datascience is a keyword whereas the processes for the MC record are Categories. Now we have two possible keywords for derived datasets, either masterclass or datascience, it would be good to display them after Dataset Derived (they would display better starting with a capital letter though)
[x] it would be better to have a tabular display for data semantics
[ ] How to use (usage) should come before dataset semantics

[x] Can Related datasets have a description? If so then:

"relations": [
  {
    "description": "<p>This dataset was derived from: </p> ",
    "recid": "12021", 
    "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM", 
    "type": "isChildOf"
  }

[ ] for h5 files, the file listing with root prefix is not useful, if I'm not mistaking, would a listing directly usable for wget be better?
[ ] in this context if we do not expect people to download the files with xrootd (not the h5), the "Download" button is misleading. Can it be "Download index"?
[x] will the file generation/production part be in methodology or else? If so then
```
"methodology": {
  "description": " <p>This dataset was produced with the software available in: </p>"
}
```
with the link to the record to be built for the SW (fifth item in the initial list at the start of this issue), @tiborsimko would that work?
[ ] As the dataset does not contain events but jets, would it be possible to have an alternative for number_events in:
```
"distribution": {
  "formats": [
    "root", 
    "h5"
  ], 
  "number_events": 11111, 
  "number_files": 244, 
  "size": 204611954128
}, 
```
it could be "Entries" (number_entries). This may be needed for some other ML samples as well.

tiborsimko commented 5 years ago

@katilp It would be nice to open indepedent issue for these things, so that the work can be parallelised. E.g. @okraskaj can take care of the template amendment while @ArtemisLav could take care of metadata editing.... I'll be busy today until late in the afternoon.

Some quick comments:

Yes we can display keywords after categories. (@okraskaj )
Yes for nicer semantics display #2577
Usage before semantics is not obvious, since we also have selection, validatiion sections, and we'd have to move characteristics as well to stay close to semantics... The last two had better stay close together
Yes for methodology and software record, @ArtemisLav will you have time to add it?
As for the number_events', the number 1111 was just a placeholder I think(?) we can simply remove it. Not sure about the change tonumber_entrieseverywhere, I think we should rather introduce a new propertynumber_jets` if needed? (@okraskaj)

katilp commented 5 years ago

@tiborsimko yes, indeed number_entries or number_jets just as alternative, like here, not change it everywhere

ArtemisLav commented 5 years ago

Yes for methodology and software record, @ArtemisLav will you have time to add it?

Sure, is there metadata somewhere?

katilp commented 5 years ago

@ArtemisLav the description is above i.e.

"methodology": {
     "description": " <p>This dataset was produced with the software available in: </p>"
 }

and the link to the software record will need to be a place holder for now

ArtemisLav commented 5 years ago

@katilp I meant for the software record. Are we doing that now?

katilp commented 5 years ago

@ArtemisLav not yet done, but I can add it. Should I open an issue for all sw records needed for these ML samples, or one by one?

katilp commented 5 years ago

Closing as remaining issues of Run2 MiniAODSIM provenance followed up in #2525

cernopendata / opendata.cern.ch

CMS: Run2 QCD MC for data science jettuples #2447