katilp closed this issue 5 years ago
The path to files can be found through https://eospublichttp01.cern.ch/eos/opendata/cms/MonteCarlo2016/RunIISummer16MiniAODv2 (later, they will be available through the portal record)
NB (from Kimmo): the VM default architecture is slc6_amd64_gcc472, but CMSSW_8_0_26 needs slc6_amd64_gcc530. The arch must be changed when doing `cmsrel`.
After trying out the `cmsrel CMSSW_8_0_26` command a few times, I got two different results. There was either just a warning:

```
WARNING: Release CMSSW_8_0_26 is not available for architecture slc6_amd64_gcc472. Developer's area is created for available architecture slc6_amd64_gcc530.
```
Or an error:

```
ERROR: Unable to find release area for "CMSSW" version "CMMSW_8_0_26" for arch slc6_amd64_gcc472. Please make sure you have used the correct name/version.
```
I couldn't figure out when the result would be just a warning and when it would be an actual error. This can be avoided by manually changing the architecture with `export SCRAM_ARCH=slc6_amd64_gcc530`. After that, `cmsrel CMSSW_8_0_26` works fine, and running `cmsenv` will automatically set the arch to gcc530 in later shell instances.
It is worth noting that with the SCRAM arch set to gcc530, the `cmsrel` command for earlier releases (such as 5_3_32) used for Run I datasets doesn't work without changing the arch again. So I don't know which arch is better as the default.
UPDATE: I now realize that the error above was actually my mistake and was caused by a careless typo... Here is a concise dissection of the situation anyway:
- `cmsrel CMSSW_8_0_26` prints a warning, but creates the CMSSW area anyway
- `cmsenv` at CMSSW_8_0_26/src/ changes SCRAM_ARCH to slc6_amd64_gcc530
- `cmsrel 5_3_32` doesn't work and prints an error

I guess that having slc6_amd64_gcc472 as the default SCRAM architecture is then fine, but there might be some confusion if someone first works with a Run II-friendly CMSSW version and then tries to go back to creating a CMSSW area for Run I datasets using the same shell instance. This is probably a very rare issue, but it's something to be aware of.
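For reference, the workaround above as a single shell sequence. This is a sketch for the CMS open data VM, where `cmsrel` and `cmsenv` are CMSSW shell aliases; it is environment setup and will not run outside that VM:

```shell
# Sketch for the CMS open data VM (slc6); cmsrel/cmsenv are CMSSW shell aliases.
export SCRAM_ARCH=slc6_amd64_gcc530   # override the gcc472 default before cmsrel
cmsrel CMSSW_8_0_26                   # now finds the release for gcc530
cd CMSSW_8_0_26/src
cmsenv                                # later shells then pick up gcc530 automatically
```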
@kimmokal Are the tuples ready to be copied over?
@katilp The .root files are placed in my EOS space and can be found in the path /eos/user/k/kkallone/JetNTuple_QCD_RunII_13TeV_MC/
The HDF5 conversion has truly been a headache due to data columns with variable length, but I think/hope I have now conquered the major obstacles. I will spend this afternoon processing all the files and validating that they work as they should. If all goes well, they can be copied over also later today.
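The conversion code itself isn't posted in this thread; as a hedged illustration of the variable-length-column problem, h5py's vlen dtype can store ragged per-jet arrays. The column names below (`jet_pt`, `pf_cand_pt`) are made up for illustration, not the actual ntuple schema:

```python
# Sketch: storing variable-length per-jet columns in HDF5 with h5py.
# Column names (jet_pt, pf_cand_pt) are illustrative, not the actual ntuple schema.
import h5py
import numpy as np

jet_pt = np.array([45.2, 33.1, 78.9], dtype=np.float32)   # one scalar per jet
pf_cand_pt = [                                            # ragged: per-jet PF candidates
    np.array([1.2, 3.4], dtype=np.float32),
    np.array([0.5], dtype=np.float32),
    np.array([2.2, 0.9, 5.1], dtype=np.float32),
]

vlen_f32 = h5py.special_dtype(vlen=np.float32)  # variable-length float32 dtype

with h5py.File("jets_vlen_example.h5", "w") as f:
    f.create_dataset("jet_pt", data=jet_pt)
    ds = f.create_dataset("pf_cand_pt", (len(pf_cand_pt),), dtype=vlen_f32)
    for i, cands in enumerate(pf_cand_pt):
        ds[i] = cands

# Read back: each element of pf_cand_pt comes out as its own NumPy array.
with h5py.File("jets_vlen_example.h5", "r") as f:
    jets_back = f["jet_pt"][:]
    cands_back = [f["pf_cand_pt"][i] for i in range(len(f["pf_cand_pt"]))]
```

The key point is that each row of the vlen dataset can have a different length, which matches data such as per-jet lists of PF candidates.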
@kimmokal when ready check the permissions of the directory (for the moment I can't access it)
@katilp Can you now access the directory?
I verified that the HDF5 conversion is working as it should. However, it turns out that lxplus is so ridiculously slow right now that I will do the conversion locally, which admittedly will also take a long time. Hence, the .h5 files will be ready to be copied over tomorrow.
@katilp The conversion is now ready. I ended up doing it in a parallel fashion on lxplus. I was perhaps a bit unwise in putting the converted .h5 files in the same folder as the .root files. So if you already started copying the files, you might have ended up copying unfinished .h5 files in the process.
@kimmokal OK, thanks, we'll have a look. We did not start copying yet. For the variable description in https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/tree/2016, would you be able to provide a public page with the relevant information that now resides on CMS-internal pages?
@katilp I actually updated the README for the Master branch earlier, removing all references to the internal twiki pages, but I forgot to update the 2016 branch as well. I'll fix that.
@kimmokal Could you also provide a description text for the purpose of these files (cf. the beginning of http://opendata-dev.cern.ch/record/328; it does not need to be that long)?
@ArtemisLav Could you kindly build a record (similar to http://opendata-dev.cern.ch/record/328) for these Data science jettuples:
We discussed with @tiborsimko that it could go to a new cms-derived-Run2-datascience.json
It would be good to have this as an example record, so that other similar records (3 more to come) can be based on this. Thanks!
@katilp I updated the readme of the github repo and merged the master into the 2016 branch, so it's up-to-date now. Note that there is the line 'git clone https://github.com/cms-legacydata-analyses/JetNtupleProducerTool/', where the 'cms-legacydata-analyses' part needs to be changed in the actual release.
I am now in the process of writing the description for the dataset. I'll send it to you (or to @ArtemisLav ?) during the weekend. I don't think there's a need for the 'How were these data validated?' section.
I started putting the record together based on the metadata provided here. This is what I have so far. The FIXME's need to be addressed as well as anything else you think is wrong/missing.
Regarding the GitHub repo, for preservation purposes we generally prefer it if repos have a release tag for the specific version - that can also be grabbed as a tarball, e.g. see here.
Excellent, thanks! When ready for the release, the github repo will be placed in https://github.com/cms-opendata-analyses (@kimmokal pls see with @caredg ) and it will also have a release tag.
For the dataset semantics, @tiborsimko wanted to check with you whether it could be enriched with the "type" field. Do you think so?
@kimmokal Do you have a description text for this ML file record already? It would be useful so that we can build the record and show it as an example to the others
@katilp I'm sorry, I've been overwhelmed by deadlines. I'll do my best to get the description done today.
@katilp semantics are fixed 44907a059b10ed62bd85b6ec320e0f003dcde7a5
@katilp @ArtemisLav I wrote up this description:
"The dataset consists of particle jets extracted from simulated proton-proton collision events at a center-of-mass energy of 13 TeV, generated with Pythia 8. The particles emerging from the collisions traverse a simulation of the CMS detector and were reconstructed from the simulated detector signals using the particle-flow (PF) algorithm; the reconstructed particles are also called PF candidates. The jets in this dataset were clustered from the PF candidates of each collision event using the anti-$k_t$ algorithm with distance parameter $R = 0.4$.
From each collision event, only jets with transverse momentum exceeding 30 GeV were saved to file. The jets were also required to have absolute pseudorapidity less than 2.5 (this indicates the jet's position in the detector). For each jet, there are variables describing the jet at the high level, particle level and generator level. There are also some variables describing the collision event and the conditions of its simulation. All of the variables are saved on a jet-by-jet basis, which means that one row of data corresponds to one jet.
The origin of a jet is particularly interesting. This so-called flavor of the jet is obtained from the generator-level particles by a jet flavor algorithm, which attempts to match a reconstructed jet to a single initiating particle. As a consequence, the jet flavor definition depends on the chosen algorithm. Here, three different flavor definitions are available. The 'hadron' definition identifies b- and c-hadrons among the jet's constituents, so it is only useful for b-tagging studies. The 'parton' definition extends this to include the light jet flavors (u, d, s and gluon). Finally, there is the 'physics' definition, which looks at the quarks and gluons of the initial collision. The 'parton' and 'physics' definitions both identify all jet flavors, but the former is more biased towards b- and c-quarks. If in doubt, it is recommended to use the 'physics' definition."
I can extend it if it's too short or is missing necessary details.
Also, my Orcid-id is 0000-0001-9769-7163. Is there something else required from me for the metadata?
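As an aside for readers of this thread, the kinematic selection described in the text above (transverse momentum above 30 GeV, absolute pseudorapidity below 2.5) amounts to a simple boolean mask once the jet columns are loaded. A minimal NumPy sketch with invented values:

```python
# Sketch of the jet selection described in the dataset text; values are invented.
import numpy as np

jet_pt = np.array([12.0, 45.2, 33.1, 78.9])   # transverse momentum in GeV
jet_eta = np.array([0.3, -2.8, 1.1, -0.7])    # pseudorapidity

# Keep jets with pT > 30 GeV and |eta| < 2.5
mask = (jet_pt > 30.0) & (np.abs(jet_eta) < 2.5)
selected_pt = jet_pt[mask]   # jets that pass both cuts
```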
@ArtemisLav Could you also add:
"relations": [
{
"doi": "FIXME",
"recid": "12021",
"title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM",
"type": "isChildOf"
}
],
This file is on dev but does not have a DOI yet
@kimmokal If you have an example notebook or similar it could possibly be entered under "usage". The regular data samples have something like
so, in contrast, it would maybe be useful to mention that these files do not require any CMS experiment-specific environment and can be used as in some example (if you have a link)
Thanks @kimmokal
Is there something else required from me for the metadata?
I just need a title for the record and if possible the distribution information (dataset characteristics):
```
"distribution": {
  "formats": [ "e.g. root" ],
  "number_events": 11111,
  "number_files": 2222,
  "size": 3333
},
```
@ArtemisLav I'm copying the files to the final destination; I'll supply all the file information (except the number of events).
BTW, we'll have 122 files each in ROOT and H5 formats, and the data contained in them should be equivalent, so I wonder whether we should say that this dataset contains 122 or 244 files? I guess the latter, but that could also confuse some people, e.g. those who only want ROOT might wonder why there are only 122 of them... Are there any DCAT etc. standards out there for these "alternative formats" cases?
@tiborsimko hmm, would it be easier if we just add a note in "usage", perhaps?
Yes, I would list all files the record holds, and usage note could explain ROOT vs H5 formats indeed.
Note that there is a transfer problem with three H5 files, but otherwise we are good to create this test record.
OK, could someone please provide that description?
Could be something like this, I leave it to @kimmokal to complete: "The use of these files does not require any software specific to the CMS experiment. There are two sets of equivalent files in two different formats: ROOT and H5."
@kimmokal you could add the mention of h5 specific stuff maybe.
@tiborsimko Should we use H5, h5, HD5, hd5, HDF5, hdf5... in the text?
@katilp I have been struggling with the notebook and making it practical enough :/ Could I provide, in the usage part, just two short example scripts for loading the .root and .h5 files in Python?
@ArtemisLav @tiborsimko How do we deal with the "number_events" here, because the files don't contain full events, only jets? Should it just be the total number of jets then?
Do you have any suggestion for the title of the record?
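The two short loading scripts proposed above are not in the thread; here is a hedged sketch of what they might look like. All file, tree and column names are placeholders; for the .root files one would typically use uproot, shown only in a comment here since it needs an actual ntuple file, while the .h5 side is made self-contained by writing a tiny file first:

```python
# Sketch: loading a jet column from an .h5 file with h5py.
# The analogous pattern for the .root files with uproot would be, e.g.:
#     import uproot
#     tree = uproot.open("JetNtuple_example.root")["jetTree"]  # placeholder names
#     jet_pt = tree["jet_pt"].array()
import h5py
import numpy as np

# Write a tiny example file first so this sketch is self-contained.
with h5py.File("jets_load_demo.h5", "w") as f:
    f.create_dataset("jet_pt", data=np.array([45.2, 78.9], dtype=np.float32))

# The actual loading step: read the whole column as a NumPy array.
with h5py.File("jets_load_demo.h5", "r") as f:
    jet_pt = f["jet_pt"][:]
```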
@kimmokal that would be very good as well. Good point about event numbers, it may well be the same for some other ML samples. Should we have an alternative "number_entries" or similar?
Some suggestions for the current draft:

- Display `Data science` as a tab in the header part, in a similar way as the categories for MC in: i.e. there `masterclass` or `datascience`; it would be good to display them after `Dataset` / `Derived` (they would display better starting with a capital letter, though).
- Add:

  ```
  "relations": [
    {
      "description": "<p>This dataset was derived from: </p> ",
      "recid": "12021",
      "title": "/QCD_Pt-15to7000_TuneCUETP8M1_Flat_13TeV_pythia8/RunIISummer16MiniAODv2-PUMoriond17_magnetOn_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/MINIAODSIM",
      "type": "isChildOf"
    }
  ]
  ```

  and

  ```
  "methodology": {
    "description": " <p>This dataset was produced with the software available in: </p>"
  }
  ```

  with the link to the record to be built for the SW (fifth item in the initial list at the start of this issue). @tiborsimko, would that work?
- Instead of `number_events` in:

  ```
  "distribution": {
    "formats": [
      "root",
      "h5"
    ],
    "number_events": 11111,
    "number_files": 244,
    "size": 204611954128
  },
  ```

  it could be "Entries" (`number_entries`). This may be needed for some other ML samples as well.
@katilp It would be nice to open independent issues for these things, so that the work can be parallelised. E.g. @okraskaj can take care of the template amendment while @ArtemisLav could take care of the metadata editing... I'll be busy today until late in the afternoon.
Some quick comments:
- For `number_events`, the number 11111 was just a placeholder; I think(?) we can simply remove it. Not sure about the change to `number_entries` everywhere, I think we should rather introduce a new property `number_jets` if needed? (@okraskaj)

@tiborsimko yes, indeed, `number_entries` or `number_jets` just as an alternative, like here, not changing it everywhere
- Yes for methodology and software record, @ArtemisLav will you have time to add it?
Sure, is there metadata somewhere?
@ArtemisLav the description is above i.e.
"methodology": {
"description": " <p>This dataset was produced with the software available in: </p>"
}
and the link to the software record will need to be a placeholder for now
@katilp I meant for the software record. Are we doing that now?
@ArtemisLav not yet done, but I can add it. Should I open an issue for all sw records needed for these ML samples, or one by one?
Closing as remaining issues of Run2 MiniAODSIM provenance followed up in #2525
In connection with #2440, this issue follows the jettuples to be produced from Run 2 AOD samples, to be made available on the portal.
The datasets:
Data science jettuples (contact Kimmo Kallonen HIP):
To do:
For contributions, see also https://github.com/cernopendata/opendata.cern.ch/wiki/Contributing-content-to-CERN-Open-Data