Open tiborsimko opened 9 years ago
Would it be on @tpmccauley to provide that?
It would. What would be the best medium?
I don't know. Upload a separate txt file? I could hack it into the metadata as we did for ATLAS. Probably depends on how extensive you want to make it. You're the expert :).
@tpmccauley @pherterich Since the column descriptions can change as file versions change, it makes sense to store them in a versioned way for better preservability. For ATLAS, we have stored them simply in the record itself, in the MARC tag 505:
Provided the column descriptions are reasonably short, this seems acceptable.
If the descriptions tend to get very long, or if a record contains more than one kind of CSV file, then we'd better store them apart, in a sort of additional "meaning" file next to the data file. E.g. for the file `foo.csv`
we could introduce `foo-csv-column-description.json`
or some such that would describe the columns in a machine-processable way. @pherterich @suenjedt is there some recommendation in the DP world on how "raw data files" and their "meaning files" are coupled together, without going too deeply into the LD and RDF world? (Unless we want to go there already?)
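As a rough sketch of what such a machine-processable "meaning" file could look like: the file name `foo-csv-column-description.json` comes from the comment above, but the JSON schema, keys, and example descriptions here are assumptions for illustration, not an existing portal format.

```python
import json

# Hypothetical schema for a "meaning" file stored next to foo.csv;
# the keys ("file", "columns", "name", "unit", "description") are
# assumptions, not an established Open Data portal convention.
column_description = {
    "file": "foo.csv",
    "columns": [
        {"name": "Run", "description": "Run number of the event"},
        {"name": "Event", "description": "Event number"},
        {"name": "E1", "unit": "GeV", "description": "Energy of the first lepton"},
    ],
}

# Write the meaning file next to the data file.
with open("foo-csv-column-description.json", "w") as fh:
    json.dump(column_description, fh, indent=2)

# A consumer could then map CSV header names to their meaning:
with open("foo-csv-column-description.json") as fh:
    meaning = {c["name"]: c["description"] for c in json.load(fh)["columns"]}
print(meaning["E1"])
```

The point of the separate file is that it can be versioned alongside the data file, so a new file version can ship a matching description without touching the record metadata.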
In any case, it would be great to use the same technique for ATLAS and CMS records, so that we can use uniform tools to manage/show CSV column descriptions for all our records.
I just checked examples from PANGAEA: they just label the files and include the parameters in the metadata, which is then linked to the ISO standards they use. So the real meaning lives outside the file. GenBank examples likewise don't define much; they just label lines and rows and either expect the user to know what they mean or provide glossaries where this can be looked up. Depending on the information, it can go in the glossary or in the metadata, and once we have our first more complex case, we will hopefully have made some progress with RDF etc. Does that sound feasible?
Thanks for checking. I'd vote then for keeping the current policy, i.e. to store brief column descriptions in the records in the MARC tag 505, as we did for ATLAS. (We'd eventually refactor later.)
@tpmccauley Can you please take care of describing the CSV columns?
@tiborsimko @pherterich
I describe the csv fields here: https://github.com/cernopendata/opendata.cern.ch/blob/master/invenio_opendata/base/templates/visualise_histograms.html#L199
so that they show up when one hovers over the parameter button in the histogram application: http://opendatadev.cern.ch/visualise/histograms/CMS
Is this sufficient?
The templates may change, so I think it's better to use the descriptions in the records themselves, as mentioned in https://github.com/cernopendata/opendata.cern.ch/issues/728#issuecomment-71651197
Perhaps @AnxhelaDani can prepare a metadata update PR based on your template text?
@tiborsimko @ArtemisLav @tpmccauley I've added this to the fast lane; I think it should be quite easy to implement (i.e. add the description to the record). If not, feel free to change it.
@ArtemisLav Can you please create the corresponding MARC tags 505 $t $g?
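For reference, in MARCXML such a 505 field could look roughly like the fragment below. The pairing of `$t` (column name) with `$g` (its description) is an assumption about the convention used here; an existing ATLAS record export (e.g. the one linked later in this thread) is the authoritative source for the exact subfield usage.

```xml
<datafield tag="505" ind1=" " ind2=" ">
  <subfield code="t">Run</subfield>
  <subfield code="g">Run number of the event</subfield>
  <subfield code="t">Event</subfield>
  <subfield code="g">Event number</subfield>
  <subfield code="t">M</subfield>
  <subfield code="g">Invariant mass of the pair (GeV)</subfield>
</datafield>
```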
@tpmccauley What about the CSV files with px, py, pz and such? Or are all the CSV files currently on the portal with pt, eta and phi (i.e. without the momentum components separately)? Thanks!
@ArtemisLav Here is an overview of all the CMS CSV file headers:
==> 4lepton.csv <==
Event,Run,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,E3,px3,py3,pz3,pt3,eta3,phi3,Q3,E4,px4,py4,pz4,pt4,eta4,phi4,Q4,M
==> dielectron100k.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> dielectron-Jpsi.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> dielectron-Upsilon.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> dimuon100k.csv <==
Type,Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> dimuon-Jpsi.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> diphoton.csv <==
Run,Event,pt1,eta1,phi1,pt2,eta2,phi2,M
==> masterclass_10-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_10-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_10-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_11-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_11-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_11-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_12-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_12-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_12-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_13-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_13-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_13-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_14-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_14-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_14-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_15-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_15-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_15-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_16-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_16-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_16-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_17-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_17-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_17-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_18-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_18-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_18-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_19-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_19-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_19-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_1-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_1-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_1-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_2-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_2-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_2-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_3-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_3-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_3-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_4-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_4-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_4-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_5-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_5-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_5-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_6-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_6-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_6-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_7-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_7-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_7-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_8-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_8-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_8-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> masterclass_9-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M
==> masterclass_9-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET
==> masterclass_9-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_0.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_1.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_2.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_3.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_4.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_5.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_6.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_7.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_8.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_9.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Wenu.csv <==
Run,Event,E,px,py,pz,pt,eta,phi,Q,MET,phiMET
==> Wmunu.csv <==
Run,Event,E,px,py,pz,pt,eta,phi,Q,MET,phiMET
==> Zee.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
==> Zmumu.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M
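For context on the headers above: the columns follow standard collider kinematics conventions (E, px, py, pz in GeV with c = 1; pt, eta, phi derived from the momentum components; Q the charge; M the invariant mass of the pair). A minimal sketch of how the derived columns relate to the momentum components (an illustration, not code from the portal):

```python
import math

def invariant_mass(e1, p1, e2, p2):
    """Invariant mass M of a two-particle system, from energies e1, e2
    and momentum components p = (px, py, pz): M^2 = (sum E)^2 - |sum p|^2."""
    e = e1 + e2
    px, py, pz = (a + b for a, b in zip(p1, p2))
    return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

def pt_eta_phi(px, py, pz):
    """Transverse momentum, pseudorapidity and azimuthal angle, i.e. the
    pt, eta, phi columns. Note eta diverges for purely forward tracks (pz = +/-|p|)."""
    pt = math.hypot(px, py)
    p = math.sqrt(px * px + py * py + pz * pz)
    eta = 0.5 * math.log((p + pz) / (p - pz))
    phi = math.atan2(py, px)
    return pt, eta, phi

# Two back-to-back massless particles, each with E = 50 GeV:
m = invariant_mass(50.0, (30.0, 0.0, 40.0), 50.0, (-30.0, 0.0, -40.0))
print(m)  # 100 GeV: the momenta cancel, so M is the total energy
```

For the `lnu` files, MET and phiMET are the magnitude and azimuthal angle of the missing transverse energy, which has no longitudinal component, so no E/pz columns accompany it.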
@katilp the fields in the csv files differ. For some there was an explicit request for E,px,py,pz to be included in addition to pt, eta, and phi.
So is 505 the field one should use to describe the csv column contents?
@tpmccauley yes, that is the field (e.g. see here http://opendata.cern.ch/record/554/export/xm). Will you be taking care of that or should we?
@ArtemisLav I'm not sure I have the time to do it for the already-included csv files but can do it for the new ones I am preparing.
@tpmccauley that would be great, thanks, and in the meantime, I can do the ones we already have.
@ArtemisLav I will document the new csv files today. That information can be used for the older ones.
@tiborsimko I think this is now addressed through #2784. Most of the datasets in http://opendata-dev.cern.ch/search?page=2&size=20&q=&experiment=CMS&file_type=csv&subtype=Derived&type=Dataset# now have the dataset semantics with the description of the variables, apart from
The CMS CSV files typically contain the following column descriptions:
It would be useful (both to users and for long-term preservation purposes) to store expanded semantics behind the columns.
P.S. See also the forthcoming ATLAS Higgs challenge CSV files, where the columns are described in an accompanying PDF documentation. It would be useful to store the meaning in the metadata and/or next to the files as well.