cernopendata / opendata.cern.ch

Source code for the CERN Open Data portal
http://opendata.cern.ch/
GNU General Public License v2.0

CMS: better documentation of CSV columns #728

Open tiborsimko opened 9 years ago

tiborsimko commented 9 years ago

The CMS CSV files typically contain the following column description:

Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

It would be useful (both to users and for long-term preservation purposes) to store the expanded semantics of the columns.
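
For illustration, the expanded semantics could look something like the following mapping from column name to meaning (a minimal sketch only; the descriptions follow the usual CMS conventions, and the exact wording would need to be confirmed by CMS):

column_semantics = {
    "Run": "run number",
    "Event": "event number",
    "Type1": "type of the first muon candidate (e.g. global or tracker)",  # wording is an assumption
    "E1": "energy of the first lepton, in GeV",
    "px1": "x component of the momentum of the first lepton, in GeV",
    "py1": "y component of the momentum of the first lepton, in GeV",
    "pz1": "z component of the momentum of the first lepton, in GeV",
    "pt1": "transverse momentum of the first lepton, in GeV",
    "eta1": "pseudorapidity of the first lepton",
    "phi1": "azimuthal angle of the first lepton, in radians",
    "Q1": "electric charge of the first lepton",
    # ... columns ending in 2 describe the second lepton in the same way ...
    "M": "invariant mass of the lepton pair, in GeV",
}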

P.S. See also the forthcoming ATLAS Higgs challenge CSV files, where the columns are described in an accompanying PDF document. It would be useful to store their meaning in the metadata and/or next to the files as well.

pherterich commented 9 years ago

Would it be on @tpmccauley to provide that?

tpmccauley commented 9 years ago

It would. What would be the best medium?

pherterich commented 9 years ago

I don't know. Upload a separate txt file? I could hack it into the metadata as we did for ATLAS. Probably depends on how extensive you want to make it. You're the expert :).

tiborsimko commented 9 years ago

@tpmccauley @pherterich Since the column descriptions can change as file versions change, it makes sense to store them in a versioned way for better preservability. For ATLAS, we have stored them simply in the record itself, in MARC tag 505.

Provided the column descriptions are reasonably short, this seems acceptable.

If the descriptions tend to get very long, or if a record contains more than one kind of CSV file, then we had better store them separately, in a sort of additional "meaning" file next to the data file. E.g. for file foo.csv we could introduce foo-csv-column-description.json or some such that would describe the columns in a machine-processable way. @pherterich @suenjedt is there some recommendation in the DP world on how "raw data files" and their "meaning files" are coupled together, without going too deeply into the LD and RDF world? (Unless we want to go there already?)
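
As a rough sketch of what such a machine-processable "meaning" file could contain (the file name, JSON layout and field names below are purely illustrative, not an agreed convention):

import json

# Hypothetical sidecar file describing the columns of foo.csv;
# file name and JSON structure are illustrative only.
description = {
    "file": "foo.csv",
    "columns": [
        {"name": "Run",   "description": "run number"},
        {"name": "Event", "description": "event number"},
        {"name": "pt1",   "description": "transverse momentum of the first lepton", "unit": "GeV"},
        {"name": "M",     "description": "invariant mass of the lepton pair", "unit": "GeV"},
    ],
}

with open("foo-csv-column-description.json", "w") as f:
    json.dump(description, f, indent=2)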

In any case, it would be great to use the same technique for ATLAS and CMS records, so that we can use uniform tools to manage/show CSV column descriptions for all our records.

pherterich commented 9 years ago

I just checked examples from PANGAEA: they just label the files and include the parameters in the metadata, which is then linked to the ISO standards they use. So the real meaning is outside the file. GenBank examples also don't define much; they just label lines and rows and expect the user to know what they mean, or they provide glossaries where this can be looked up. Depending on the information, it can go in the glossary or in the metadata, and once we have our first more complex case, we will hopefully have made some progress with RDF etc. Does that sound feasible?

tiborsimko commented 9 years ago

Thanks for checking. I'd vote then for keeping the current policy, i.e. storing brief column descriptions in the records in MARC tag 505, as we did for ATLAS. (We can eventually refactor later.)

tiborsimko commented 9 years ago

@tpmccauley Can you please take care of describing the CSV columns?

tpmccauley commented 8 years ago

@tiborsimko @pherterich

I describe the csv fields here: https://github.com/cernopendata/opendata.cern.ch/blob/master/invenio_opendata/base/templates/visualise_histograms.html#L199

so that they show up when one hovers over the parameter button in the histogram application: http://opendatadev.cern.ch/visualise/histograms/CMS

Is this sufficient?

tiborsimko commented 8 years ago

The templates may change, so I think it is better to put the descriptions in the records themselves, as mentioned in https://github.com/cernopendata/opendata.cern.ch/issues/728#issuecomment-71651197

Perhaps @AnxhelaDani can prepare a metadata update PR based on your template text?

katilp commented 7 years ago

@tiborsimko @ArtemisLav @tpmccauley I've added this to the fast lane; I think it should be quite easy to implement (i.e. add the description to the record). If not, feel free to change it.

tiborsimko commented 7 years ago

@ArtemisLav Can you please create the corresponding MARC 505 $t and $g subfields?
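
A minimal sketch of how one column could be expressed, assuming $t carries the column name and $g its description (please check an existing ATLAS record export for the exact subfield convention):

import xml.etree.ElementTree as ET

# One MARC 505 entry per column: $t = column name, $g = description
# (the $t/$g assignment is an assumption; verify against an ATLAS record export).
field = ET.Element("datafield", {"tag": "505", "ind1": " ", "ind2": " "})
ET.SubElement(field, "subfield", {"code": "t"}).text = "pt1"
ET.SubElement(field, "subfield", {"code": "g"}).text = "transverse momentum of the first lepton, in GeV"
print(ET.tostring(field, encoding="unicode"))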

katilp commented 7 years ago

@tpmccauley What about the csv files with px, py and pz and such? Or are all the csv files currently on the portal with pt, eta and phi (i.e. without the momentum components separately)? Thanks!

tiborsimko commented 7 years ago

@ArtemisLav Here is an overview of all the CMS CSV file headers:

==> 4lepton.csv <==
Event,Run,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,E3,px3,py3,pz3,pt3,eta3,phi3,Q3,E4,px4,py4,pz4,pt4,eta4,phi4,Q4,M

==> dielectron100k.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dielectron-Jpsi.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dielectron-Upsilon.csv <==
Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dimuon100k.csv <==
Type,Run,Event,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> dimuon-Jpsi.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> diphoton.csv <==
Run,Event,pt1,eta1,phi1,pt2,eta2,phi2,M

==> masterclass_10-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_10-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_10-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_11-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_11-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_11-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_12-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_12-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_12-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_13-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_13-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_13-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_14-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_14-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_14-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_15-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_15-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_15-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_16-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_16-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_16-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_17-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_17-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_17-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_18-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_18-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_18-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_19-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_19-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_19-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_1-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_1-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_1-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_2-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_2-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_2-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_3-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_3-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_3-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_4-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_4-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_4-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_5-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_5-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_5-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_6-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_6-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_6-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_7-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_7-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_7-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_8-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_8-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_8-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> masterclass_9-leptons.csv <==
Event, E1, px1, py1, pz1, pt1, eta1, phi1, Q1, E2, px2, py2, pz2, pt2, eta2, phi2, Q2, M

==> masterclass_9-lnu.csv <==
Event, E, px, py, pz, pt, eta, phi, Q, MET, phiMET

==> masterclass_9-photon.csv <==
Event, pt1, eta1, phi1, pt2, eta2, phi2, M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_0.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_1.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_2.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_3.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_4.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_5.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_6.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_7.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_8.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon_9.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Run2010B_Mu_AOD_Apr21ReReco-v1-dimuon.csv <==
Run,Event,Type1,E1,px1 ,py1,pz1,pt1,eta1,phi1,Q1,Type2,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Wenu.csv <==
Run,Event,E,px,py,pz,pt,eta,phi,Q,MET,phiMET

==> Wmunu.csv <==
Run,Event,E,px,py,pz,pt,eta,phi,Q,MET,phiMET

==> Zee.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

==> Zmumu.csv <==
Type,Run,Event,E1,px1,py1,pz1,pt1,eta1,phi1,Q1,E2,px2,py2,pz2,pt2,eta2,phi2,Q2,M

tpmccauley commented 7 years ago

@katilp the fields in the csv files differ. For some, there was an explicit request for E, px, py, pz to be included in addition to pt, eta, and phi.
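
(For reference, px, py and pz are redundant in the sense that they follow from pt, eta and phi via the standard kinematic relations; a minimal sketch below, noting that E additionally depends on the particle mass.)

import math

def to_cartesian(pt, eta, phi):
    # Standard relations: px = pt*cos(phi), py = pt*sin(phi), pz = pt*sinh(eta).
    # The energy E additionally needs the particle mass: E = sqrt((pt*cosh(eta))**2 + m**2).
    return pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(eta)

# Example: a muon with pt = 25 GeV, eta = 1.2, phi = 0.5 rad
print(to_cartesian(25.0, 1.2, 0.5))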

So is 505 the field one should use to describe the csv column contents?

ArtemisLav commented 7 years ago

@tpmccauley yes, that is the field (e.g. see here http://opendata.cern.ch/record/554/export/xm). Will you be taking care of that or should we?

tpmccauley commented 7 years ago

@ArtemisLav I'm not sure I have the time to do it for the already-included csv files but can do it for the new ones I am preparing.

ArtemisLav commented 7 years ago

@tpmccauley that would be great, thanks, and in the meantime, I can do the ones we already have.

tpmccauley commented 7 years ago

@ArtemisLav I will document the new csv files today. That information can be used for the older ones.

katilp commented 4 years ago

@tiborsimko I think this is now addressed through #2784. Most of the datasets in http://opendata-dev.cern.ch/search?page=2&size=20&q=&experiment=CMS&file_type=csv&subtype=Derived&type=Dataset# now have the dataset semantics with the description of the variables, apart from