HEPData / hepdata-converter

Converter software from/to the HEPData YAML format
GNU General Public License v2.0

Convert from YAML format to CSV #2

Closed eamonnmag closed 9 years ago

michal-szostak commented 9 years ago

The conversion itself should not pose too much of a problem, but it is important to decide exactly how the CSV file (or files?) should look. Should the tables be split into separate files (and then potentially compressed, or accessed on a per-table basis?), or should everything be put into one file (and if so, how?)? There is also the question of document and table metadata, which may not be straightforward to represent in a CSV file.

I would like to hear some comments on what the preferred output format would be.

eamonnmag commented 9 years ago

Firstly, the tables should be split into different files, so one data table = one CSV file. Metadata for each table can be included in the CSV by placing '#: value' lines at the top.

In terms of what else the CSV will contain, the current HEPData outputs this: http://hepdata.cedar.ac.uk/view/ins1245023/d1/plain.txt. This is tab-separated, which is equally OK; switching a ',' for a '\t' is not an issue.

What I'd expect it to look like is something similar, but with the dependent and independent variables as columns. Each dependent variable also has what are called qualifiers attached to it. This will look similar to how I currently render the files on the front end; you can access a few records on www.hepdata.net to check that out. Finally, errors for each variable should be split into separate columns and labelled (e.g. stat, sys, etc.).
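To illustrate the layout described above (independent and dependent variables as columns, each error source split into labelled '+' and '-' columns), here is a minimal sketch. The dictionary structure is invented for illustration and only loosely mirrors the HEPData YAML schema; field names like `header`, `values`, `plus`, and `minus` are assumptions, not the real schema.

```python
import csv

# Invented minimal table structure for illustration (not the real YAML schema).
table = {
    "independent_variables": [
        {"header": "SQRT(S) IN GEV", "values": [7000, 8000]},
    ],
    "dependent_variables": [
        {"header": "SIG(fiducial) IN FB",
         "values": [
             {"value": 25.4, "errors": [{"label": "stat", "plus": 3.3, "minus": -3.0}]},
             {"value": 29.8, "errors": [{"label": "stat", "plus": 3.8, "minus": -3.5}]},
         ]},
    ],
}

def table_to_rows(table):
    """Flatten independent/dependent variables into CSV rows, splitting
    each error source into labelled '+' and '-' columns."""
    header = [iv["header"] for iv in table["independent_variables"]]
    for dv in table["dependent_variables"]:
        header.append(dv["header"])
        for err in dv["values"][0]["errors"]:
            header += [f"{err['label']} +", f"{err['label']} -"]
    rows = [header]
    n_points = len(table["independent_variables"][0]["values"])
    for i in range(n_points):
        row = [iv["values"][i] for iv in table["independent_variables"]]
        for dv in table["dependent_variables"]:
            point = dv["values"][i]
            row.append(point["value"])
            for err in point["errors"]:
                row += [err["plus"], err["minus"]]
        rows.append(row)
    return rows
```

The rows can then be written out with `csv.writer` using whichever delimiter is chosen; the variable number of error columns per dependent variable falls out naturally from the per-point error list.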

michal-szostak commented 9 years ago

As for errors, this will generate a variable number of columns (stat and sys will always be there, but of course data may have multiple systematic errors: lumi, background, etc.). Is this approach acceptable?

eamonnmag commented 9 years ago

Yes, for sure. Bear in mind though that even if the errors are most definitely there in reality, they aren't always reported.

michal-szostak commented 9 years ago

Of course; I write only the actual data, so if there are no reported errors there will be no error columns in the output CSV.

michal-szostak commented 9 years ago

This is sample output for Table 1 (from the sample inputs to the new HEPData). I added whitespace and replaced '\t' with semicolons ';' to make it readable here. If there is no problem with how the CSV is generated, I think we can close this issue.

Sample CSV

#: name: Table 1
#: description: The measured fiducial cross sections. The first systematic uncertainty is the combined systematic uncertainty excluding luminosity, the second is the luminosity
#: data_file: data1.yaml
#: keyword reactions: P P --> Z0 Z0 X
#: keyword observables: SIG
#: keyword energies: 7000
RE;              P P --> Z0 < LEPTON+ LEPTON- > Z0 < LEPTON+ LEPTON- > X                         
SQRT(S) IN GEV;  SIG(fiducial) IN FB;  stat +;  stat -;  sys +;   sys -;    sys,lumi +;   sys,lumi -
7000;            25.4;                  3.3;    -3.0;    1;       -1.2;     1;            -1        
8000;            29.8;                  3.8;    -3.5;    1.7;     -1.5;     1.2;          1.2       
9000;            12.7;                  3.1;    -2.9;    1.7;     1.7;      0.5;          0.5       
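A file in this shape can be read back with the standard `csv` module by filtering out the '#:' metadata lines first. This is a sketch, not part of the converter; `read_hepdata_csv` and the inlined sample are invented for illustration.

```python
import csv

# Abbreviated inline sample in the format shown above ('#:' lines = metadata).
SAMPLE = """\
#: name: Table 1
SQRT(S) IN GEV;SIG(fiducial) IN FB;stat +;stat -
7000;25.4;3.3;-3.0
8000;29.8;3.8;-3.5
"""

def read_hepdata_csv(text):
    """Split '#:' metadata lines from the data, then parse the rest
    as semicolon-separated values."""
    metadata, data_lines = {}, []
    for line in text.splitlines():
        if line.startswith("#:"):
            key, _, value = line[2:].strip().partition(": ")
            metadata[key] = value
        elif line.strip():
            data_lines.append(line)
    reader = csv.reader(data_lines, delimiter=";")
    header = next(reader)
    rows = list(reader)
    return metadata, header, rows
```

Note that all values come back as strings; numeric conversion is left to the caller, which is typical for plain `csv` parsing.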
eamonnmag commented 9 years ago

Great. How does it look when there are more complicated 'qualifiers', e.g. RE ... above the column headers? e.g. http://hepdata.cedar.ac.uk/view/ins1245023/d5

michal-szostak commented 9 years ago

It looks more or less the same as in the link you posted. In general, qualifiers are placed above every dependent-variable column (not over the errors), so in the case of two (or more) independent variables with multiple qualifiers it looks like this (Table 9 from the sample HEPData submission):

#: name: Table 9
#: description: The observed and expected EmissT distribution in the dielectron SR-Z. The negligible estimated contribution from Z+jets is omitted in these distributions. The last bin contains the overflow.
#: data_file: data9.yaml
#: keyword energies: 8000
SQRT(S);             ;                    8000.0;   8000.0;                      ;           ;        8000.0;             8000.0;
EVENTS;              ;                    25;       25;                          ;           ;        25;                 25;
ETMISS IN GEV LOW;   ETMISS IN GEV HIGH;  Data;    Expected Background;          stat +;     stat -;  GGM 700 200 1.5;    GGM 900 600 1.5
200.0;               225.0;               0.0;     0.0;                          0.0;        0.0;     0.0;                0.0
225.0;               250.0;               6.0;     0.95;                         0.41;       -0.51;   6.46;               0.97
250.0;               275.0;               1.0;     0.9;                          0.41;       -0.26;   6.82;               1.07
275.0;               300.0;               1.0;     0.42;                         0.12;       -0.19;   2.82;               1.17
300.0;               325.0;               1.0;     0.34;                         0.16;       -0.15;   2.41;               1.05
325.0;               350.0;               2.0;     0.07;                         0.19;       -0.16;   3.11;               1.08
350.0;               375.0;               1.0;     0.68;                         0.56;       -0.55;   0.7 ;               1.13
375.0;               400.0;               1.0;     0.17;                         0.1;        -0.15;   0.9 ;               1.2
400.0;               425.0;               0.0;     0.24;                         0.11;       -0.1;    0.69;               1.01
425.0;               450.0;               1.0;     0.01;                         0.08;       0.08;    0.72;               0.94
450.0;               475.0;               0.0;     0.3;                          0.33;       0.33;    0.0 ;               0.88
475.0;               500.0;               2.0;     0.16;                         0.17;       -0.14;   0.93;               4.59
eamonnmag commented 9 years ago

Great. @GraemeWatt, are you happy with this?

GraemeWatt commented 9 years ago

It looks good, but we should try to improve on the old "plain text" format if possible. The main requirement is that it should be easy to parse with user code or standard CSV readers. Probably you know better than me whether there are standards or conventions for CSV files that we should try to follow.

I don't have a strong preference for the choice of column separator, and we don't need to use tabs. Would the format be easier to parse if we started the lines giving qualifiers and headers with "#:", or put quote marks around the headers containing spaces?

A possible simplification would be to split up tables with multiple "dependent_variables" so that each table contains only a single set of qualifiers. Then Table 9 above or http://hepdata.cedar.ac.uk/view/ins1245023/d5 would each be split into three tables (in the same CSV file), each containing only one "dependent_variable". We do this already for the YODA format (see #5), e.g. http://hepdata.cedar.ac.uk/view/ins1245023/d5/yoda.

michal-szostak commented 9 years ago

The problem with CSV is that it's by no means a "standard"; there are a lot of different flavours. Comments ('#:') are not really supported by most software libraries, so this is one of the things users will have to deal with on their own.

As for wrapping text in quotes, I also think it is a good idea; and as for the column separator, it might be either '\t' or ';', so the only problem is making a definite decision.

As for splitting off dependent variables: if we're going to take this route (I'm not a user, so it's hard for me to say whether it will be better or worse for potential code writers), we will need some way to state that a new table (dependent variable) is starting, something similar to YODA's:

# BEGIN YODA_SCATTER2D /REF/BELLE_2013_I1245023/d05-x01-y03
...
# END YODA_SCATTER2D 

As I said above, I'm not a user, but the usual convention is to have one table per CSV file; if we split one file into multiple tables, most libraries won't be able to handle it at all.

Maybe it would be better to just use the YODA format (as it's human-readable) and keep CSV as simple as possible. Any thoughts? @eamonnmag @GraemeWatt?
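For reference, parsing YODA-style BEGIN/END markers like the ones above is only a few lines of user code, which is the cost being weighed here. A sketch (function name invented):

```python
# Sketch: split a multi-table text file on YODA-style '# BEGIN ... <path>'
# and '# END ...' markers, as in the snippet quoted above.
def split_on_markers(text):
    """Return {path: list_of_lines} for each BEGIN/END-delimited block."""
    tables, current, path = {}, None, None
    for line in text.splitlines():
        if line.startswith("# BEGIN"):
            path = line.split()[-1]   # last token is the object path
            current = []
        elif line.startswith("# END"):
            tables[path] = current
            current = None
        elif current is not None:
            current.append(line)
    return tables
```

So the marker approach is easy to hand-parse, but as noted it defeats off-the-shelf CSV readers, which expect one table per file.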

eamonnmag commented 9 years ago

I would tend to agree. Splitting on dependent variables may not be a bad option, since it's what HistFactory does. As long as the keys, i.e. the independent variables, remain the same, linking these together is not a problem. It would also allow researchers to load only the tables they are interested in. Perhaps having the option to support a more 'verbose' output of files would be a good compromise.

michal-szostak commented 9 years ago

Splitting dependent variables but giving an option to select which ones to output (one, all, or a chosen subset) might not be a bad idea. It would allow greater flexibility for users.

michal-szostak commented 9 years ago

Alright, so is this the official decision: split dependent variables into separate tables, and provide an API to export one, all, or only selected ones? @GraemeWatt, is this acceptable?

GraemeWatt commented 9 years ago

Yes, that sounds good. (For certain tables, maybe the user would prefer all "dependent_variables" to be given in the same table, like in the old "plain text" format, so probably that should still be supported as an option.)
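The selection option agreed above could look something like the following sketch. The function name and table structure are invented for illustration; the real converter's API may differ.

```python
# Hypothetical selection helper: keep one, all, or a chosen subset of
# dependent variables, defaulting to all (the 'verbose' behaviour).
def select_dependent_variables(table, indices=None):
    """Return a copy of `table` keeping only the chosen dependent variables."""
    dvs = table["dependent_variables"]
    if indices is None:          # default: keep everything
        indices = range(len(dvs))
    return {
        "independent_variables": table["independent_variables"],
        "dependent_variables": [dvs[i] for i in indices],
    }
```

Each selected dependent variable can then be written as its own sub-table, while passing no indices reproduces the old all-in-one layout.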

michal-szostak commented 9 years ago

Just as a reminder: I added a feature for exporting all tables from the new HEPData format, as discussed in #9. In this case tables are exported to files named after the table, with a .csv extension.

In this case the converting function accepts a directory path as its output parameter, instead of a file path or a Python file object.
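The directory-output mode described above amounts to something like this sketch (helper name invented; the real converter entry point may differ):

```python
import csv
import os

# Sketch of the directory-output mode: one CSV per table, each file
# named after its table with a .csv extension.
def write_tables_to_dir(tables, output_dir):
    """Write each table (name -> list of rows) to <output_dir>/<name>.csv."""
    paths = []
    for name, rows in tables.items():
        path = os.path.join(output_dir, name + ".csv")
        with open(path, "w", newline="") as f:
            csv.writer(f, delimiter=";").writerows(rows)
        paths.append(path)
    return paths
```

Callers that want a single file or a file object would use the other output modes mentioned in the thread; this path is only taken when exporting a whole submission at once.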

michal-szostak commented 9 years ago

As for the request to export dependent variables to separate tables within the same CSV file, this is the output of the current code:

#: name: Table 9
#: description: The observed and expected EmissT distribution in the dielectron SR-Z. The negligible estimated contribution from Z+jets is omitted in these distributions. The last bin contains the overflow.
#: data_file: data9.yaml
#: keyword energies: 8000
#: SQRT(S);      ;     8000.0
#: EVENTS;       ;     25
'ETMISS IN GEV LOW';   'ETMISS IN GEV HIGH';     'Data'
200.0;                 225.0;                  0.0
225.0;                 250.0;                  6.0
250.0;                 275.0;                  1.0
275.0;                 300.0;                  1.0
300.0;                 325.0;                  1.0
325.0;                 350.0;                  2.0
350.0;                 375.0;                  1.0
375.0;                 400.0;                  1.0
400.0;                 425.0;                  0.0
425.0;                 450.0;                  1.0
450.0;                 475.0;                  0.0
475.0;                 500.0;                  2.0

#: SQRT(S);      ;     8000.0
#: EVENTS;       ;     25
'ETMISS IN GEV LOW';   'ETMISS IN GEV HIGH';   'Expected Background';   'stat +';   'stat -'
200.0;                 225.0;                  0.0;                     0.0;        0.0
225.0;                 250.0;                  0.95;                    0.41;       -0.51
250.0;                 275.0;                  0.9;                     0.41;       -0.26
275.0;                 300.0;                  0.42;                    0.12;       -0.19
300.0;                 325.0;                  0.34;                    0.16;       -0.15
325.0;                 350.0;                  0.07;                    0.19;       -0.16
350.0;                 375.0;                  0.68;                    0.56;       -0.55
375.0;                 400.0;                  0.17;                    0.1;        -0.15
400.0;                 425.0;                  0.24;                    0.11;       -0.1
425.0;                 450.0;                  0.01;                    0.08;       0.08
450.0;                 475.0;                  0.3;                     0.33;       0.33
475.0;                 500.0;                  0.16;                    0.17;       -0.14

#: SQRT(S);      ;     8000.0
#: EVENTS;       ;     25
'ETMISS IN GEV LOW';   'ETMISS IN GEV HIGH';   'GGM 700 200 1.5'
200.0;                 225.0;                  0.0
225.0;                 250.0;                  6.46
250.0;                 275.0;                  6.82
275.0;                 300.0;                  2.82
300.0;                 325.0;                  2.41
325.0;                 350.0;                  3.11
350.0;                 375.0;                  0.7
375.0;                 400.0;                  0.9
400.0;                 425.0;                  0.69
425.0;                 450.0;                  0.72
450.0;                 475.0;                  0.0
475.0;                 500.0;                  0.93

#: SQRT(S);      ;     8000.0
#: EVENTS;       ;     25
'ETMISS IN GEV LOW';   'ETMISS IN GEV HIGH';   'GGM 900 600 1.5'
200.0;                 225.0;                  0.0
225.0;                 250.0;                  0.97
250.0;                 275.0;                  1.07
275.0;                 300.0;                  1.17
300.0;                 325.0;                  1.05
325.0;                 350.0;                  1.08
350.0;                 375.0;                  1.13
375.0;                 400.0;                  1.2
400.0;                 425.0;                  1.01
425.0;                 450.0;                  0.94
450.0;                 475.0;                  0.88
475.0;                 500.0;                  4.59
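Reading back a multi-table file like the one above means splitting on the blank lines between sub-tables; each chunk then carries its own '#:' qualifier lines and header row. A minimal sketch (function name invented):

```python
# Sketch: split a multi-table CSV (as output above) into sub-tables.
# Sub-tables are separated by one or more blank lines.
def split_csv_tables(text):
    """Return a list of line-lists, one per blank-line-separated sub-table."""
    tables, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends the current sub-table
            if current:
                tables.append(current)
                current = []
        else:
            current.append(line)
    if current:                       # flush the trailing sub-table
        tables.append(current)
    return tables
```

Each returned chunk can then be fed to an ordinary CSV reader after stripping its '#:' lines, which keeps the single-file layout usable without custom markers.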
michal-szostak commented 9 years ago

As no further comments were provided (@eamonnmag? @GraemeWatt?), I'm closing this issue; if necessary it can be reopened later.

eamonnmag commented 9 years ago

I think that's fine. Perhaps @GraemeWatt can test it out and provide feedback?
