Investigate conversion output for ROOT

michal-szostak commented 9 years ago

If by output for ROOT you mean a ROOT script (like the one here http://hepdata.cedar.ac.uk/view/ins1382590/d2/root), I don't think that would be a problem. I might work on this in parallel with the CSV format, as they will probably share some similarities in how the data is processed.

By the way, how should I detect what type of data is in the table? The HEPData frontend somehow does it, so it should be possible even without it being specified explicitly. @eamonnmag - I heard you were responsible for drawing data in the frontend; can you share how you detect what kind of data it is?

eamonnmag commented 9 years ago

Great. Bringing @cranmer and @betatim into the discussion here in case they have any preferences on how data should be exported to ROOT.

GraemeWatt commented 9 years ago

The current "ROOT" export is really just a CINT script that makes a plot. We don't want to continue this format in the new system, at least not initially. Instead we should export the data to suitable ROOT objects (depending on the data type) and write them as a binary .root file rather than a CINT script. Histograms (TH1, TH2, etc.), or maybe graph objects (TGraph, TGraphAsymmErrors, etc.), will be the appropriate ROOT objects in most cases. If we manage to support HistFactory as an input format, it would be good to be able to export to the same format, and HistFactory might provide a nice way of organising the multiple histograms associated with a particular measurement.

michal-szostak commented 9 years ago

I was also thinking that it would be better to export directly to the binary format instead of an interpreter script. Thankfully ROOT provides very good integration with Python via the ROOT package, so writing actual binary objects should be even easier than creating a CINT script (or its newer incarnation in ROOT 6), in addition to being faster on the client side.

eamonnmag commented 9 years ago

+1

michal-szostak commented 9 years ago

But my question still holds, how should I find out what type of data I'm working on (TH1, TH2, TGraph etc) - @eamonnmag?

eamonnmag commented 9 years ago

Pretty much all the current hepdata tables could be encoded as histograms in ROOT, so, TH1. Graeme or someone better versed with ROOT would have to confirm though.

betatim commented 9 years ago

Not sure I understand the question, but you can access the type of an object stored in a ROOT file to find out if it is a TH1, TH2, etc. (isinstance(object, ROOT.TH1)). Beware of the slightly screwy inheritance structure though: isinstance(a_th2, ROOT.TH1) == True (!!)

eamonnmag commented 9 years ago

The question was more, "given a current table in HEPData, what type of object in ROOT should represent it". Assume that there is no pre-existing ROOT object since all the files are being imported afresh.

michal-szostak commented 9 years ago

The question was a little different - ROOT is the output format, the input is "almost" raw data (with some metadata describing it) - https://github.com/HEPData/hepdata-submission. So in order to construct proper ROOT objects (TH1 or otherwise), I must make an educated guess about what the data represents.

michal-szostak commented 9 years ago

For now I'm creating a file which looks like this (image in the attachment). The class used to store data is the same as in the original HEPData ROOT output: TGraphAsymmErrors. I didn't use TH1 because I couldn't find a sensible way to insert errors into it (@GraemeWatt - any comment on this?). The TGraphAsymmErrors class itself provides only total errors, so I'm basically doing the same thing as with the YODA format: I sum all the errors in quadrature.

[Attached image: screenshot from 2015-08-20 15:01:19]
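
For illustration, the sum in quadrature mentioned above can be sketched in a few lines of plain Python. This is only a sketch and not the converter's actual code: the function name `total_errors` and the representation of each error as a `(minus, plus)` pair of magnitudes are assumptions made for the example.

```python
import math


def total_errors(errors):
    """Combine several (minus, plus) error magnitudes into one total pair.

    `errors` is a list of (minus, plus) tuples (hypothetical layout); a
    symmetric error is simply given with minus == plus.
    """
    minus = math.sqrt(sum(lo ** 2 for lo, _ in errors))
    plus = math.sqrt(sum(hi ** 2 for _, hi in errors))
    return minus, plus


# Two symmetric errors of 3 and 4 combine to a total error of 5 on each side.
print(total_errors([(3.0, 3.0), (4.0, 4.0)]))  # (5.0, 5.0)
```

The resulting pair is what would be passed as the low and high y errors of a point in a TGraphAsymmErrors.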

GraemeWatt commented 9 years ago

That looks good. I agree that TGraphAsymmErrors is better than TH1 if we want only one ROOT object per dependent variable. You can write the low and high values of each bin as x errors and write the headers as axes labels instead of as a title. You should check that there is only one independent variable (if there are two independent variables, you would need something like TGraph2DErrors).

TGraphAsymmErrors objects are also used by YODA in their bin/yoda2root script (see include/YODA/ROOTCnv.h) when converting a YODA Scatter2D data type. By the way, src/WriterYODA.cc looks quite simple, so I think it is fine just to write the YODA format by hand (#5) without importing the Python interface. Parsing the YODA format (#10) is more difficult and might need to be done by a separate tool requiring the YODA package (and maybe also the Rivet package) to be installed, so I would leave that for the moment.

At a later stage we can try to write one TH1 object for the central value and each of the errors and use HistFactory to organise the multiple histograms, but that would be another output format.

cranmer commented 9 years ago

Hi

I haven’t read all of this, but very interested. Will reply more soon.

Quick feedback:

- Definitely the binary objects and not a script to make them.
- I would offer a histogram option. I may have a biased view, but that is the most used format for shipping the numbers around and people make the TGraphAsymmErrors at the end. So for convenience, I’d offer a TH1 based solution.

Kyle


GraemeWatt commented 9 years ago

The problem with histograms is that often the bin widths are not given in existing HepData records (e.g. last record added) and for some observables it is not even meaningful to give bin widths. As far as I know (?), a TH1 with zero bin widths cannot be created in ROOT, whereas it is easy to create a TGraphAsymmErrors with zero x errors. But I guess we can support both options simultaneously, so that we don't need a separate ROOT export for use with HistFactory. We can write a TGraphAsymmErrors object with only total errors, then if the bin width is non-zero, we can also write separate TH1 objects for the central value and each of the individual errors.

michal-szostak commented 9 years ago

The good thing about the ROOT format is that a file can contain a virtually unlimited number of objects. So I would say that as a further enhancement we can provide histogram objects for the data that allows it, and just write them to the directory of the table. This way the user will have a choice of which object to use.

The ROOT file from which the screenshot was taken is in the mszostak/root branch of the git repository, at hepdata-converter/hepdata_converter/testsuite/testdata/root/full.root (https://github.com/HEPData/hepdata-converter/blob/mszostak/root/hepdata_converter/testsuite/testdata/root/full.root). It will be updated after making improvements to the code.

As for multiple independent variables - I understand that for two I can (should) use TGraph2DErrors, but what about more? Any suggestions? @GraemeWatt ?

So if the current title becomes the axis labels, what should the title of the graph / histogram be? The name of the table is a little bit ambiguous because there may be a couple of graphs / histograms per single table.

As for histograms (TH1, etc.), one problem is the difficulty of having zero-width bins, the other is expressing errors for the y axis. How should this be resolved, if we are to provide data in the histogram objects? @GraemeWatt , @cranmer

eamonnmag commented 9 years ago

Really good progress on this. Thanks for the great work and advice from all.

michal-szostak commented 9 years ago

After a talk with @eamonnmag I remembered another problem which I encountered - ROOT embeds file paths into binary files, which can be a security problem (and is generally something to be avoided), especially if the files are to be generated on the server side. @GraemeWatt - is there any way to strip such metadata from ROOT files?

[EDIT] Additionally I fixed the issue with the axis naming and x axis error bars:

[Attached image: screenshot from 2015-08-21 14:54:55]

GraemeWatt commented 9 years ago

Great, we can always refine things like titles later on. Another reason why TGraphAsymmErrors is better than a TH1 is that it can support cases where the bin focus is not in the middle of the bin (example). Here, the YAML format has value, low and high, so the TGraphAsymmErrors object should be centred on value with asymmetric x errors of value-low and high-value.
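
As a sketch of the point placement described above (centred on value, with x errors of value-low and high-value), assuming each independent-variable entry arrives as a dict with optional `value`, `low` and `high` keys. The helper name `x_point` and the midpoint fallback when `value` is missing are assumptions for illustration, not something fixed in this thread.

```python
def x_point(entry):
    """Return (x, err_low, err_high) for one independent-variable entry.

    Assumes the entry has either a 'value', or both 'low' and 'high'.
    """
    low = entry.get("low")
    high = entry.get("high")
    if "value" in entry:
        x = entry["value"]
    else:
        # Assumption: with no explicit bin focus, use the bin midpoint.
        x = 0.5 * (low + high)
    if low is None or high is None:
        # No bin edges given: a point with zero x errors.
        return x, 0.0, 0.0
    return x, x - low, high - x


# Bin focus at 2.0 inside the bin [1.0, 4.0]: asymmetric x errors 1.0 and 2.0.
print(x_point({"value": 2.0, "low": 1.0, "high": 4.0}))  # (2.0, 1.0, 2.0)
```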

Depending on the number of independent variables, we should write either a TGraphAsymmErrors or TGraph2DErrors (only symmetric errors), or for histograms either TH1, TH2 or TH3. We also need to check that the data is numeric: probably alphanumeric independent variables can still be supported by ROOT (but will need special treatment), but alphanumeric dependent variables cannot. Indeed, we can find an appropriate ROOT object for most, but not all, data types. This was the main reason why we chose to use YAML rather than ROOT as the main input data format.

For writing histograms, we need to invent a standard naming format and write one histogram for the central value (e.g. y1, y2, y3 for three dependent variables), then the errors should be written into separate histograms [e.g. y1e1p and y1e1m (asymmetric error), y1e2 (symmetric error)], where the number after the y indicates the number of the dependent variable and the number after the e indicates the number of the error. Of course, each ROOT histogram potentially has errors, usually just set equal to the square root of the bin content, but we can just ignore that. You could use TH1::SetBinContent to fill the ROOT histogram. We need to first check that the data type is suitable for filling the ROOT histogram, e.g. that each high value matches the low value of the next bin and that high-low is greater than zero.
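
The suitability check described in the last sentence could look roughly like the sketch below. It assumes the bins are given as `(low, high)` pairs already sorted in increasing order; the function name `bins_fit_histogram` is hypothetical.

```python
import math


def bins_fit_histogram(bins):
    """Check whether (low, high) pairs can serve as contiguous histogram bins."""
    if not bins:
        return False
    # Every bin needs a strictly positive width.
    if any(high - low <= 0 for low, high in bins):
        return False
    # Each high edge must match the low edge of the next bin.
    return all(
        math.isclose(high, next_low)
        for (_, high), (next_low, _) in zip(bins, bins[1:])
    )


print(bins_fit_histogram([(0.0, 1.0), (1.0, 2.5)]))  # True
print(bins_fit_histogram([(0.0, 1.0), (1.5, 2.0)]))  # False: gap between bins
```

If the check fails, only the graph object would be written for that table.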

I don't know how to get rid of the file paths embedded in the binary .root file, but I'm not an expert.

michal-szostak commented 9 years ago

So in the case of two independent variables (TGraph2DErrors), how should the errors be displayed if they are asymmetric? By taking the bigger of the two errors and using it as a symmetric one? Also the question still remains: what to do in the case of e.g. 3 independent variables (@eamonnmag - can this situation happen?) Should it be impossible to export such an input to a ROOT object, or should we look for some workaround?

The same question applies to TH1, TH2 and TH3: they can go up to 3 independent variables, but not further. What would be a sensible way to handle 4 or more independent variables? (It would probably be a very rare thing to occur, but it cannot be excluded... or maybe it can? @eamonnmag comments?) And the question of how to handle zero-width bins still remains. Usually independent variables just have 'value' without 'high' & 'low' elements being specified. In cases where independent variables have 'high' and 'low' entries the histogram can easily be created, but for others it may be problematic.

As for alphanumeric values I don't suppose we support them... or do we? Someone more knowledgeable in the matter can comment on this.

GraemeWatt commented 9 years ago

The YAML format was designed to be very flexible to support the diversity of data types already in the existing HepData system and that might be provided in future, i.e. any number of independent and dependent variables (which can possibly be non-numeric). For the ROOT export we need to be more selective and we should not aim to provide a ROOT object for all possible data types. We should check the data type and export to a suitable ROOT object only if it is possible. If not, then we don't write any ROOT object (or only write TGraphAsymmErrors and not TH1, etc.). For example, don't write a TGraph2DErrors unless the errors are symmetric, and don't write a TH1 unless the independent variable has low and high elements. For the more complicated cases, it will be up to the user to write their own ROOT objects starting from other formats like YAML or CSV. We can always expand the list of supported ROOT objects later on.

michal-szostak commented 9 years ago

Great, so to sum up:

- Tables with one independent variable (and possibly multiple dependent variables) are represented as TGraphAsymmErrors objects, one for every dependent variable.
- If the independent variable has 'low' & 'high' entries, it will also be represented as a TH1 (additional histograms with errors will also be created). A question here: if the table has 2 independent variables, but only one has 'low' & 'high' elements, can it be represented as a TH2?
- If there are 2 independent variables, both with symmetric errors, TGraph2DErrors will be used.
- In all other cases with more than one independent variable, what should be done? Right now I basically create a TGraphAsymmErrors object for every independent -> dependent variable pair. I'm not a user, so I cannot say whether this approach is useful for later data analysis.

Anything I forgot?

GraemeWatt commented 9 years ago

Sounds good. For a TH2 (or TH3) all the independent variables need to have low and high entries. For Table 2 of your current .root file, there are two independent variables so you should write one TGraph2DErrors object (with zero errors) rather than two TGraphAsymmErrors objects. You should probably check that all variables are numeric and don't write ROOT objects for alphanumeric variables. (At a later stage, alphanumeric independent variables could be supported by using the bin number to define the histogram and then using TAxis::SetBinLabel.)
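
One reading of the rules agreed so far can be written down as a small decision function. This is a sketch only: the function name, its boolean parameters and the choice to return class names as strings are assumptions for illustration, and it does not create any ROOT objects.

```python
def root_objects_for(n_independent, all_numeric, all_have_bin_edges, errors_symmetric):
    """Return names of the ROOT classes to write for one dependent variable.

    all_have_bin_edges: every independent variable has 'low' and 'high'.
    errors_symmetric:   all errors on the dependent variable are symmetric.
    """
    if not all_numeric or not 1 <= n_independent <= 3:
        return []  # unsupported data type: write no ROOT object
    objects = []
    if n_independent == 1:
        objects.append("TGraphAsymmErrors")
    elif n_independent == 2 and errors_symmetric:
        objects.append("TGraph2DErrors")
    if all_have_bin_edges:
        objects.append(f"TH{n_independent}")  # TH1, TH2 or TH3
    return objects


print(root_objects_for(1, True, True, False))   # ['TGraphAsymmErrors', 'TH1']
print(root_objects_for(2, True, False, False))  # []
```

Keeping the decision separate from the writing step makes it easy to extend the list of supported objects later.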

michal-szostak commented 8 years ago

I updated the ROOT output (new version in master, as well as on PyPI (0.1.15)). Now histograms for all errors are created. @GraemeWatt is this exactly what you wanted, or is something still lacking?

We can discuss naming conventions now - the one used at this moment (concatenated names of the axes) is pretty evident, but a little long, is it acceptable? Also some sanitization was necessary (removal of '/' character from names, which may cause confusion in some cases)

EDIT: sample root file (used in tests) with this new histogram output is available here: https://github.com/HEPData/hepdata-converter/blob/master/hepdata_converter/testsuite/testdata/root/full.root

GraemeWatt commented 8 years ago

Great, thanks a lot! But please also write a TGraphAsymmErrors object (with total errors) in addition to the histograms. This duplicates some information, but some users will prefer graph objects to histograms.

Yes, I think you need to change the names of the histograms. There is no need to reproduce long axes names in the histogram names. The histogram names should be short and easy to implement in user code. I made some suggestions for concise standard histogram names in a comment above, e.g. yi, yiej, yiejp, yiejm, where i is an integer labelling the number of the dependent variable in a particular table and j is an integer labelling the number of the error for a particular dependent variable.

@cranmer, could you please check that @michal-szostak's implementation satisfies your requirements for ROOT histogram output and provide feedback for improvements to be made?

michal-szostak commented 8 years ago

What about indication of independent variable? I think it should also be specified. Format like: x$i_y$j_e$p where $i ... are variables.

GraemeWatt commented 8 years ago

No, we should only write one ROOT object regardless of the number of independent (x) variables, e.g. for two independent variables, we write one TH2 object rather than two TH1 objects.

@lukasheinrich will now help with testing the ROOT output and work on related extensions (ROOT input, HistFactory input/output, etc.).

michal-szostak commented 8 years ago

But what if there are more independent variables than a ROOT object can contain (in this case more than 3)? Should an error be thrown? How should this case be handled?

GraemeWatt commented 8 years ago

We discussed this already above: just don't write any ROOT objects if there are too many independent variables. The majority of current HepData tables have only one or two independent variables. We should not aim to find a ROOT representation for all possible data formats of the YAML representation.

michal-szostak commented 8 years ago

Yes, but that still leaves the problem with TGraph2DErrors, which only accepts symmetric errors. So following this reasoning, data with asymmetric errors and 2 independent variables should also be skipped, right?

GraemeWatt commented 8 years ago

Exactly, we only write ROOT objects if possible, so skip this case.

michal-szostak commented 8 years ago

Alright, shall we extend object's naming convention to normal histograms and graphs?

GraemeWatt commented 8 years ago

Yes, we should have a consistent naming scheme for all ROOT objects, e.g. Graph1D_y1, Hist1D_y1, etc. (A similar naming scheme should be used to write the YODA objects.)

michal-szostak commented 8 years ago

Ok, one last thing - what about histograms for a single asymmetric error? I would suggest something like Hist1D_y0_e0+ & Hist1D_y0_e0-, or Hist1D_y0_e0plus & Hist1D_y0_e0minus - what do you think? And a clarification on the indexing (I know it's a rather useless debate, but maybe you already have conventions in place): should dependent variables / errors be counted from 0 or 1?

GraemeWatt commented 8 years ago

I think we should count starting from 1 for compatibility with the existing YODA output, and + and - symbols in names can cause problems in code, so Hist1D_y1_e1plus and Hist1D_y1_e1minus is better.
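
A sketch of the resulting naming scheme, counting from 1 and using plus/minus suffixes. The function name `hist_names` and the `dim` parameter are hypothetical, and the unsuffixed name for a symmetric error (e.g. Hist1D_y1_e2) is an assumption extrapolated from the y1e2 suggestion earlier in the thread.

```python
def hist_names(dep_index, error_symmetry, dim=1):
    """Build histogram names for one dependent variable.

    dep_index:      1-based number of the dependent variable.
    error_symmetry: one bool per error, True if that error is symmetric.
    """
    base = f"Hist{dim}D_y{dep_index}"
    names = [base]  # histogram holding the central values
    for j, symmetric in enumerate(error_symmetry, start=1):
        if symmetric:
            names.append(f"{base}_e{j}")
        else:
            names.extend([f"{base}_e{j}plus", f"{base}_e{j}minus"])
    return names


# First dependent variable with one asymmetric and one symmetric error.
print(hist_names(1, [False, True]))
# ['Hist1D_y1', 'Hist1D_y1_e1plus', 'Hist1D_y1_e1minus', 'Hist1D_y1_e2']
```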

michal-szostak commented 8 years ago

I pushed a new version to master, and a new PyPI package is also available (version 0.1.16). All of the above comments have been addressed. @GraemeWatt, can you check whether I missed something? (example file: https://github.com/HEPData/hepdata-converter/blob/master/hepdata_converter/testsuite/testdata/root/full.root)

cranmer commented 8 years ago

Will try to check out the implementation as requested.

Kyle


HEPData / hepdata-converter

Investigate conversion output for ROOT #3