Closed eamonnmag closed 8 years ago
Great. Bringing @cranmer and @betatim into the discussion here in case they have any preferences on how data should be exported to ROOT.
The current "ROOT" export is really just a CINT script that makes a plot. We don't want to continue this format in the new system, at least not initially. Instead we should export the data to suitable ROOT objects (depending on the data type) and write them as a binary .root file rather than a CINT script. Histograms (TH1, TH2, etc.), or maybe graph objects (TGraph, TGraphAsymmErrors, etc.), will be the appropriate ROOT objects in most cases. If we manage to support HistFactory as an input format, it would be good to be able to export to the same format, and HistFactory might provide a nice way of organising the multiple histograms associated with a particular measurement.
I was also thinking that it would be better to export directly to binary format instead of an interpreter script. Thankfully, ROOT provides very good integration with Python via the ROOT package, so writing actual binary objects should be even easier than creating a CINT script (or its newer incarnation in ROOT 6), in addition to being faster on the client side.
+1
But my question still holds: how should I find out what type of data I'm working on (TH1, TH2, TGraph, etc.)? @eamonnmag?
Pretty much all the current hepdata tables could be encoded as histograms in ROOT, so, TH1. Graeme or someone better versed with ROOT would have to confirm though.
Not sure I understand the question, but you can check the type of an object stored in a ROOT file to find out if it is a TH1, TH2, etc., using isinstance(object, ROOT.TH1). Beware of the slightly screwy inheritance structure though: isinstance(a_th2, ROOT.TH1) == True (!!).
The question was more, "given a current table in HEPdata, what type of object in ROOT should represent it". Assume that there is no pre-existing ROOT object since all the files are being imported afresh.
The question was a little different - ROOT is the output format, the input is "almost" raw data (with some metadata describing it) - https://github.com/HEPData/hepdata-submission. So in order to construct proper ROOT objects (TH1 or otherwise), I must make an educated guess about what the data represents.
For now I'm creating a file which looks like this (image in the attachment). The class used to store the data is the same as in the original HEPData ROOT output: TGraphAsymmErrors. I didn't use TH1 because I couldn't find a sensible way to insert errors into it (@GraemeWatt - any comment on this?). The TGraphAsymmErrors class itself provides only total errors, so I'm basically doing the same thing as with the YODA format: I sum all the errors in quadrature.
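The sum in quadrature mentioned above can be sketched as follows. This is only an illustration, not the converter's actual code: the helper name `total_asymmetric_error` and the representation of each error as a `(minus, plus)` pair of magnitudes are assumptions made for the example.

```python
import math


def total_asymmetric_error(errors):
    """Combine individual errors into one total (minus, plus) pair.

    `errors` is a list of (minus, plus) magnitudes; a symmetric error e
    is passed as (e, e). Hypothetical helper, not part of hepdata-converter.
    """
    # Each side is summed in quadrature independently of the other.
    minus = math.sqrt(sum(m * m for m, _ in errors))
    plus = math.sqrt(sum(p * p for _, p in errors))
    return minus, plus


# A symmetric error of 3 combined with a symmetric error of 4 gives 5.
print(total_asymmetric_error([(3.0, 3.0), (4.0, 4.0)]))  # (5.0, 5.0)
```

The resulting pair would be what goes into the low and high y errors of a single graph point, at the cost of losing the breakdown into individual error sources.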
That looks good. I agree that TGraphAsymmErrors is better than TH1 if we want only one ROOT object per dependent variable. You can write the low and high values of each bin as x errors and write the headers as axes labels instead of as a title. You should check that there is only one independent variable (if there are two independent variables, you would need something like TGraph2DErrors).
TGraphAsymmErrors objects are also used by YODA in their bin/yoda2root script (see include/YODA/ROOTCnv.h) when converting a YODA Scatter2D data type. By the way, src/WriterYODA.cc looks quite simple, so I think it is fine just to write the YODA format by hand (#5) without importing the Python interface. Parsing the YODA format (#10) is more difficult and might need to be done by a separate tool requiring the YODA package (and maybe also the Rivet package) to be installed, so I would leave that for the moment.
At a later stage we can try to write one TH1 object for the central value and each of the errors and use HistFactory to organise the multiple histograms, but that would be another output format.
Hi
I haven’t read all of this, but very interested. Will reply more soon.
Quick feedback:
Kyle
The problem with histograms is that often the bin widths are not given in existing HepData records (e.g. last record added) and for some observables it is not even meaningful to give bin widths. As far as I know (?), a TH1 with zero bin widths cannot be created in ROOT, whereas it is easy to create a TGraphAsymmErrors with zero x errors. But I guess we can support both options simultaneously, so that we don't need a separate ROOT export for use with HistFactory. We can write a TGraphAsymmErrors object with only total errors, then if the bin width is non-zero, we can also write separate TH1 objects for the central value and each of the individual errors.
The good thing about the ROOT format is that a file can contain a virtually unlimited number of objects. So I would say that, as a further enhancement, we can provide histogram objects for the data that allow it, and just write them to the directory of the table. This way the user will have a choice of which object to use.
The ROOT file from which the screenshot was taken is in the mszostak/root branch of the git repository, at hepdata-converter/hepdata_converter/testsuite/testdata/root/full.root (https://github.com/HEPData/hepdata-converter/blob/mszostak/root/hepdata_converter/testsuite/testdata/root/full.root). It will be updated after making improvements to the code.
As for multiple independent variables - I understand that for two I can (should) use TGraph2DErrors, but what about more? Any suggestions, @GraemeWatt?
So if the current title becomes the axis labels, what should the title of the graph / histogram be? The name of the table is a little ambiguous because there may be several graphs / histograms per table.
As for histograms (TH1, etc.), one problem is the difficulty of having zero-width bins, the other is expressing errors on the y axis. How should this be resolved, if we are to provide data in histogram objects? @GraemeWatt, @cranmer
Really good progress on this. Thanks for the great work and advice from all.
After a talk with @eamonnmag I remembered another problem which I encountered: ROOT embeds file paths into binary files, which can be a security problem (and is generally something to be avoided), especially if the files are to be generated on the server side. @GraemeWatt - is there any way to strip such metadata from ROOT files?
[EDIT] Additionally I fixed the issue with the axis naming and x axis error bars:
Great, we can always refine things like titles later on. Another reason why TGraphAsymmErrors is better than a TH1 is that it can support cases where the bin focus is not in the middle of the bin (example). Here, the YAML format has value, low and high, so the TGraphAsymmErrors object should be centred on value with asymmetric x errors of value-low and high-value.
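As a rough sketch of the x-error rule just described: the function name `x_point_and_errors` and the dict layout are assumptions for illustration, and the midpoint fallback for entries that have only low and high is a guess of this sketch rather than something agreed in the thread.

```python
def x_point_and_errors(entry):
    """Return (x, exl, exh) for one independent-variable entry.

    `entry` is a dict that may contain 'value', 'low' and 'high'.
    Hypothetical helper, not part of hepdata-converter.
    """
    low = entry.get("low")
    high = entry.get("high")
    if "value" in entry:
        x = entry["value"]
    elif low is not None and high is not None:
        # Assumption: with no explicit value, use the bin midpoint.
        x = 0.5 * (low + high)
    else:
        raise ValueError("entry needs 'value' or both 'low' and 'high'")
    # Missing bin limits translate into zero x errors.
    exl = x - low if low is not None else 0.0
    exh = high - x if high is not None else 0.0
    return x, exl, exh


# Bin focus at 2.5 inside the bin [2.0, 4.0]: asymmetric x errors.
print(x_point_and_errors({"value": 2.5, "low": 2.0, "high": 4.0}))  # (2.5, 0.5, 1.5)
```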
Depending on the number of independent variables, we should write either a TGraphAsymmErrors or TGraph2DErrors (only symmetric errors), or for histograms either TH1, TH2 or TH3. We also need to check that the data is numeric: probably alphanumeric independent variables can still be supported by ROOT (but will need special treatment), but alphanumeric dependent variables cannot. Indeed, we can find an appropriate ROOT object for most, but not all, data types. This was the main reason why we chose to use YAML rather than ROOT as the main input data format.
For writing histograms, we need to invent a standard naming format and write one histogram for the central value (e.g. y1, y2, y3 for three dependent variables), then the errors should be written into separate histograms [e.g. y1e1p and y1e1m (asymmetric error), y1e2 (symmetric error)], where the number after the y indicates the number of the dependent variable and the number after the e indicates the number of the error. Of course, each ROOT histogram potentially has errors, usually just set equal to the square root of the bin content, but we can just ignore that. You could use TH1::SetBinContent to fill the ROOT histogram. We need to first check that the data type is suitable for filling the ROOT histogram, e.g. that each high value matches the low value of the next bin and that high-low is greater than zero.
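The suitability check described above could look something like this. It is a sketch under stated assumptions: the function name `bin_edges` and the list-of-dicts input are hypothetical, and exact floating-point equality between neighbouring bin limits is assumed (real data might need a tolerance).

```python
def bin_edges(bins):
    """Return the list of bin edges, or None if a histogram cannot be filled.

    `bins` is a list of dicts with 'low' and 'high' keys, ordered along
    the axis. Hypothetical helper, not part of hepdata-converter.
    """
    if not bins:
        return None
    edges = []
    for i, b in enumerate(bins):
        # Every bin needs explicit limits.
        if "low" not in b or "high" not in b:
            return None
        # Bin width must be strictly positive.
        if b["high"] - b["low"] <= 0:
            return None
        # Each low must match the high of the previous bin (no gaps or overlaps).
        if i > 0 and b["low"] != bins[i - 1]["high"]:
            return None
        edges.append(b["low"])
    edges.append(bins[-1]["high"])
    return edges


print(bin_edges([{"low": 0.0, "high": 1.0}, {"low": 1.0, "high": 3.0}]))  # [0.0, 1.0, 3.0]
print(bin_edges([{"low": 0.0, "high": 1.0}, {"low": 2.0, "high": 3.0}]))  # None
```

If the edges come back as a list, they could presumably be handed to one of the variable-bin-width TH1 constructors, with the contents then set bin by bin via TH1::SetBinContent.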
I don't know how to get rid of the file paths embedded in the binary .root file, but I'm not an expert.
So in the case of two independent variables (TGraph2DErrors), how should the errors be displayed if they are asymmetric? By taking the bigger of the two errors and using it as a symmetric one?
Also the question still remains - what to do in the case of, e.g., 3 independent variables (@eamonnmag - can this situation happen?). Should such an input be impossible to export to a ROOT object, or should we look for some workaround?
The same question applies to TH1, TH2 and TH3: they can go up to 3 independent variables, but not further. What would be a sensible way to handle 4 or more independent variables? (It would probably be a very rare thing to occur, but cannot be excluded... or maybe it can? @eamonnmag, comments?) And the question of how to handle zero-width bins still remains. Usually independent variables just have 'value' without 'high' & 'low' elements specified. When independent variables have 'high' and 'low' entries the histogram can easily be created, but for the others it may be problematic.
As for alphanumeric values I don't suppose we support them... or do we? Someone more knowledgeable in the matter can comment on this.
The YAML format was designed to be very flexible to support the diversity of data types already in the existing HepData system and that might be provided in future, i.e. any number of independent and dependent variables (which can possibly be non-numeric). For the ROOT export we need to be more selective and we should not aim to provide a ROOT object for all possible data types. We should check the data type and export to a suitable ROOT object only if it is possible. If not, then we don't write any ROOT object (or only write TGraphAsymmErrors and not TH1, etc.). For example, don't write a TGraph2DErrors unless the errors are symmetric, and don't write a TH1 unless the independent variable has low and high elements. For the more complicated cases, it will be up to the user to write their own ROOT objects starting from other formats like YAML or CSV. We can always expand the list of supported ROOT objects later on.
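One way to read the selection rules discussed so far is sketched below. The function name `choose_root_objects` and its boolean parameters are hypothetical, and some choices are assumptions of this sketch rather than decisions from the thread, in particular that a histogram is still written for two independent variables when the errors are asymmetric.

```python
def choose_root_objects(n_independent, all_numeric, has_bin_edges, errors_symmetric):
    """Return the names of the ROOT classes that could be written for a table.

    has_bin_edges: every independent variable has usable 'low'/'high' entries.
    Hypothetical helper, not part of hepdata-converter.
    """
    # Non-numeric data, or more independent variables than ROOT objects
    # can hold, means no ROOT object at all.
    if not all_numeric or n_independent < 1 or n_independent > 3:
        return []
    objects = []
    if n_independent == 1:
        objects.append("TGraphAsymmErrors")
    elif n_independent == 2 and errors_symmetric:
        # TGraph2DErrors only supports symmetric errors.
        objects.append("TGraph2DErrors")
    if has_bin_edges:
        # TH1, TH2 or TH3 depending on the number of independent variables.
        objects.append(f"TH{n_independent}")
    return objects


print(choose_root_objects(1, True, True, False))   # ['TGraphAsymmErrors', 'TH1']
print(choose_root_objects(2, True, False, False))  # []
```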
Great, so to sum up:
- TGraphAsymmErrors objects (one for every dependent variable)
- TH1. Here is a question: if the table has 2 independent variables, but only one has 'low' & 'high' elements, can it be represented as a TH2? (also additional histograms with errors will be created)
- TGraph2DErrors will be used
- TGraphAsymmErrors object for every pair of independent -> dependent variable. I'm not a user, so I cannot say if this approach is useful for later data analysis.

Anything I forgot?
Sounds good. For a TH2 (or TH3) all the independent variables need to have low and high entries. For Table 2 of your current .root file, there are two independent variables so you should write one TGraph2DErrors object (with zero errors) rather than two TGraphAsymmErrors objects. You should probably check that all variables are numeric and don't write ROOT objects for alphanumeric variables. (At a later stage, alphanumeric independent variables could be supported by using the bin number to define the histogram and then using TAxis::SetBinLabel.)
I updated the ROOT output (new version in master, as well as on PyPI (0.1.15)). Now histograms for all errors are created. @GraemeWatt, is this exactly what you wanted, or is something still lacking?
We can discuss naming conventions now - the one used at the moment (concatenated names of the axes) is pretty evident, but a little long. Is it acceptable? Also some sanitization was necessary (removal of the '/' character from names, which may cause confusion in some cases).
EDIT: sample root file (used in tests) with this new histogram output is available here: https://github.com/HEPData/hepdata-converter/blob/master/hepdata_converter/testsuite/testdata/root/full.root
Great, thanks a lot! But please also write a TGraphAsymmErrors object (with total errors) in addition to the histograms. This duplicates some information, but some users will prefer graph objects to histograms.
Yes, I think you need to change the names of the histograms. There is no need to reproduce long axes names in the histogram names. The histogram names should be short and easy to implement in user code. I made some suggestions for concise standard histogram names in a comment above, e.g. yi, yiej, yiejp, yiejm, where i is an integer labelling the number of the dependent variable in a particular table and j is an integer labelling the number of the error for a particular dependent variable.
@cranmer, could you please check that @michal-szostak's implementation satisfies your requirements for ROOT histogram output and provide feedback for improvements to be made?
What about an indication of the independent variable? I think it should also be specified, with a format like x$i_y$j_e$p where $i ... are variables.
No, we should only write one ROOT object regardless of the number of independent (x) variables, e.g. for two independent variables, we write one TH2 object rather than two TH1 objects.
@lukasheinrich will now help with testing the ROOT output and work on related extensions (ROOT input, HistFactory input/output, etc.).
But what if there are more independent variables than a ROOT object can contain (in this case, more than 3)? Should an error be thrown, or how should this case be handled?
We discussed this already above: just don't write any ROOT objects if there are too many independent variables. The majority of current HepData tables have only one or two independent variables. We should not aim to find a ROOT representation for all possible data formats of the YAML representation.
Yes, but it still leaves problem with TGraph2DErrors which only accepts symmetric errors. So following this reasoning data with asymmetric errors and 2 independent variables should also be skipped, right?
Exactly, we only write ROOT objects if possible, so skip this case.
Alright, shall we extend object's naming convention to normal histograms and graphs?
Yes, we should have a consistent naming scheme for all ROOT objects, e.g. Graph1D_y1, Hist1D_y1, etc. (A similar naming scheme should be used to write the YODA objects.)
Ok, one last thing - what about histograms for a single asymmetric error? I would suggest something like Hist1D_y0_e0+ & Hist1D_y0_e0-, or Hist1D_y0_e0plus & Hist1D_y0_e0minus - what do you think? And a clarification on the indexing (I know it's a rather useless debate, but maybe you already have conventions in place): should dependent variables / errors be counted from 0 or 1?
I think we should count starting from 1 for compatibility with the existing YODA output, and + and - symbols in names can cause problems in code, so Hist1D_y1_e1plus and Hist1D_y1_e1minus is better.
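For illustration, the naming scheme agreed above could be generated along these lines. The helper `root_object_name` is hypothetical, and the names for symmetric errors (e.g. Hist1D_y1_e1) and for more than one independent variable (e.g. Hist2D_y1) are extrapolations from the examples given in the thread, not confirmed conventions.

```python
def root_object_name(kind, n_independent, y_index, error_index=None, sign=None):
    """Build names such as Graph1D_y1, Hist1D_y1 or Hist1D_y1_e1plus.

    kind: "Hist" or "Graph"; indices count from 1, as for the YODA output;
    sign: None for a symmetric error, or "plus" / "minus".
    Hypothetical helper, not part of hepdata-converter.
    """
    if y_index < 1 or (error_index is not None and error_index < 1):
        raise ValueError("indices count from 1")
    if sign not in (None, "plus", "minus"):
        raise ValueError("sign must be None, 'plus' or 'minus'")
    name = f"{kind}{n_independent}D_y{y_index}"
    if error_index is not None:
        # Spelled-out suffixes avoid '+' and '-' characters in object names.
        name += f"_e{error_index}{sign or ''}"
    return name


print(root_object_name("Graph", 1, 1))             # Graph1D_y1
print(root_object_name("Hist", 1, 1, 1, "plus"))   # Hist1D_y1_e1plus
print(root_object_name("Hist", 1, 1, 1, "minus"))  # Hist1D_y1_e1minus
```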
I pushed a new version to master, and a new PyPI package is also available (version 0.1.16). All the above comments have been included. @GraemeWatt, can you check whether I missed something? (example file: https://github.com/HEPData/hepdata-converter/blob/master/hepdata_converter/testsuite/testdata/root/full.root)
Will try to check out the implementation as requested.
Kyle
If by output for ROOT you mean a ROOT script (like the one here http://hepdata.cedar.ac.uk/view/ins1382590/d2/root), I don't think that would be a problem. I might work on this in parallel with the CSV format, as they will probably share some similarities in how the data is processed.
By the way - how should I detect what type of data is in the table? The HEPData frontend somehow does it, so it should be possible even without it being specified explicitly. @eamonnmag - I heard you were responsible for drawing data in the frontend, can you share how you detect what kind of data it is?