Different contractions have different output file structures

edbennett commented 1 year ago

I've noticed in recent work that the Meson contraction outputs a rather different XML structure to the WardIdentity contraction. This means downstream tooling to read and analyse these quantities needs to account separately for the two classes of input, rather than using a single generic reader.

Is there a strong reason for this choice? Would it be possible to make them more consistent?
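To illustrate the kind of generic reader meant here, a minimal Python sketch follows. It flattens any XML document into a mapping from element path to text, so it does not need to know a module's layout in advance. The element names in the sample (`meson`, `corr`, `gamma_snk`, and so on) are invented for illustration and are not the actual Hadrons output schema.

```python
import xml.etree.ElementTree as ET


def flatten_xml(text):
    """Map each leaf element's path (e.g. 'root/a/b[1]') to its stripped text."""
    root = ET.fromstring(text)
    out = {}

    def walk(elem, path):
        children = list(elem)
        if not children:
            out[path] = (elem.text or "").strip()
            return
        # Count tags so that repeated siblings get an index suffix.
        totals = {}
        for child in children:
            totals[child.tag] = totals.get(child.tag, 0) + 1
        seen = {}
        for child in children:
            idx = seen.get(child.tag, 0)
            seen[child.tag] = idx + 1
            suffix = f"[{idx}]" if totals[child.tag] > 1 else ""
            walk(child, f"{path}/{child.tag}{suffix}")

    walk(root, root.tag)
    return out


# Hypothetical sample; NOT the real Hadrons XML layout.
SAMPLE = """
<grid>
  <meson>
    <gamma_snk>Gamma5</gamma_snk>
    <corr><elem>1.0</elem><elem>0.5</elem></corr>
  </meson>
</grid>
"""

flat = flatten_xml(SAMPLE)
for key, value in flat.items():
    print(key, "=", value)
```

A reader like this copes with differing structures, but it only moves the problem: downstream code still has to know which path holds which quantity for each module.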

aportelli commented 1 year ago

Hi @edbennett, in the spirit of a modular structure, each module can output results in its own format. That was a design choice from the start, as measurement data can be really heterogeneous.

However, the key point is that all output data rely on Grid serialisable classes (which automatically generate the reader/writer code), or on custom classes that provide their own reader/writer. So in practice an analysis program can be linked against Hadrons, and a user never has to walk through the data structure on their own. That has been the case in my own data analysis workflow.

That does not mean this is ideal; I think there are a number of things about it that are a bit rigid and that I would like to address. I am currently drafting a roadmap for a major update of the whole data serialisation aspect in Hadrons, and I hope it can be released within the next year.

edbennett commented 1 year ago

Thanks Antonin for the extra information. I was chatting with Ryan about this in Swansea last week.

I don't think modularity necessarily implies that people can do what they want, especially when modules are hosted in the same repository. There are plenty of constraints on what the modules can do already so that they can interoperate successfully.

In an ideal world it would be great if we could standardise on a format (or system of formats/schemas) for observables. Tying the serialised representation to a specific C++ implementation of the computation seems to be a barrier to interoperability: lots of groups create correlation functions, and not all of them will do it based on Grid or Hadrons.

Exciting to hear that there is work on improvement to the serialisation, though! If this will move away from Grid's serialisation, then I'll highlight a feature request I made to Grid (thinking that Hadrons would pick it up automatically if Grid adopted it)—tracking of provenance in a standardised, interoperable way. It would be really good if we could make it easier to know which ensembles and trajectories were used to generate which data points and papers, while also removing opportunities for these to accidentally become inconsistent.
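As a rough sketch of what such provenance tracking could look like, the Python example below bundles a data set with a provenance record and a checksum over both, so that a later mismatch between data and metadata can be detected. The field names (`ensemble`, `trajectory`, `code_version`) and the JSON layout are assumptions made for illustration; they are not an existing Grid or Hadrons format, nor a community standard.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Provenance:
    ensemble: str      # hypothetical ensemble label
    trajectory: int    # configuration/trajectory number
    code_version: str  # e.g. a git commit hash of the measurement code


def wrap(values, prov):
    """Bundle data with its provenance and a checksum over both."""
    payload = {"provenance": asdict(prov), "values": values}
    # sort_keys gives a stable byte representation to hash.
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    payload["sha256"] = hashlib.sha256(blob).hexdigest()
    return payload


def verify(payload):
    """Return True if data and provenance still match the stored checksum."""
    body = {k: v for k, v in payload.items() if k != "sha256"}
    blob = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest() == payload.get("sha256")


record = wrap([1.0, 0.5, 0.25], Provenance("ensA", 1200, "abc1234"))
assert verify(record)

record["provenance"]["trajectory"] = 1300  # simulate an accidental edit
assert not verify(record)
```

Keeping the provenance inside the same record as the data, rather than in file names or a separate log, is what removes the opportunity for the two to drift apart unnoticed.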

aportelli commented 1 year ago

Hi @edbennett, thanks. I am not sure I agree with everything, but this is also too high-level; with these things the devil is in the details. Hadrons generates correlator files in the HDF5 format, which you could agree is a form of standardisation. In practice I have not seen the free structure of the underlying file being a barrier so far, as HDF5 is supported by virtually any language relevant for data analysis.

Lattice field theory measurements are very heterogeneous in nature, and it is hard to believe there is a set of metadata that can be standardised. I still think we should make more effort in the directions you indicated. However, I believe that design and standardisation should come before implementation, and before putting strong constraints on implementations I would prefer to see more momentum in the community to standardise data sharing.

That's why, at the moment, it sounds safer to me to have free-form data formats in software, rather than risk imposing standards which might not be adopted. If there is a community consensus on some of these things we will certainly implement them, as we already have done with the ILDG metadata in Grid.

edbennett commented 1 year ago

> Hadrons generates correlator files in the HDF5 format, which you could agree is a standardisation. In practice I have not seen the free structure of the underlying file being a barrier so far, as HDF5 is supported by virtually any language relevant for data analysis.

Thanks Antonin. I suspect user error is part of the issue then—I am using the XML serialisation rather than HDF5. Perhaps if I switch over my life will become easier.

That said, if e.g. two tools both use HDF5 but use different column name conventions, then having the data interoperate with each other becomes harder, so even in a standardised wrapper format there is space to provide structure that further aids interoperability.
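To make the naming point concrete, here is a small Python sketch of the glue code that becomes necessary when two tools store the same quantities under different names. The alias table and all field names are hypothetical, chosen only for illustration; they do not describe the conventions of Hadrons or of any other real tool.

```python
# Hypothetical alias table: each canonical name lists spellings that
# different tools might use for the same quantity.
ALIASES = {
    "correlator": {"corr", "correlator", "C_t"},
    "source_gamma": {"gamma_src", "source_gamma", "src"},
    "sink_gamma": {"gamma_snk", "sink_gamma", "snk"},
}

# Invert once so lookups go from tool-specific name to canonical name.
TO_CANONICAL = {
    alias: canonical
    for canonical, aliases in ALIASES.items()
    for alias in aliases
}


def normalise(record):
    """Return a copy of `record` with known keys renamed to canonical names.

    Unknown keys are kept unchanged; two keys mapping to the same canonical
    name raise ValueError rather than silently overwriting each other.
    """
    out = {}
    for key, value in record.items():
        canonical = TO_CANONICAL.get(key, key)
        if canonical in out:
            raise ValueError(f"duplicate field after renaming: {canonical}")
        out[canonical] = value
    return out


tool_a = {"corr": [1.0, 0.5], "gamma_src": "Gamma5", "gamma_snk": "Gamma5"}
tool_b = {"C_t": [1.0, 0.5], "src": "Gamma5", "snk": "Gamma5"}

assert normalise(tool_a) == normalise(tool_b)
```

With an agreed schema the alias table would be unnecessary; without one, every pair of tools needs a mapping of this kind maintained by hand.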

> However I believe that design and standardisation should come before implementation, and before putting strong constraints on implementations I would prefer seeing more momentum in the community to standardise data sharing.

I don't disagree—I hope to do some reading as to how such consensuses formed in other fields: whether it was an existing format that others coalesced around, or whether a specification was designed from first principles and then implementations written.

> If there is a community consensus on some of these things we will certainly implement them

Hopefully I can work out how this can be achieved :)

aportelli / Hadrons

Different contractions have different output file structures #113