HDRMX / NXmx

NeXus/HDF5 MX format

Storage of multiple independent data sets in single NeXus file #2

Open graeme-winter opened 5 years ago

graeme-winter commented 5 years ago

In several places, serial oscillation data collection is performed with one arm and multiple triggers, typically with each data set of identical size - examples are:

https://zenodo.org/record/1442922
https://zenodo.org/record/2539519

(other examples exist)

Should we use VDS for this as well? At the moment various methods are used to give "hints" that this is the case e.g. omega data sets which have been rewritten. This is a closely related problem to #1 though different in that there is no joint [UB] matrix.

graeme-winter commented 4 years ago

I note that on Eigers with the Dectris file writer we have

Grey-Area andrew-12-trigger-eiger :) $ h5dump -d /entry/instrument/detector/detectorSpecific/ntrigger example_16_1_master.h5 
HDF5 "example_16_1_master.h5" {
DATASET "/entry/instrument/detector/detectorSpecific/ntrigger" {
   DATATYPE  H5T_STD_U32LE
   DATASPACE  SCALAR
   DATA {
   (0): 12
   }
}
}

which is a hint - and I would be much happier if it were outside detectorSpecific
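For illustration, the inference downstream software currently has to make from this hint could be sketched as follows (the helper name is mine, not from any file writer): given ntrigger and the total image count, recover the frame ranges of the individual data sets, assuming they are all the same size as in the examples above.

```python
# Hypothetical helper (illustrative only, not part of any file writer):
# recover per-trigger frame ranges from the ntrigger hint, assuming
# each data set is the same size.
def trigger_ranges(n_images_total, ntrigger):
    if ntrigger < 1 or n_images_total % ntrigger:
        raise ValueError("unequal data sets cannot be recovered from ntrigger alone")
    per_set = n_images_total // ntrigger
    # half-open (start, stop) frame ranges, one per trigger
    return [(k * per_set, (k + 1) * per_set) for k in range(ntrigger)]
```

Note this is exactly the guesswork at issue: the hint alone cannot describe unequal data sets, which is one reason an explicit description would be preferable.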

epanepucci commented 4 years ago

At the SLS we have a daq method called bookmarked data collection which is basically a discrete helical'ish scan. The user pre-aligns positions on a large crystal and the data acquisition engine splits the total range across the number of pre-defined positions. With the Eiger we use one arm and multiple triggers (one trigger per position). The dataset appears as if it were taken in a single sweep and is processed as such.

gsantoni commented 4 years ago

@epanepucci for pseudo-helical that is correct, but here we need to keep the information that each data collection comes from a separate crystal and thus has, as a first step, to be processed independently of the others before going forward with the analysis. @graeme-winter sorry for the dumb question, but what do you mean by VDS?

biochem-fan commented 4 years ago

@keitaroyam (the depositor of the first dataset and former beamline scientist at BL32XU): Do you modify the master H5 after it is written by the EIGER?

I would also like to raise a point about the definition of "serial crystallography": https://www.jiscmail.ac.uk/cgi-bin/webadmin?A2=ind1907&L=CCP4BB&P=R183063 Personally, I agree with Thomas White's post there. Whatever consensus we reach in future, this term should be defined properly when we write the specification.

graeme-winter commented 4 years ago

@gsantoni VDS is virtual data set - you can have e.g. all the data in a "real" data set which consists of say 1200 images, then you can have virtual data sets which refer into this which could encode the "true" structure of the data.

In a more complex case (let's say you have 8 x 150 image segments across 12 x 100 image h5 data files) you can sort the mappings out so it looks like you just have 8 x 150 image data sets. I have some example code somewhere, I will dig it out.
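To make the simple case concrete, here is a minimal sketch using h5py's virtual dataset API (requires h5py >= 2.9; the function and dataset names are mine, illustrative rather than any proposed convention): carve one real image stack into equal-sized virtual data sets that encode the "true" structure.

```python
# Sketch, assuming h5py >= 2.9 (virtual dataset support); make_vds and
# the data_NNNNNN naming are illustrative, not a proposed convention.
import h5py

def make_vds(real_file, real_dset, n_sets, frames_per_set, out_file):
    """Expose one real image stack as n_sets equal virtual data sets."""
    with h5py.File(real_file, "r") as f:
        shape = f[real_dset].shape   # e.g. (1200, ny, nx)
        dtype = f[real_dset].dtype
    source = h5py.VirtualSource(real_file, real_dset, shape=shape)
    with h5py.File(out_file, "w") as out:
        for k in range(n_sets):
            layout = h5py.VirtualLayout(
                shape=(frames_per_set,) + shape[1:], dtype=dtype)
            start = k * frames_per_set
            # map frames [start, start + frames_per_set) of the real
            # stack onto this virtual data set
            layout[:] = source[start:start + frames_per_set]
            out.create_virtual_dataset("data_%06d" % k, layout)
```

The more complex many-files-to-many-segments case works the same way, just with several VirtualSource objects and piecewise slice assignments into each layout.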

graeme-winter commented 4 years ago

@biochem-fan re: "serial oscillation crystallography" - I was trying to capture the idea of recording a large number of small data sets as a sequence - so this lies between the single data set and still-shot crystallography use cases. I am not all that interested in the name we choose to use (i.e. I have no great franchise in the outcome) beyond there being some definition which can capture this idea. Perhaps "sequential oscillation" vs. "serial still" vs. "serial oscillation" or something.

fleon-psi commented 4 years ago

Is it not better to keep two things separate - the high-level ordering of serial oscillation datasets and the low-level information on triggers? One can use multiple triggers for many different reasons, e.g. to synchronise with lasers, shutters, goniometer positions or other devices, and it does not tell you whether the dataset is sequential oscillation, serial still, serial oscillation or something else.

graeme-winter commented 4 years ago

@fleon-psi - this lines up with what I am looking for - to have the structure of the data defined in an explicit way i.e. not "guessing" based on anything in detectorSpecific

@biochem-fan reading again - I believe when I talked to @keitaroyam about this it was the case that the master file is rewritten (I have a memory of a script to do this being included in the deposition?)

graeme-winter commented 4 years ago

Here we go: https://github.com/keitaroyam/yamtbx/blob/master/doc/eiger-en.md#modification-on-masterh5-at-spring-8

biochem-fan commented 4 years ago

@graeme-winter Great, you found it yourself :) I remembered he mentioned this but I couldn't find the details, so I tagged him to explain himself.

graeme-winter commented 4 years ago

Reading https://manual.nexusformat.org/rules.html it would seem that the correct way to do this is to have > 1 NXentry at the top level -

/entry1
/entry2 

etc. however I would imagine that this would cause many issues with analysis code which assumes the existing file structure.
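A hedged sketch of what that multi-entry layout would look like when written with h5py, following the usual NeXus NX_class group-attribute convention (all the per-entry NXmx contents that a real file would carry are omitted here):

```python
# Minimal sketch of the /entry1, /entry2, ... layout discussed above;
# a real file would of course carry the full NXmx contents per entry.
import h5py

def write_multi_entry(path, n_entries):
    with h5py.File(path, "w") as f:
        for k in range(1, n_entries + 1):
            entry = f.create_group("entry%d" % k)
            entry.attrs["NX_class"] = "NXentry"
```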

graeme-winter commented 4 years ago

OK, I am pulling things from the grey-matter version of tape -

https://zenodo.org/record/3611103

https://github.com/graeme-winter/NXmxtools/blob/master/vdsmaker.py

This is a script to make 6 x 300 image virtual data sets from 1 x 1800 images @gsantoni - so we have the "real" data recorded in the usual way by the detector (or by ODIN or whatever) in data_00000N.h5, and then as many "NeXus" (i.e. master-like) files as needed which refer into these. Note that in this example the virtual data sets deliberately do not map cleanly onto data_00000N.h5. I guess this is an appropriate model, as whatever reads the top-level NeXus files need not know or care about the underlying file structure. I should check that these work using direct chunk read... I am fairly sure I checked this at the time with XDS using Durin.

CV-GPhL commented 4 years ago

Not sure this is relevant or useful, but within autoPROC we look at the /entry/instrument/detector/detectorSpecific/ntrigger value and the rotation axis values (not just Omega, but also Kappa, Chi and Phi if recorded) to distinguish between the different data collection scenarios.

Caveat-1: this of course doesn't catch a whole lot of other cases, but seems to describe "typical" use cases.

Caveat-2: this hasn't been widely tested (apart from the helical scans from SLS Zac mentioned), so it might well not work as expected.
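The kind of heuristic described could be sketched as below (my rough reconstruction for illustration, not autoPROC's actual code): ntrigger > 1 combined with a rotation axis that does not advance uniformly suggests multiple independent sweeps rather than a single one.

```python
# Rough reconstruction of the heuristic described above (illustrative
# only - not autoPROC's actual logic): ntrigger > 1 plus a rotation
# axis whose step changes suggests multiple independent sweeps.
def looks_like_multiple_sweeps(ntrigger, omega, tol=1e-6):
    if ntrigger <= 1 or len(omega) < 3:
        return False
    steps = [b - a for a, b in zip(omega, omega[1:])]
    # a single sweep has a constant step; a reset between triggers
    # shows up as an outlying (usually negative) step
    return any(abs(s - steps[0]) > tol for s in steps)
```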

graeme-winter commented 4 years ago

@CV-GPhL makes sense, and describes exactly the sorts of heuristics that NeXus was designed to avoid 🙂 - it was us starting to write exactly the same logic into xia2 / dials which got me thinking that this is surely not ideal (I have a similar hack in xia2 now, but I'm not really happy about it)

phyy-nx commented 4 years ago

Multiple entry records in a single NeXus master does seem like the best way to do this, in combination with virtual datasets to handle the mapping. Makes sense to me.

keitaroyam commented 4 years ago

Couldn't agree more. As @biochem-fan mentioned, master.h5 files at SPring-8 are modified to make the omega table right, but it is still confusing for users.

graeme-winter commented 4 years ago

@phyy-nx - I see this, however I wonder how a user in e.g. XDS.INP will specify which dataset to process. Clearly in something like xia2 I can mangle the user-facing API to my heart's content, and there is also already support for processing one or more directories of data, which will work fine.

But it is not the software I am involved with that concerns me here - it is software in widespread use whose developers are not "in the loop" on these decisions. Any best practice needs to fit in with their use case.

I have previously considered the idea of slightly splitting the conventions / best practice between those files named foo_master.h5 and those named bar.nxs: e.g. for the former, insisting that there is exactly one /entry with one apparent data set (which could be virtual, as above), while allowing the latter to adopt all the bells & whistles afforded by the full gamut of NeXus features. The latter may internally contain several data sets which refer into one or more virtual data sets.

We can then encourage facilities to produce both - the foo_master.h5 will effectively be guaranteed to "play nice" with XDS (even if not with neggia) and will be familiar to users. The bar.nxs representation can include the full relationship between the different data sets, for long-term provenance and for software which can take advantage of such information (e.g. autoPROC and xia2).

Any thoughts on this proposal?

Clearly, when faced with data from a Dectris file writer, some rewriting of master files will be necessary to address any of the requirements we describe here.