Standardize output API for the different readers

dmpelt commented 7 years ago

Some functions now read properties from the file, returning them as the fourth output variable. Others however, only return three variables (tomo, flat, and dark), or fewer. Should we try to standardize this output API a bit to make it easier to write code/scripts that supports all readers?

One way would be to always return four variables: tomo, flat, dark, and a Python dictionary with all metadata that can be read (or is requested through input arguments). This Python dictionary could have a public 'API' as well, for example always having the projection angles accessible with the 'angles' key. To check whether angles are actually available, a user could then do something like if 'angles' in metadata. Another option is to always return only one variable: a Python dictionary, with the 'tomo' etc keys representing the tomo etc arrays, but that might be a bit too generic.

These changes will all be very easy to implement, but we would first have to decide on what we want.

michael-sutherland commented 4 years ago

I like the metadata dict. It would be great if we had a standard to follow. I'm adding another reader right now and I'm making up names for variables (should xray energy be XEng, xeng, xray_energy, x_ray_energy, etc). Any suggestions for naming?

michael-sutherland commented 4 years ago

Were I to design something from the ground up, I'd like to create something with a pandas like interface and lazy loading for images. Then we could slice the data and metadata and only load the images we want.

prjemian commented 4 years ago

from Jon Tischler, APS

prjemian commented 4 years ago

How many different ways can you describe a single term? Best to pick one way. I suggest the most obvious one: energy

jdgelb commented 4 years ago

I like "energy" too. Is the plan to write this in the exchange group? I'm logging energy right now as a matter of course, but currently stashing it in /measurement/acquisition/source (for lack of a more convenient location!). Whether or not this value belongs in the exchange group though I think depends on intended usage. Is it a single value for a single /exchange/data object? Or if somebody was to do a XANES tomography, would there be multiple data objects and, thus, a list of energies? Or is the general protocol one energy entry for each image in the /exchange/data object, thus always a list (just like theta)?

michael-sutherland commented 4 years ago

I think we could have a similar discussion for a lot of variables (angles vs thetas, stage names, etc). I'm hoping to NOT reinvent the wheel if there is a good existing naming convention. I found NXtomo, as an example to picking a standard and adopting it for our naming.

decarlof commented 4 years ago

@michael-sutherland, all,

Several years ago, I used Jon Tischler, NXtomo (which was at the very early stage with many tomographic definition not agreed on) and Felipe Maia/Stefano Marchesini cxidb (https://www.cxidb.org) and asked the tomography community to define and adopt a common list of tags to label a tomographic scan with data collection and data analyisis meta data.

The result is dxfile https://dxfile.readthedocs.io/en/latest/source/xraytomo.html that has labels mostly identical that of the above (NX and cxibd) the only notable exception is in NXtomo the use of

image_key[nFrames]: (required) NX_INT

which was not agreed since all synchrotrons using the area detector HDF plug can directly store, at data collection time, the data in the corresponding hdf "folder" label (data, data_dark, data_white), which was considered a simplification and also allowed an easy determinantion of when dark/white where taken if interlaced among projections.

This is now adopted by APS 2-BM, 7-BM, 13-BM, 32-ID, Elettra, SLS tomcat and as 2 weeks ago Dula from the ALS confirmed they are also switching over.

For 2-BM, 7-BM, 13-BM and 32-ID the hdf plugin definitions as used in productions are at:

https://dxfile.readthedocs.io/en/latest/source/demo/doc.areadetector.html

My suggestion would be to use the same, and, if some label is missing just make a pull request to https://github.com/data-exchange/dxfile or propose it to the NX community.

f

just last month I started to followup on the same idea as suggested by @dmpelt and drafter this.

decarlof commented 4 years ago

btw do you know some one using NXtomo in production? Here I can find an old reader from Diamond but is for tiff.

michael-sutherland commented 4 years ago

Thanks @decarlof . Would you suggesting using the path in the dictionary? For our example it would be "/measurement/instrument/monochromator/energy" (see dxfile/energy). I'm currently creating a reader for databroker for NSLS-2 FXI-18, so I need to do data mapping instead of storing things in a common HDF5 format.

BTW, I was linked to NXTomo by DECTRIS

prjemian commented 4 years ago

I agree that HDF5 path is intuitive, tells how that energy was obtained.

decarlof commented 4 years ago

Thanks @decarlof . Would you suggesting using the path in the dictionary? For our example it would be "/measurement/instrument/monochromator/energy" (see dxfile/energy). I'm currently creating a reader for databroker for NSLS-2 FXI-18, so I need to do data mapping instead of storing things in a common HDF5 format.

yes this is what I meant. If useful, I have an example of how to recursively read in an hdf file here.

BTW, I was linked to NXTomo by DECTRIS

thanks!

decarlof commented 4 years ago

@michael-sutherland I am thinking more @prjemian comment

I agree that HDF5 path is intuitive, tells how that energy was obtained.

and although /measurement/instrument/monochromator/energy is more informative, it would be difficult to agree on a common path agreed across facilities for a dictionary. Maybe is easier to simply use energy instead ...

decarlof commented 4 years ago

@michael-sutherland Just another idea: if you look at tomoscan variables you can see a set matching the dxfile definition that is common to all tomography scans as defined in the tomoscan epics base class. Beamline specific variables are defined here and again for each epics PV (i.e. $(P)$(R)UserName) there is a unique python variable (user_name) as defined in dxfile. Perhaps these can be a starting point?

djvine commented 4 years ago

Most lab based CT systems don't use a mono to select the energy, we use the characteristic line to get a quasi-monochromatic beam, so rigidly defining the energy that way doesn't make sense for that use case.

/Measurement/energy is more generic in my opinion.

On Mon, Jun 1, 2020, 5:26 PM Francesco De Carlo notifications@github.com wrote:

@michael-sutherland https://github.com/michael-sutherland Just another idea: if you look at tomoscan https://tomoscan.readthedocs.io/en/latest/_modules/tomoscan/tomoscan.html#TomoScan variables you can see a set matching the dxfile definition that is common to all tomography scans as defined in the tomoscan epics base class https://tomoscan.readthedocs.io/en/latest/tomoScanApp.html#base-class-files. Beamline specific variables are defined here https://tomoscan.readthedocs.io/en/latest/tomoScanApp.html#beamline-specific-files and again for each epics PV (i.e. $(P)$(R)UserName) there is a unique python variable (user_name) as defined in dxfile. Perhaps these can be a starting point?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/data-exchange/dxchange/issues/47#issuecomment-637196246, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABLDBWCT2QJMR4ZRZA3IW3LRURBKRANCNFSM4C4NFY2Q .

michael-sutherland commented 4 years ago

The dxfile definition seems fairly complete, although it looks like we want to remove the full path and just use the last name there? I'm basing that on the comments above and the preliminary read_dx_meta definition. I feel like we could take the dxfile definition and generate a table of (name, full_name, units, description) to be used as the standard for metadata for dxchange. That way the other data types will be compatible with dxfile.

Another quick issue, should we be returning (projs, flats, darks, metadata) or (projs, flats, darks, thetas, metadata)? I know some exchange readers also return (projs, flats, darks, thetas) or other combinations.

Here's my proposal: Unless there is a good reason to be different, the standard formatting should be: Reader return format: (projs, flats, darks, thetas, metadata). Metadata: follow dxfile without path If practical, create separate metadata reader like read_dx_meta (very useful for planning tomography computation and memory requirements).

decarlof commented 4 years ago

it is a very good plan. We start to have many more instrument doing interlaced scans => they always have theta => I would return theta (and handle the case is missing using the meta data for start angle, step size number of projections, and if also the meta data are missing generate an equally spaced array 0-180 counting the projections).

michael-sutherland commented 4 years ago

OK. I just extracted the variables. There are a lot of repeats for things like "name". Also, there's a bunch of inconsistencies for the trailing slash on the root parameters (minor, I know). I decided to not post the list here, as it is rather long. Here's the code to extract:

from dxfile.dxtomo import Entry

entries = [k for k in Entry.__dict__.keys() if not k.startswith('_')]

print('Entry|Full Path|Units|Docstring')
for entry in entries:
    entry_cls = Entry.__dict__[entry]
    path = entry_cls.root + "/" + entry_cls.entry_name
    names = [name for name in entry_cls.__dict__.keys() if name not in ("root", "entry_name", "docstring") and not name.startswith("_")]
    for name in names:
        name_dict = entry_cls.__dict__[name]
        #print('"'+name+'","'+ path+"/"+name+'","'+ str(name_dict.get('units', ''))+'","'+ str(name_dict.get('docstring', ''))+'"')
        print(name+'|'+ path+"/"+name+'|'+ str(name_dict.get('units', ''))+'|'+ str(name_dict.get('docstring', '')))

Is there a good place to post the extracted list for reference?

decarlof commented 4 years ago

@michael-sutherland please post at https://tinyurl.com/dxchange

the repeats are an issue as some info is in the path. Does it make sense to auto-generate path_name?

michael-sutherland commented 4 years ago

Some of the units are interesting. Like using meter for pixel_size. Also, all stages are in units of mm, but I've always seen stages in units of microns (might be a nano-tomography thing). Also, degrees for the angle.

michael-sutherland commented 4 years ago

@decarlof I posted it there (as rst and csv file).

A lot of the parameters need the context of the full path. It might be nice to "cherry pick" some for the root level - like a generic energy. Dxchange is only looking at "measurement" and "process" right now, although that still has conflicts. For me, I usually only look at a few parameters for parallel beam:

sample stage positions
pixel size
shifts (if available)
binning
image size (If I can load metadata separate from images)

decarlof commented 4 years ago

I think there is some history there ... when missing it was defaulted to m. Same for the angles, in fact first things tomopy does is converting in rads ... I also heard arguments for energy should be J .or keV? I am OK changing the default definition, what really matters is that when implemented the unit is specified and the readers knows how to handle it.

michael-sutherland commented 4 years ago

I agree. Consistency is the most important. It might be good to pick a unit length (maybe mm in this case) and use it everywhere so that it is easier. The thetas that is returned outside of metadata should be in units of radians. Also, any vector objects in the metadata object should probably be numpy arrays (not lists or tuples).

decarlof commented 4 years ago

Just to clarify the

when implemented

I have seen rare use of the dxtomo class for example tomobank is using directly to convert in dxfile other facility formats, but in other implementation like area detector writer is used just as a definition to create the correct xml file. Probably what we should have done is to automatically generate the areadetector xml from the dxfile class ....

michael-sutherland commented 4 years ago

Ugh... slicing will be an issue. If we do slicing for proj, then we need to also slice any vectors in the metadata file. This is where a pandas approach would come in handy...

xray-imaging commented 4 years ago

@michael-sutherland totally agree "name" is an issue and I support the change of 'name' to 'group name'_name e.g. attenuator/name => attenuator/attenuator_name. I need to confirm with other instruments then I will update docs and xml files for area detector. Will @michael-sutherland or @djvine update the dxtomo class?

decarlof commented 4 years ago

I implemented the changes on 2-BM, 7-BM and 32-ID areadetector xml. Waiting on @MarkRivers to review/confirm pull request so the changes are also at 13-BM.

data-exchange / dxchange

Standardize output API for the different readers #47