Saving processing parameters & Nexus output, etc

jonwright commented 3 months ago

As discussed with @jadball, it would be good to have a way to save the data processing steps done together with the output. Also to clean up the command line arguments (etc).

One idea was:

store all the job control / processing parameter arguments somewhere in the output file
append any results into the same file (or a new file that copies that input).

Historically, peaksearch.py was supposed to print the options used on stdout. Ideally, you would want that to be a "reproducible" block of code for re-running the job. The old Tk gui had something for this in help->history. For most of our jobs there is:

input + control parameters (+ source filenames or data dependencies)
output results files (peaks / grains / maps / parameter values etc)
output logging information (success / failure / time taken / fit scores / warnings )

The main problem to "fix" is making a convenient way to compare processing runs with different sets of parameters. But without needing to repeat steps that are already done.

The NXprocess has an NXnote which seems quite flexible to hold just about anything. Within sinograms this could mean setting up the files as a 'chain' with the "sequence_index" https://manual.nexusformat.org/classes/base_classes/NXprocess.html. They can also be entries (==hdf group) in one big file. For example:

move the lima_segmenter parameters into the output sparsefile
the sparsefile has a pointer back to the dataset
the peakstables points to the sparsefile
columnfiles point back to the peakstable
grains point to the columnfile

When you load grains, the code would work back along the chain -> columnfile -> peakstable -> sparsefile -> dataset -> masterfile, etc.

This means changing a bunch of things. The dataset object could be simplified so that it does not need to handle every kind of output or processing that comes later. Each new output that comes up just has it's own output file, and a pointer to locate the input data that was used to create it.

jadball commented 3 months ago

This is a big macro-task, so here's the list of individual tasks as I see it currently:

[ ] Move lima_segmenter parameters into output sparsefile
[ ] Add peak merge parameters (do we have any?) to peaks tables
[ ] Decide where to put experimental parameters (sample-detector distance, pixel size, etc)
[ ] Decide where to put unit cell parameters (need multiphase support)
[ ] Add indexing parameters to grains file
[ ] Add sinogram parameters to grains file
[ ] Add point-by-point indexing parameters somewhere
[ ] Add pointers

Additionally:

The output of the point-by-point indexer is effectively a colfile. Therefore, we could use the colfile class to handle this data
We probably want to make use of the colfile HDF5 output for this
Currently colfile HDF5 has no notion of parameters (there is a fixme for this): https://github.com/FABLE-3DXRD/ImageD11/blob/30bd958f160411c38a7da90a3b02d3e4e0e23e8f/ImageD11/columnfile.py#L553-L564

jadball commented 3 months ago

Thinking on this some more:

At the moment, geometry and phase parameters are linked in a .par file. This allows us to import them into a GUI and into our code with xfab.parameters and immediately compute things like ringds etc. We are currently in need of a few features:

[ ] Separate experimental parameters and phase parameters
[ ] Support for multiple phases
[ ] Importable as a single file (for backwards compatibility with xfab.parameters.loadparameters etc)
[ ] Plain-text editable

I propose something like the following:

geometry.par - contains only the geometric parameters phases.col - a colfile of phases that we could encounter parameters.h5 - an H5 file that points to/contains in some way the geometry and phases

We could write a check in xfab.parameters.loadparameters that checks for H5, and assembles the pars from geometry.par and phases.col (could just take the first row of phases.col) for backwards compatability

The phases colfile could be formatted like so:

# phase_id a b c alpha beta gamma pg/sg
0 4.0 4.0 4.0 90.0 90.0 90.0 225
1 3.2 3.2 3.2 90.0 90.0 90.0 229

I think this approach meets all our requirements. The benefit of a consistent phase_id is that you can simply add a new one if you want to index with a slightly different unit cell (or symmetry), and we can keep track of our phases throughout. We could add then phase_id as an optional parameter to ImageD11.unitcell.unitcell, therefore when a grain object is created with ref_unitcell set, it knows what phase ID it has. We already have ImageD11.unitcell.cellfromstring and ImageD11.unitcell.unitcell.tostring to handle string I/O too!

Consistent phase ID support is I believe required for me to merge my TensorMap changes (as I need consistent grain names which depends on phase ID)

What do you think?

jonwright commented 3 months ago

Sounds promising. A few quick thoughts:

When we load an "old" parameters file, we should call "unitcell_from_parameters" on the parameters object so that old files come with one phase (only) by default.

Phases need a "name", for example, "austenite", "L3", etc rather than an "id". The integer id will annoy people. A phase map of integers is going to need a dictionary that gives the phase names back anyway. If someone adds or removes or edits a phase while working, future they will not want to mess about renumbering.

HDF5 does not seem to be the right solution to me? Something like a toml or json or ini file could be better. Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.

It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact.

jadball commented 3 months ago

Phases need a "name", for example, "austenite", "L3", etc rather than an "id". The integer id will annoy people. A phase map of integers is going to need a dictionary that gives the phase names back anyway. If someone adds or removes or edits a phase while working, future they will not want to mess about renumbering.

100% agree, this will nicely act as a dict key

HDF5 does not seem to be the right solution to me? Something like a toml or json or ini file could be better. Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.

Let's go with json as it's built-in to Python. Versioning is tricky, the simplest I can think of is this format, what do you think?

{
    "versions": {
        "v1": {
            "geometry": "geometry.par",
            "phases": "phases.col"
        },
        "v2": {
            "geometry": "geometry.par",
            "phases": "phases.col"
        }
    }
}

Or we could remove the phases colfile altogether and use the JSON directly (I prefer this actually). Then inside each phase .par we just have the first few cell__ lines

{
    "versions": {
        "v1": {
            "geometry": "geometry.par",
            "phases": {
                "ferrite": "ferrite.par",
                "austenite": "austenite.par"
            }
        },
        "v2": {
            "geometry": "geometry.par",
            "phases": {
                "ferrite": "ferrite.par",
                "austenite": "austenite.par",
                "epsilon": "epsilon.par"
            }
        }
    }
}

It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact.

Great idea! They'll be tiny files anyway.

jonwright commented 3 months ago

For versioning it might be a separate issue. I think we need something like an md5 in there to validate version of a parameter file (does this match what is on disk?) Then just save a backup when we save a new version. In VMS (or using GSAS) you get files with names like geometry.par_00, geometry.par_01, etc.

When we (over-) write a file, we first copy the old one to the next available filename
When we use the parameters (e.g. colfile updateGeometry) it keeps a copy of the parameters it used (and their md5)

There is some related code in the old "project" folder: https://github.com/FABLE-3DXRD/ImageD11/blob/master/ImageD11/depreciated/project/test_json.py

It was expecting diffractometer geometry to come as well.

jadball commented 3 months ago

Something like this?

{
 "geometry": "geometry.par",
 "phases": {
     "ferrite": "ferrite.par",
     "austenite": "austenite.par"
 },
 "phase_hashes": {
     "ferrite": "MD5HASH1",
     "austenite": "MD5HASH2"
 },
}

jadball commented 3 months ago

A few updates on this.

From my perspective we have sorted the requirements for how the multiphase files live on disk. My current preferred method is this:

{
  "geometry": {
    "file":"geometry.par",
    "hash": "MD5HASHGEO"
  },
  "phases": {
    "ferrite": {
      "file": "ferrite.par",
      "hash": "MD5HASH1"
    },
    "austenite": {
      "file": "austenite.par",
      "hash": "MD5HASH2"
    }
  }
}

I've now focused on what our in-code requirements are, which I have outlined:

[ ] pars.json should be readable by parameters.loadparameters, yielding a parameters() object as usual
[ ] The json IO should therefore live inside xfab.parameters
[ ] The json IO therefore can't depend on ImageD11, so we can't directly have unitcell instances
[ ] Nevertheless, we should be able to get a dict of unitcell objects from this file
[ ] So we need something in ImageD11.unitcell too to handle this

Therefore, I have arranged things in the following way:

Added new class to `xfab.parameters` - `JsonPars`

Relevant commit: https://github.com/jadball/xfab/commit/37cdcb7a98d37a5ebe16e877fed667b43ca10363

Features:

self.load_json() - Parses json files into self.json_dict
self.get_pars_objects() - Looks inside self.json_dict, reads the individual .par files, creates a parameters() object for the geometry (self.geometry_pars_obj) and one parameters() object for each phase (self.phase_pars_obj_dict)
self.xfab_pars_dict() property - Combines self.geometry_pars_obj and one of the phase objects to form an overall pars dict
updated parameters.loadparameters(filename) - checks if filename ends with .json, and uses new JsonPars class to import it via JsonPars.xfab_pars_dict

Added new class to `ImageD11.unitcell` - `Phases`

Relevant commit - https://github.com/jadball/ImageD11/commit/c9e998278be7c0a222ebe5cbe44005bc8641de95

Features:

Directly subclasses JsonPars, so we're only defining the IO processes in one place
Has one extra attribute - self.unitcells() - a dict of unitcell objects keyed by phase name
Has one extra method - self.get_unitcells() - takes Json.phase_pars_obj_dict and makes a unitcell for each parameters() object using ImageD11.unitcell.unitcell_from_parameters

I think we still need the following things to make this complete, which I can work on if you're happy with this general approach:

[x] When we call parameters.loadparameters(), we need a way of specifying a phase name to take - should be simple to add to the xfab method
[x] Once we have a columnfile floating around in a notebook with some parameters attached to it via cf.parameters(), we need a way of keeping track which parameters it is using. To solve this, I propose adding a phase_name parameter to parameters() more broadly. This way, when we call parameters.loadparameters(phase='ferrite'), an extra field ends up in the pars.
[x] We can then add a method Phases.get_unitcell_from_pars() which takes the phase_name key and looks inside Phases.unitcells for it
[x] This also means that when we make an indexer via indexer_from_colfile, we keep track of the phase name. We can then give that to each grain object
[x] Down the line, we'll also need to slightly modify how grains are saved to H5, so they're saved inside a group with the phase name as the key
[x] Then when we import grains from H5, we can give an optional Phases object as an argument - when the grains are read, their phase keys are used to look for unitcell objects for each - so we can automatically set the reference unit cells for each grain

Sorry for the huge comment - please let me know your thoughts!

jonwright commented 3 months ago

Sorry, I'm a bit out of the loop for this week.

In general, I would prefer to avoid touching xfab because it means both packages need to upgrade together. Having something in ImageD11 only is just easier while developing.

For now I'm wondering where/how these JSON files will live on disk. It seems to fit for holding a whole analysis project. Also as the drivers for an ewoks pipeline. Probably becomes clear with some examples?

We might use the same phases and parameters for processing a series of different samples...

jadball commented 3 months ago

In order to import the new json transparently when calling parameters.loadparameters(), I will have to modify the classmethod inside xfab - as we moved parameter handling into xfab, I don't see a way to achieve this without modifying xfab eventually. For now, we could temporarily overwrite the imported parameters object inside ImageD11.parameters (or stop importing from xfab) so all the proposed changes are inside one repo?

I worry that extending the capabilities of the json too far at this stage will slow this development down, and I have future code (TensorMap) waiting to be merged that needs well-established phases to work well. If you can write a schema/example of a json file that is expandable in the way you want (so we can add features to it in the future), I can modify the current json parsing accordingly?

jonwright commented 3 months ago

all the proposed changes are inside one repo?

Yes please! It simplifies the install from git script and debugging broken environments (only one thing to add to sys.path). For now, it just means copying the parameters.py file back into ImageD11 and/or monkeypatching inside ImageD11.parameters. In the longer term, we can merge the changes back into xfab if that makes sense.

Another random comment: I am not keen on "Json" in a class name. The class/object can be serialised into another object storage format, so the class/object (dict of phases?) does not depend on the choice of format. We should be able to load/save from different formats and get the same result back.

Try to merge something and we can go on from there...

jadball commented 2 months ago

Significant progress made in #322 for this - new schema can be modified or extended to contain more processing parameters (right now it's just geometry and phases...)

FABLE-3DXRD / ImageD11