FABLE-3DXRD / ImageD11

ImageD11 is a python code for identifying individual grains in spotty area detector X-ray diffraction images.
https://imaged11.readthedocs.io/
GNU General Public License v2.0
15 stars 25 forks source link

Saving processing parameters & Nexus output, etc #318

Open jonwright opened 2 months ago

jonwright commented 2 months ago

As discussed with @jadball, it would be good to have a way to save the data processing steps done together with the output. Also to clean up the command line arguments (etc).

One idea was:

Historically, peaksearch.py was supposed to print the options used on stdout. Ideally, you would want that to be a "reproducible" block of code for re-running the job. The old Tk gui had something for this in help->history. For most of our jobs there is:

The main problem to "fix" is making a convenient way to compare processing runs with different sets of parameters. But without needing to repeat steps that are already done.

The NXprocess has an NXnote which seems quite flexible to hold just about anything. Within sinograms this could mean setting up the files as a 'chain' with the "sequence_index" https://manual.nexusformat.org/classes/base_classes/NXprocess.html. They can also be entries (==hdf group) in one big file. For example:

When you load grains, the code would work back along the chain -> columnfile -> peakstable -> sparsefile -> dataset -> masterfile, etc.

This means changing a bunch of things. The dataset object could be simplified so that it does not need to handle every kind of output or processing that comes later. Each new output that comes up just has it's own output file, and a pointer to locate the input data that was used to create it.

jadball commented 2 months ago

This is a big macro-task, so here's the list of individual tasks as I see it currently:

Additionally:

jadball commented 2 months ago

Thinking on this some more:

At the moment, geometry and phase parameters are linked in a .par file. This allows us to import them into a GUI and into our code with xfab.parameters and immediately compute things like ringds etc. We are currently in need of a few features:

I propose something like the following:

geometry.par - contains only the geometric parameters phases.col - a colfile of phases that we could encounter parameters.h5 - an H5 file that points to/contains in some way the geometry and phases

We could write a check in xfab.parameters.loadparameters that checks for H5, and assembles the pars from geometry.par and phases.col (could just take the first row of phases.col) for backwards compatability

The phases colfile could be formatted like so:

# phase_id a b c alpha beta gamma pg/sg
0 4.0 4.0 4.0 90.0 90.0 90.0 225
1 3.2 3.2 3.2 90.0 90.0 90.0 229

I think this approach meets all our requirements. The benefit of a consistent phase_id is that you can simply add a new one if you want to index with a slightly different unit cell (or symmetry), and we can keep track of our phases throughout. We could add then phase_id as an optional parameter to ImageD11.unitcell.unitcell, therefore when a grain object is created with ref_unitcell set, it knows what phase ID it has. We already have ImageD11.unitcell.cellfromstring and ImageD11.unitcell.unitcell.tostring to handle string I/O too!

Consistent phase ID support is I believe required for me to merge my TensorMap changes (as I need consistent grain names which depends on phase ID)

What do you think?

jonwright commented 2 months ago

Sounds promising. A few quick thoughts:

When we load an "old" parameters file, we should call "unitcell_from_parameters" on the parameters object so that old files come with one phase (only) by default.

Phases need a "name", for example, "austenite", "L3", etc rather than an "id". The integer id will annoy people. A phase map of integers is going to need a dictionary that gives the phase names back anyway. If someone adds or removes or edits a phase while working, future they will not want to mess about renumbering.

HDF5 does not seem to be the right solution to me? Something like a toml or json or ini file could be better. Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.

It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact.

jadball commented 2 months ago

Phases need a "name", for example, "austenite", "L3", etc rather than an "id". The integer id will annoy people. A phase map of integers is going to need a dictionary that gives the phase names back anyway. If someone adds or removes or edits a phase while working, future they will not want to mess about renumbering.

100% agree, this will nicely act as a dict key

HDF5 does not seem to be the right solution to me? Something like a toml or json or ini file could be better. Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.

Let's go with json as it's built-in to Python. Versioning is tricky, the simplest I can think of is this format, what do you think?

{
    "versions": {
        "v1": {
            "geometry": "geometry.par",
            "phases": "phases.col"
        },
        "v2": {
            "geometry": "geometry.par",
            "phases": "phases.col"
        }
    }
}

Or we could remove the phases colfile altogether and use the JSON directly (I prefer this actually). Then inside each phase .par we just have the first few cell__ lines

{
    "versions": {
        "v1": {
            "geometry": "geometry.par",
            "phases": {
                "ferrite": "ferrite.par",
                "austenite": "austenite.par"
            }
        },
        "v2": {
            "geometry": "geometry.par",
            "phases": {
                "ferrite": "ferrite.par",
                "austenite": "austenite.par",
                "epsilon": "epsilon.par"
            }
        }
    }
}

It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact.

Great idea! They'll be tiny files anyway.

jonwright commented 2 months ago

For versioning it might be a separate issue. I think we need something like an md5 in there to validate version of a parameter file (does this match what is on disk?) Then just save a backup when we save a new version. In VMS (or using GSAS) you get files with names like geometry.par_00, geometry.par_01, etc.

There is some related code in the old "project" folder: https://github.com/FABLE-3DXRD/ImageD11/blob/master/ImageD11/depreciated/project/test_json.py

It was expecting diffractometer geometry to come as well.

jadball commented 2 months ago

Something like this?

{
 "geometry": "geometry.par",
 "phases": {
     "ferrite": "ferrite.par",
     "austenite": "austenite.par"
 },
 "phase_hashes": {
     "ferrite": "MD5HASH1",
     "austenite": "MD5HASH2"
 },
}
jadball commented 2 months ago

A few updates on this.

From my perspective we have sorted the requirements for how the multiphase files live on disk. My current preferred method is this:

{
  "geometry": {
    "file":"geometry.par",
    "hash": "MD5HASHGEO"
  },
  "phases": {
    "ferrite": {
      "file": "ferrite.par",
      "hash": "MD5HASH1"
    },
    "austenite": {
      "file": "austenite.par",
      "hash": "MD5HASH2"
    }
  }
}

I've now focused on what our in-code requirements are, which I have outlined:

Therefore, I have arranged things in the following way:

Added new class to xfab.parameters - JsonPars

Relevant commit: https://github.com/jadball/xfab/commit/37cdcb7a98d37a5ebe16e877fed667b43ca10363

Features:

Added new class to ImageD11.unitcell - Phases

Relevant commit - https://github.com/jadball/ImageD11/commit/c9e998278be7c0a222ebe5cbe44005bc8641de95

Features:

I think we still need the following things to make this complete, which I can work on if you're happy with this general approach:

Sorry for the huge comment - please let me know your thoughts!

jonwright commented 2 months ago

Sorry, I'm a bit out of the loop for this week.

In general, I would prefer to avoid touching xfab because it means both packages need to upgrade together. Having something in ImageD11 only is just easier while developing.

For now I'm wondering where/how these JSON files will live on disk. It seems to fit for holding a whole analysis project. Also as the drivers for an ewoks pipeline. Probably becomes clear with some examples?

We might use the same phases and parameters for processing a series of different samples...

jadball commented 2 months ago

In order to import the new json transparently when calling parameters.loadparameters(), I will have to modify the classmethod inside xfab - as we moved parameter handling into xfab, I don't see a way to achieve this without modifying xfab eventually. For now, we could temporarily overwrite the imported parameters object inside ImageD11.parameters (or stop importing from xfab) so all the proposed changes are inside one repo?

I worry that extending the capabilities of the json too far at this stage will slow this development down, and I have future code (TensorMap) waiting to be merged that needs well-established phases to work well. If you can write a schema/example of a json file that is expandable in the way you want (so we can add features to it in the future), I can modify the current json parsing accordingly?

jonwright commented 2 months ago

all the proposed changes are inside one repo?

Yes please! It simplifies the install from git script and debugging broken environments (only one thing to add to sys.path). For now, it just means copying the parameters.py file back into ImageD11 and/or monkeypatching inside ImageD11.parameters. In the longer term, we can merge the changes back into xfab if that makes sense.

Another random comment: I am not keen on "Json" in a class name. The class/object can be serialised into another object storage format, so the class/object (dict of phases?) does not depend on the choice of format. We should be able to load/save from different formats and get the same result back.

Try to merge something and we can go on from there...

jadball commented 1 month ago

Significant progress made in #322 for this - new schema can be modified or extended to contain more processing parameters (right now it's just geometry and phases...)