Open jonwright opened 2 months ago
This is a big macro-task, so here's the list of individual tasks as I see it currently:
Additionally:
Thinking on this some more:
At the moment, geometry and phase parameters are linked in a .par file.
This allows us to import them into a GUI and into our code with xfab.parameters
and immediately compute things like ringds
etc.
We are currently in need of a few features:
xfab.parameters.loadparameters
etc)I propose something like the following:
geometry.par
- contains only the geometric parameters
phases.col
- a colfile
of phases that we could encounter
parameters.h5
- an H5 file that points to/contains in some way the geometry and phases
We could write a check in xfab.parameters.loadparameters
that checks for H5, and assembles the pars from geometry.par
and phases.col
(could just take the first row of phases.col
) for backwards compatability
The phases colfile could be formatted like so:
# phase_id a b c alpha beta gamma pg/sg
0 4.0 4.0 4.0 90.0 90.0 90.0 225
1 3.2 3.2 3.2 90.0 90.0 90.0 229
I think this approach meets all our requirements.
The benefit of a consistent phase_id
is that you can simply add a new one if you want to index with a slightly different unit cell (or symmetry), and we can keep track of our phases throughout.
We could add then phase_id
as an optional parameter to ImageD11.unitcell.unitcell
, therefore when a grain
object is created with ref_unitcell
set, it knows what phase ID it has.
We already have ImageD11.unitcell.cellfromstring
and ImageD11.unitcell.unitcell.tostring
to handle string I/O too!
Consistent phase ID support is I believe required for me to merge my TensorMap
changes (as I need consistent grain names which depends on phase ID)
What do you think?
Sounds promising. A few quick thoughts:
When we load an "old" parameters file, we should call "unitcell_from_parameters" on the parameters object so that old files come with one phase (only) by default.
Phases need a "name", for example, "austenite", "L3", etc rather than an "id". The integer id will annoy people. A phase map of integers is going to need a dictionary that gives the phase names back anyway. If someone adds or removes or edits a phase while working, future they will not want to mess about renumbering.
HDF5 does not seem to be the right solution to me? Something like a toml or json or ini file could be better. Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.
It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact.
Phases need a "name", for example, "austenite", "L3", etc rather than an "id". The integer id will annoy people. A phase map of integers is going to need a dictionary that gives the phase names back anyway. If someone adds or removes or edits a phase while working, future they will not want to mess about renumbering.
100% agree, this will nicely act as a dict key
HDF5 does not seem to be the right solution to me? Something like a toml or json or ini file could be better. Ideally, we will want versioning for parameters, as you often want to compare the effect of changing something.
Let's go with json as it's built-in to Python. Versioning is tricky, the simplest I can think of is this format, what do you think?
{
"versions": {
"v1": {
"geometry": "geometry.par",
"phases": "phases.col"
},
"v2": {
"geometry": "geometry.par",
"phases": "phases.col"
}
}
}
Or we could remove the phases colfile altogether and use the JSON directly (I prefer this actually).
Then inside each phase .par we just have the first few cell__
lines
{
"versions": {
"v1": {
"geometry": "geometry.par",
"phases": {
"ferrite": "ferrite.par",
"austenite": "austenite.par"
}
},
"v2": {
"geometry": "geometry.par",
"phases": {
"ferrite": "ferrite.par",
"austenite": "austenite.par",
"epsilon": "epsilon.par"
}
}
}
}
It would be great to add a small database of common unit cells into ImageD11/data. At least silicon, CeO2, LaB6 and the common metals and rocks. If it is just unit cell + space group number, this can be quite compact.
Great idea! They'll be tiny files anyway.
For versioning it might be a separate issue. I think we need something like an md5 in there to validate version of a parameter file (does this match what is on disk?) Then just save a backup when we save a new version. In VMS (or using GSAS) you get files with names like geometry.par_00, geometry.par_01, etc.
There is some related code in the old "project" folder: https://github.com/FABLE-3DXRD/ImageD11/blob/master/ImageD11/depreciated/project/test_json.py
It was expecting diffractometer geometry to come as well.
Something like this?
{
"geometry": "geometry.par",
"phases": {
"ferrite": "ferrite.par",
"austenite": "austenite.par"
},
"phase_hashes": {
"ferrite": "MD5HASH1",
"austenite": "MD5HASH2"
},
}
A few updates on this.
From my perspective we have sorted the requirements for how the multiphase files live on disk. My current preferred method is this:
{
"geometry": {
"file":"geometry.par",
"hash": "MD5HASHGEO"
},
"phases": {
"ferrite": {
"file": "ferrite.par",
"hash": "MD5HASH1"
},
"austenite": {
"file": "austenite.par",
"hash": "MD5HASH2"
}
}
}
I've now focused on what our in-code requirements are, which I have outlined:
pars.json
should be readable by parameters.loadparameters
, yielding a parameters()
object as usualxfab.parameters
ImageD11
, so we can't directly have unitcell
instancesunitcell
objects from this fileImageD11.unitcell
too to handle thisTherefore, I have arranged things in the following way:
xfab.parameters
- JsonPars
Relevant commit: https://github.com/jadball/xfab/commit/37cdcb7a98d37a5ebe16e877fed667b43ca10363
Features:
self.load_json()
- Parses json files into self.json_dict
self.get_pars_objects()
- Looks inside self.json_dict
, reads the individual .par
files, creates a parameters()
object for the geometry (self.geometry_pars_obj
) and one parameters()
object for each phase (self.phase_pars_obj_dict
)self.xfab_pars_dict()
property - Combines self.geometry_pars_obj
and one of the phase objects to form an overall pars
dictparameters.loadparameters(filename)
- checks if filename
ends with .json
, and uses new JsonPars
class to import it via JsonPars.xfab_pars_dict
ImageD11.unitcell
- Phases
Relevant commit - https://github.com/jadball/ImageD11/commit/c9e998278be7c0a222ebe5cbe44005bc8641de95
Features:
JsonPars
, so we're only defining the IO processes in one placeself.unitcells()
- a dict
of unitcell
objects keyed by phase nameself.get_unitcells()
- takes Json.phase_pars_obj_dict
and makes a unitcell
for each parameters()
object using ImageD11.unitcell.unitcell_from_parameters
I think we still need the following things to make this complete, which I can work on if you're happy with this general approach:
parameters.loadparameters()
, we need a way of specifying a phase name to take - should be simple to add to the xfab methodcf.parameters()
, we need a way of keeping track which parameters it is using. To solve this, I propose adding a phase_name
parameter to parameters()
more broadly. This way, when we call parameters.loadparameters(phase='ferrite')
, an extra field ends up in the pars.Phases.get_unitcell_from_pars()
which takes the phase_name
key and looks inside Phases.unitcells
for itindexer_from_colfile
, we keep track of the phase name. We can then give that to each grain
objectPhases
object as an argument - when the grains are read, their phase keys are used to look for unitcell
objects for each - so we can automatically set the reference unit cells for each grainSorry for the huge comment - please let me know your thoughts!
Sorry, I'm a bit out of the loop for this week.
In general, I would prefer to avoid touching xfab because it means both packages need to upgrade together. Having something in ImageD11 only is just easier while developing.
For now I'm wondering where/how these JSON files will live on disk. It seems to fit for holding a whole analysis project. Also as the drivers for an ewoks pipeline. Probably becomes clear with some examples?
We might use the same phases and parameters for processing a series of different samples...
In order to import the new json transparently when calling parameters.loadparameters()
, I will have to modify the classmethod inside xfab - as we moved parameter handling into xfab, I don't see a way to achieve this without modifying xfab eventually.
For now, we could temporarily overwrite the imported parameters object inside ImageD11.parameters
(or stop importing from xfab) so all the proposed changes are inside one repo?
I worry that extending the capabilities of the json too far at this stage will slow this development down, and I have future code (TensorMap) waiting to be merged that needs well-established phases to work well. If you can write a schema/example of a json file that is expandable in the way you want (so we can add features to it in the future), I can modify the current json parsing accordingly?
all the proposed changes are inside one repo?
Yes please! It simplifies the install from git script and debugging broken environments (only one thing to add to sys.path). For now, it just means copying the parameters.py file back into ImageD11 and/or monkeypatching inside ImageD11.parameters. In the longer term, we can merge the changes back into xfab if that makes sense.
Another random comment: I am not keen on "Json" in a class name. The class/object can be serialised into another object storage format, so the class/object (dict of phases?) does not depend on the choice of format. We should be able to load/save from different formats and get the same result back.
Try to merge something and we can go on from there...
Significant progress made in #322 for this - new schema can be modified or extended to contain more processing parameters (right now it's just geometry and phases...)
As discussed with @jadball, it would be good to have a way to save the data processing steps done together with the output. Also to clean up the command line arguments (etc).
One idea was:
Historically, peaksearch.py was supposed to print the options used on stdout. Ideally, you would want that to be a "reproducible" block of code for re-running the job. The old Tk gui had something for this in help->history. For most of our jobs there is:
The main problem to "fix" is making a convenient way to compare processing runs with different sets of parameters. But without needing to repeat steps that are already done.
The NXprocess has an NXnote which seems quite flexible to hold just about anything. Within sinograms this could mean setting up the files as a 'chain' with the "sequence_index" https://manual.nexusformat.org/classes/base_classes/NXprocess.html. They can also be entries (==hdf group) in one big file. For example:
When you load grains, the code would work back along the chain -> columnfile -> peakstable -> sparsefile -> dataset -> masterfile, etc.
This means changing a bunch of things. The dataset object could be simplified so that it does not need to handle every kind of output or processing that comes later. Each new output that comes up just has it's own output file, and a pointer to locate the input data that was used to create it.