FABLE-3DXRD / ImageD11

ImageD11 is a python code for identifying individual grains in spotty area detector X-ray diffraction images.
https://imaged11.readthedocs.io/
GNU General Public License v2.0
15 stars 25 forks source link

`ImageD11.columnfile.columnfile` fails to synchronize across `__setattr__` and `__getitem__` calls #289

Closed AxelHenningsson closed 5 months ago

AxelHenningsson commented 5 months ago

Problem

While the ImageD11.columnfile.columnfile object promise to mutate my data in place 👍

colf = ImageD11.columnfile.colfile_from_dict( {'data': np.random.rand(4,)} )

# mutation by in-place-operation - OK!
colf.data *= 10
assert np.allclose(colf['data'], colf.data)

Mutation by reference will silently fail 😨

colf = ImageD11.columnfile.colfile_from_dict( {'data': np.random.rand(4,)} )

# mutation by pointer mutation - OK!
colf.data = np.ones_like(colf.data)
assert np.allclose(colf['data'], colf.data)

Reason

The ImageD11.columnfile.columnfile object lacks an implementation of __setattr__ and __setitem__ that will mutate the underlying hidden data structures. I.e something like

class columnfile(object):

    def __init__(self):
        ...

    def __setattr__(self, key, val):
        # Set the internal structures and then....
        super(columnfile, self).__setattr__(key, val)

    def __setitem__(self, key, val):
        # Set the internal structures and then....
        super(columnfile, self).__setattr__(key, val)

The Good Solution

Implement columnfile as a pandas dataframe. The pandas.DataFrame is basically columnfile with a bunch of stuff already implemented and extended. It is a well maintained, standard and fast library. Mutations will then work out of the box.

import pandas

colf = pandas.DataFrame({'data': np.random.rand(4,)})

colf.data = np.ones_like(colf.data)
assert np.allclose(colf['data'], colf.data)

The Bad Solution

Overwrite and __setattr__ and __setitem__ and call setcolumn internally before attribute assignment.

The Worst Solution

Make ImageD11.columnfile.columnfile throw an error whenever class members are being mutated and refer to addcolumn or setcolumn for data mutation.

jonwright commented 5 months ago

@jadball has a panda's implementation. I don't want to merge something until we have a back end for polars and/or dask distributed too.

In the meantime, I guess it is easy to patch the setattr/setitem to get what you want?

AxelHenningsson commented 5 months ago

Fantastic @jadball!

I suggest we push the pandas implementation as a new class while we a deprecating the old column file - this is also helpful for smoking out any bugs.

@jon Is there not already such a support in dask and polars?

It is the silent part that worries me. I would suggest that we at least raise an error for things that are not implemented.

Anyways - the two of you are way more core than I on this one - I fully trust your decision.

Cheers Axel

jonwright commented 5 months ago

@jadball code is here https://github.com/jadball/ImageD11/tree/colfile_pandas

There are a bunch of other dataframe libraries listed here : https://github.com/data-apis/dataframe-api/blob/main/spec/purpose_and_scope.md . Seems to be unclear what is compatible or not.

There may also be arguments for looking at xarray https://docs.xarray.dev/en/stable/getting-started-guide/why-xarray.html, for example, to make g-vectors be a (nx3) vector type, etc. And add units.

For me, it is not clear whether we need a new object deprecating columnfile (means changing a LOT) of old code.

Subclassing columnfile to have PandasColumnfile, etc looks promising right now.

jadball commented 5 months ago

I will work on updating my fork to make pandas_columnfile a subclass of columnfile, then we'll try to open a PR for it

jadball commented 5 months ago

PR merged #290