serialization - Githubissues

hartytp commented 1 year ago

To integrate with ndscan we need to be able to serialize our Model + Fitter into a string so it can be broadcast / saved in the H5 file. As an example of why this matters, we need to be able to tell the applet how to run our fit.

For the Fitter I think this is unproblematic. Fitters aren't constructed until we actually do the fit, so all we need to store in the string representation is the python module which the Fitter resides in and its class name.

models have a bit more state which complicates things. The Model string representation needs the parameters dictionary as well as the model, class and any other user-supplied configuration data (e.g. polynomials have a degree argument).

@mbirtwell any advance on the following proposal:

we make Model a dataclass. Then we'd probably move the current constructor into a __post_init__ method. Any additional configuration method, such as the polynomial degree, would then also be stored as class attributes. At that point, I think we can generate a suitable string representation using something like the following pseudocode:

def module_to_str(model: Model) -> str:
    repr = dataclasses.asdict(Model)
    repr["class"] = model.__class__.__name__
    repr["module"] = model.__class__.__module__
    repr["parameters"] = {param: dataclasses.asdict(param_data) for param, param_data in parameters}
    return str(repr)

def decode(repr: str) -> Model:
    dict_repr = dict(repr)
    module_name = dict_repr.pop("module")
    class_name = dict_repr.pop("class")
    dict_repr["parameters"] = ...

mbirtwell commented 1 year ago

Any additional configuration method, such as the polynomial degree, would then also be stored as class attributes.

Do you mean instance attributes?

I don't think str is a suitable serialisation function. Nor would the builtin repr be, which is basically equivalent here because dict's str is the same as it's repr and calls repr on the contain objects. But repr isn't really intended to be a serialisation format it's encouraged that it generates valid python but it doesn't necessarily. Using pyon/json here would be better and is maybe what you meant. But I think that these will both struggle with the functions stored on the scale attribute of ModelParameters. We'd have to do something custom there and that would be annoying.

pickle would also struggle with the scale function as they stand, but could be made to work if we made all the scale functions top level functions, which wouldn't be too difficult.

dill however seems like it might just work out of the box for these types. So I would use that. Dill produces binary data, so we might need to encode it to get a clean transfer:

def dumps(obj: Any) -> str:
    return base64.b64encode(dill.dumps(obj))

def loads(data: str) -> Any:
    return dill.loads(base64.b64decode(data))

hartytp commented 1 year ago

Do you mean instance attributes?

Sorry, yes! They just look like class attributes in the definition of the data class

hartytp commented 1 year ago

But I think that these will both struggle with the functions stored on the scale attribute of ModelParameters. We'd have to do something custom there and that would be annoying.

Is this a problem?

My thinking was that we only serialise the dataclass fields (including the parameters dictionary) + the module name and class. On the deserializing end we do something like

module = importlib.import_module(schema.pop("module"))
cls = getattr(module, schema.pop("class"))
model = cls(*schema)  # remaining entries in schema should just be the dataclass fields

mbirtwell commented 1 year ago

That still looks like it's missing some bits to me. Perhaps not very complicated, bits. Maybe you need to try it, so that we can see some more fleshed out code.

Presumably this means your happy that people can't customise the scale functions on ModelParameters, if they do it will work in the experiment but not the applet.

Also what's the advantage of your approach?

hartytp commented 1 year ago

Presumably this means your happy that people can't customise the scale functions on ModelParameters, if they do it will work in the experiment but not the applet.

That's true, but I can't think of a situation where one changes the scale function without writing a new model as they are so closely coupled.

Also what's the advantage of your approach?

Compared with dill you mean? Good question. I don't have a strongly held view on this.

The proposal here felt more in keeping with ndscan which generally serializes objects into dictionaries of strings and I already have a sketch of code that does essentially this. But that's not a great reason either.

hartytp commented 1 year ago

Presumably this means your happy that people can't customise the scale functions on ModelParameters, if they do it will work in the experiment but not the applet.

That's true, but I can't think of a situation where one changes the scale function without writing a new model as they are so closely coupled.

~aah, no, I take that back! You're right: to serialize a ModelParameter we'd need to serialize the scale function, which is non-trivial if we're using a string representation, so the way I've done this in ndscan in the past won't work trivially here.~ Edit: see below

It's still not clear to me that the dill proposal above works trivially without invasive changes to ndscan. IIRC on the applet side it uses JSON to convert strings into dictionaries of strings. So, at a minimum I think we'd need to escape quotation marks, right? Edit: I guess we can still do this pretty easily with something like re.escape, but it's not totally obvious to me which method is going to work out easier when all is said and done...

hartytp commented 1 year ago

aah, no, I take that back! You're right: to serialize a ModelParameter we'd need to serialize the scale function, which is non-trivial if we're using a string representation, so the way I've done this in ndscan in the past won't work trivially here.

Well, maybe it's not quite that bad. I guess we just have an explicit contract that the scale functions cannot be changed from the model defaults. Then we don't include them in the serialized ModelParameter and recover them from the model class...

hartytp commented 1 year ago

@mbirtwell I tried this out both ways and came around to your suggestion. Converting objects to dictionaries of strings freezes in a lot of implementation details that one ideally wants to keep open. e.g if we're going to convert a ModelParameter to a dictionary of strings, how should we deal with ModelParameter subclasses? I think we'd need to store a module and class for each ModelParameter so we can import any subclasses and everything gets rather messy.

OxIonics / ionics_fits

serialization #28