NNPDF / eko

Evolution Kernel Operators
https://eko.readthedocs.io
GNU General Public License v3.0
6 stars 2 forks source link

Consider external serialization libraries #180

Open alecandido opened 1 year ago

alecandido commented 1 year ago

At the moment, I wrote in #172 a mini-serialization library based on dataclasses, it is currently called DictLike.

Now, dataclass already provides natively .asdict() and .from_dict() analagoue (i.e. MyDataClass(**mydict)). This is the initial motivation to go down this way.

It is optimal, since it allows us to go through serialization preserving type information, since it is in the runtime structure, and this is the new instance of the "input layer" currently implemented in yadism. The idea is to improve over and over input-checking relying on custom types definition, such that the internal library can make assumptions on those values.

Unfortunately, type hints and runtime classes are not the exact same thing, and proper serialization with generic type hints is complex (consider that a union is a type hint, not a class, including MyType | None, formerly typing.Optional[MyType], and also list[int], formerly typing.List[int]). So, we already taking care of internal types is not coming at zero cost, especially because many features are introduced in later releases, and we need compatibility with py3.8 at the time of writing. This reflects the fact that types in Python are rather recent, with all their ecosystem (of which dataclass is part).

In order to reduce the burden of maintenance of the serialization part, it is worth to consider external libraries, especially if they are popular enough. A good example would be lidatong/dataclasses-json, that provides a @dataclass_json decorator that is the equivalent of my DictLike base class. It is also compatible with py3.7 and py3.6 (with backport of dataclasses, introduced in standard library in py3.7). Unfortunately, it doesn't look so lively.

Other options are welcome.

alecandido commented 1 year ago

Much better, pydantic seems to do it, so we might get it together with #177

alecandido commented 1 year ago

https://www.attrs.org/en/stable/why.html

attrs is another candidate, together with its companion cattrs.

alecandido commented 1 year ago

https://threeofwands.com/why-i-use-attrs-instead-of-pydantic/

This actually convinced me that we want Pydantic, rather than attrs, since we want to validate runcards. And if someone saves me the burden to write the validator (or part of it), I'm only grateful.

Fatal1ty commented 1 year ago

Hi, @AleCandido

In order to reduce the burden of maintenance of the serialization part, it is worth to consider external libraries, especially if they are popular enough.

This is a good idea especially if you’re interested in serialization of variadic generics from the new PEP 646. It has a lot of edge cases but I managed to cope with them in mashumaro that I would suggest you to try even if you have more common use cases.

alecandido commented 1 year ago

@Fatal1ty another rather nasty type that requires some special care here is npt.NDArray. Have you ever tested it?

Fatal1ty commented 1 year ago

@Fatal1ty another rather nasty type that requires some special care here is npt.NDArray. Have you ever tested it?

For non-standard types like this one a custom universal serialization strategy can be registered. But since npt.NDArray isn't equal to np.ndarray such a strategy should be registered for any scalar type at the moment:

class NDArraySerializationStrategy(SerializationStrategy):
    def serialize(self, value: np.ndarray) -> str:
        tmp_io = io.BytesIO()
        np.save(tmp_io, value, allow_pickle=False)
        return tmp_io.getvalue().hex()

    def deserialize(self, value: str) -> np.ndarray:
        tmp_io = io.BytesIO(bytes.fromhex(value))
        return np.load(tmp_io, allow_pickle=False)

@dataclass
class C(DataClassDictMixin):
    x: npt.NDArray[np.float64]

    class Config(BaseConfig):
        serialization_strategy = {
            npt.NDArray[float64]: NDArraySerializationStrategy(),
        }

This could be unhandy when you have a lot of scalar variations, so I'm thinking about allowing to set a strategy for the origin type — np.ndarray in this case.

felixhekhorn commented 1 month ago

a commit in an earlier attempt is https://github.com/NNPDF/eko/commit/c6c2fa667c4a750e2bdf10b0e0d1997f84772b27