NowanIlfideme / pydantic-cereal

Advanced serialization for Pydantic v2
MIT License
5 stars 0 forks source link

Switch the developer API from using URI to (filesystem + path) #19

Closed NowanIlfideme closed 8 months ago

NowanIlfideme commented 8 months ago

Motivation

We use fsspec (and currently universal_pathlib) to specify both the user interface and the developer interface.

Current user interface (from example documentation):

cereal.write_model(mdl, "memory://example_model")

new_mdl = cereal.read_model("memory://example_model", supercls=ExampleModel)

Current developer interface (from example documentation):

def my_reader(uri: str) -> MyType:
    """Read the object from an fsspec URI."""
    return MyType(UPath(uri).read_text())

def my_writer(obj: MyType, uri: str) -> None:
    """Write the object to an fsspec URI."""
    UPath(uri).write_text(obj)

MyWrappedType = cereal.wrap_type(MyType, reader=my_reader, writer=my_writer)

However, this is limiting - there are filesystems that use other filesystems as a base and they can't always be done via the described "URL chaining", as you need to pass in keyword arguments. We can't use open_files because we actually don't know what files we'll be writing/reading ahead of time, so we can't use the OpenFile/OpenFiles interface that fsspec provides.

So, we should instead keep around two references: the filesystem: AbstractFileSystem object (which references some file system) and the path: str (is it str?...) that acts as a "relative (or absolute) path" within that filesystem. That way you can open a filesystem that is a Zip file within a secure cloud storage bucket... and still traverse paths within it.

Additionally, having the filesystem reference allows us to use transactions to avoid many issues with data corruption.

Specification

Reader/Writer API

Reader and Writer should have a different Protocol definition. Something similar to this:

@runtime_checkable
class CerealReader(Protocol[T_read]):
    """Reader class for a particular type."""

    @abstractmethod
    def __call__(self, fs: AbstractFileSystem, path: str) -> T_read:
        """Read data from the given path within the filesystem."""

@runtime_checkable
class CerealWriter(Protocol[T_write]):
    """Writer class for a particular type."""

    @abstractmethod
    def __call__(self, obj: T_write, fs: AbstractFileSystem, path: str) -> Any:
        """Write data to the given path within the filesystem."""

This forces user-defined readers/writers to be a bit more robust, because it is more explicit (users should support fsspec specifically) than the previous one ("ok here's a URI, load it somehow I guess? oh and we use fsspec").

Cereal API

For the Cereal API, we want to allow usage with just a "URI" (as before) but also allow specifying a filesystem directly. So the new function APIs are:

class Cereal(object):
    ...

    def write_model(self, model: BaseModel, target_path: Union[UPath, Path, str], fs: Optional[AbstractFileSystem] = None) -> str:
        """Write the pydantic.BaseModel to the target_path (as a directory).

        Parameters
        ----------
        <todo: describe params>

        Usage
        -----
        If `fs` is given, `target_path` must be a `str` and not contain any protocols (i.e. should be a path within that file system).
        If `fs` is None, we will attempt to infer the filesystem type from the `target_path`.
        If `target_path` is a `str`, we will use `fsspec` logic to load the file system and path.
        If `target_path` is a `Path` (from `pathlib`), a local filesystem will be inferred.
        If `target_path` is a `UPath` (from `universal_pathlib`/`upath`), we will use the given filesystem.
        """

    def read_model(
        self,
        source_path: Union[UPath, Path, str],
        fs: Optional[AbstractFileSystem] = None,
        *,
        supercls: Type[TModel] = BaseModel,  # type: ignore
    ) -> TModel:
        """Read a pydantic.BaseModel from the path.

        <same logic as above>
        """

Note that argument names have been changed - perhaps this isn't the best naming scheme for parameters, but explicit is better than having a short "URI" that could really mean anything to the user. 😉

NowanIlfideme commented 8 months ago

This includes changing unit tests and documentation. Yes, it's quite a big change, but it will help get acquainted with the code base. Feel free to reach out to me for help or in case of questions.