Current developer interface (from example documentation):
def my_reader(uri: str) -> MyType:
"""Read the object from an fsspec URI."""
return MyType(UPath(uri).read_text())
def my_writer(obj: MyType, uri: str) -> None:
"""Write the object to an fsspec URI."""
UPath(uri).write_text(obj)
MyWrappedType = cereal.wrap_type(MyType, reader=my_reader, writer=my_writer)
However, this is limiting - there are filesystems that use other filesystems as a base and they can't always be done via the described "URL chaining", as you need to pass in keyword arguments. We can't use open_files because we actually don't know what files we'll be writing/reading ahead of time, so we can't use the OpenFile/OpenFiles interface that fsspec provides.
So, we should instead keep around two references: the filesystem: AbstractFileSystem object (which references some file system) and the path: str (is it str?...) that acts as a "relative (or absolute) path" within that filesystem. That way you can open a filesystem that is a Zip file within a secure cloud storage bucket... and still traverse paths within it.
Additionally, having the filesystem reference allows us to use transactions to avoid many issues with data corruption.
Specification
Reader/Writer API
Reader and Writer should have a different Protocol definition. Something similar to this:
@runtime_checkable
class CerealReader(Protocol[T_read]):
"""Reader class for a particular type."""
@abstractmethod
def __call__(self, fs: AbstractFileSystem, path: str) -> T_read:
"""Read data from the given path within the filesystem."""
@runtime_checkable
class CerealWriter(Protocol[T_write]):
"""Writer class for a particular type."""
@abstractmethod
def __call__(self, obj: T_write, fs: AbstractFileSystem, path: str) -> Any:
"""Write data to the given path within the filesystem."""
This forces user-defined readers/writers to be a bit more robust, because it is more explicit (users should support fsspec specifically) than the previous one ("ok here's a URI, load it somehow I guess? oh and we use fsspec").
Cereal API
For the Cereal API, we want to allow usage with just a "URI" (as before) but also allow specifying a filesystem directly. So the new function APIs are:
class Cereal(object):
...
def write_model(self, model: BaseModel, target_path: Union[UPath, Path, str], fs: Optional[AbstractFileSystem] = None) -> str:
"""Write the pydantic.BaseModel to the target_path (as a directory).
Parameters
----------
<todo: describe params>
Usage
-----
If `fs` is given, `target_path` must be a `str` and not contain any protocols (i.e. should be a path within that file system).
If `fs` is None, we will attempt to infer the filesystem type from the `target_path`.
If `target_path` is a `str`, we will use `fsspec` logic to load the file system and path.
If `target_path` is a `Path` (from `pathlib`), a local filesystem will be inferred.
If `target_path` is a `UPath` (from `universal_pathlib`/`upath`), we will use the given filesystem.
"""
def read_model(
self,
source_path: Union[UPath, Path, str],
fs: Optional[AbstractFileSystem] = None,
*,
supercls: Type[TModel] = BaseModel, # type: ignore
) -> TModel:
"""Read a pydantic.BaseModel from the path.
<same logic as above>
"""
Note that argument names have been changed - perhaps this isn't the best naming scheme for parameters, but explicit is better than having a short "URI" that could really mean anything to the user. 😉
This includes changing unit tests and documentation. Yes, it's quite a big change, but it will help get acquainted with the code base. Feel free to reach out to me for help or in case of questions.
Motivation
We use
fsspec
(and currentlyuniversal_pathlib
) to specify both the user interface and the developer interface.Current user interface (from example documentation):
Current developer interface (from example documentation):
However, this is limiting - there are filesystems that use other filesystems as a base and they can't always be done via the described "URL chaining", as you need to pass in keyword arguments. We can't use
open_files
because we actually don't know what files we'll be writing/reading ahead of time, so we can't use theOpenFile
/OpenFiles
interface thatfsspec
provides.So, we should instead keep around two references: the
filesystem: AbstractFileSystem
object (which references some file system) and thepath: str
(is it str?...) that acts as a "relative (or absolute) path" within that filesystem. That way you can open a filesystem that is a Zip file within a secure cloud storage bucket... and still traverse paths within it.Additionally, having the filesystem reference allows us to use transactions to avoid many issues with data corruption.
Specification
Reader/Writer API
Reader and Writer should have a different Protocol definition. Something similar to this:
This forces user-defined readers/writers to be a bit more robust, because it is more explicit (users should support
fsspec
specifically) than the previous one ("ok here's a URI, load it somehow I guess? oh and we usefsspec
").Cereal API
For the
Cereal
API, we want to allow usage with just a "URI" (as before) but also allow specifying afilesystem
directly. So the new function APIs are:Note that argument names have been changed - perhaps this isn't the best naming scheme for parameters, but explicit is better than having a short "URI" that could really mean anything to the user. 😉