State module: new API

Armael commented 2 years ago

I recently realized that there is (was) a performance issue with marracheck: it was the bottleneck when installing small packages (marracheck would sit at 100% CPU instead of being I/O bound).

The culprit, tracked down using perf and a flamegraph-generating script (thanks https://github.com/ocaml-bench/notes/blob/master/profiling_notes.md), was overly naive serialization code: when adding a report for a single package (in a file which was designed to be easily appended to), we also re-dump the entire cover state, solution, etc., and spend a lot of CPU time in *.to_json functions.
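To illustrate what "easily appended to" buys, here is a minimal sketch of an append-only write. The helper names (`append_report`, `read_lines`) and the one-line-per-report format are assumptions made for the example, not marracheck's actual code or file format. The point is only that adding one report costs work proportional to that report, rather than re-serializing everything else.

```ocaml
(* Hypothetical helper: append one report line to [file], creating the file
   if needed. Earlier contents are neither read nor rewritten. *)
let append_report (file : string) (line : string) : unit =
  let oc = open_out_gen [Open_wronly; Open_append; Open_creat] 0o644 file in
  Fun.protect ~finally:(fun () -> close_out oc) (fun () ->
    output_string oc line;
    output_char oc '\n')

(* Hypothetical helper: read back all report lines, oldest first. *)
let read_lines (file : string) : string list =
  let ic = open_in file in
  Fun.protect ~finally:(fun () -> close_in ic) (fun () ->
    let rec loop acc =
      match input_line ic with
      | l -> loop (l :: acc)
      | exception End_of_file -> List.rev acc
    in
    loop [])
```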

Part of the issue comes from the fact that the current API of state.ml is not great and makes it easy to implement inefficient code from a serialization point of view.

This in turn motivated me to rewrite state.ml, with the following goals in mind:

1. Introduce an abstraction barrier. Current state.ml offers no abstraction (consequently I have now completely forgotten how to use it correctly).
2. Provide a simpler API. Current state.ml allows you to load the entire state from the disk as a record, which you update in memory and have to remember to re-dump to disk often enough (and then you want to do this efficiently without rewriting everything, which is tricky, and was the cause of the performance bug).
3. Make it easy to change the filesystem layout. Our on-filesystem database revolves around a small number of concepts: files holding serialized data, and directories (sometimes indexed by git). It would be nice to be able to handle basic file reading and serialization in a generic manner, and to be able to easily update the layout of the filesystem (e.g. to add more pieces of serialized data when we need them) in a somewhat declarative manner.

The new state.ml (and fs.ml and data.ml) (this PR) implements the following strategy:

- It implements a simpler API where the client does not need to hold a piece of state that it updates separately and needs to sync to disk from time to time. Instead, the API is such that:
  - every time the client needs data, it reads it from the disk;
  - every time the client updates data, it writes it to the disk.

  If performance (e.g. of reading the same data many times) is a concern, it is trivial to implement caching as an optimization, without changing the API.
- The basic functionality for serializing/deserializing files and handling git repositories is provided in a generic module, parameterized by a "schema" of the on-filesystem "database", which describes declaratively the expected layout of the filesystem. (To some extent it also handles things like automatically creating files with default values if they don't exist.) This is provided by fs.{ml,mli}. This functionality is therefore agnostic wrt the actual layout of the marracheck data, and as such facilitates point 3 above (changing the filesystem layout).
- The Fs.Make functor, parameterized by the database "schema", is instantiated once in state.ml, with the current "marracheck" schema. state.ml then wraps it with a number of helpers to access the files that we know are in our database, and a number of higher-level operations that were in the previous state.ml.
- data.ml gathers the serialization functions (of_json, to_json) for the data we want to store in files.
- Due to this two-layer abstraction (we provide a functor which we only instantiate once), we do pay for some additional "dynamic typing" checks (e.g. we dynamically check that paths belong to the schema, possibly several times for the "same" path). It would be easy to add a cache for those, but it doesn't seem to be an issue in practice (after some quick profiling).
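The overall shape described above (a functor parameterized by a schema, reads and writes that go straight to disk, and a dynamic check that each path belongs to the schema) can be sketched roughly as follows. This is not the interface of fs.mli: the `schema` type, the `SCHEMA` signature, the string-list paths, and the `Db` instantiation are all assumptions made for illustration, and the real module uses phantom types and JSON serialization that are left out here. In this sketch a missing file simply reads as its default value; it is not created on disk.

```ocaml
(* Hypothetical schema description: a file carries its default contents,
   a directory lists its known entries. *)
type schema =
  | File of string
  | Dir of (string * schema) list

module type SCHEMA = sig
  val schema : schema
end

module Make (S : SCHEMA) = struct
  (* Dynamic check: walk the schema along the path components. *)
  let rec lookup schema path =
    match schema, path with
    | node, [] -> Some node
    | Dir entries, name :: rest ->
      (match List.assoc_opt name entries with
       | Some sub -> lookup sub rest
       | None -> None)
    | File _, _ :: _ -> None

  (* Return the default contents if [path] names a file of the schema. *)
  let check_file path =
    match lookup S.schema path with
    | Some (File default) -> default
    | _ -> invalid_arg ("not a file in the schema: " ^ String.concat "/" path)

  let to_filename root path = List.fold_left Filename.concat root path

  (* Every read goes to the disk; a missing file reads as its default. *)
  let read ~root path =
    let default = check_file path in
    let file = to_filename root path in
    if Sys.file_exists file then begin
      let ic = open_in_bin file in
      Fun.protect ~finally:(fun () -> close_in ic) (fun () ->
        really_input_string ic (in_channel_length ic))
    end else default

  (* Every update goes to the disk; the parent directory must exist. *)
  let write ~root path contents =
    ignore (check_file path);
    let oc = open_out_bin (to_filename root path) in
    Fun.protect ~finally:(fun () -> close_out oc) (fun () ->
      output_string oc contents)
end

(* The functor is instantiated once, with a made-up schema. *)
module Db = Make (struct
  let schema = Dir [ "timestamp", File "0"; "reports", Dir [] ]
end)
```

With this instantiation, `Db.read ~root ["timestamp"]` returns `"0"` until something is written, and a path outside the schema raises `Invalid_argument`. The repeated `check_file` call on every access is the kind of dynamic check mentioned in the last point.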
I did check that the performance bug is gone. :-)

I'm opening this as a PR to possibly allow for comments wrt the APIs in fs.mli and state.mli ; I spent a bit of time on those trying to find a good balance between phantom types-based trickery and usability, but there may be some possible further improvements...

cc @gasche

Armael commented 2 years ago

As you note, the new API makes it more tricky to initialize (sub)trees. I pushed a new commit that improves the situation wrt mkdir. Previously, mkdir was potentially "unsafe": it would simply create the directory, which could break schema conformance if the newly created directory is required to contain some files (one would consequently have to take care of creating these files just after calling mkdir). With the new API, mkdir takes an optional init function as argument: after creating the directory, it calls init, then checks conformance wrt the schema, creating files from default values if needed (as is done by load).
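A minimal sketch of the "create, run init, then conform" sequence, under stated assumptions: the `schema` type, `conform`, and this `mkdir` signature are hypothetical and do not reproduce the actual fs.mli. The sketch only covers the "fill in defaults" half of the conformance step; it does not reject a tree that violates the schema.

```ocaml
(* Hypothetical schema description, as a stand-in for the real one. *)
type schema =
  | File of string                    (* default contents *)
  | Dir of (string * schema) list     (* known entries *)

let write_file file contents =
  let oc = open_out_bin file in
  Fun.protect ~finally:(fun () -> close_out oc)
    (fun () -> output_string oc contents)

let read_file file =
  let ic = open_in_bin file in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Create whatever is missing under [path], using default values.
   Entries that already exist (e.g. created by [init]) are left alone. *)
let rec conform path = function
  | File default ->
    if not (Sys.file_exists path) then write_file path default
  | Dir entries ->
    if not (Sys.file_exists path) then Sys.mkdir path 0o755;
    List.iter
      (fun (name, sub) -> conform (Filename.concat path name) sub)
      entries

(* Create [dir], let the caller populate it, then fill in the rest. *)
let mkdir ?(init = (fun (_ : string) -> ())) dir entries =
  Sys.mkdir dir 0o755;
  init dir;
  conform dir (Dir entries)
```

In this sketch, a file written by `init` keeps its custom contents, and only the entries that `init` did not create receive their defaults. Calling `mkdir` on a subdirectory from within `init` also works, since the outer `conform` skips directories that already exist.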

I pondered whether load could similarly take an init function as argument, for the case where one wants to load a db that might need additional initialization steps to make it schema-conformant. But in fact this would simply amount to running init before the current implementation of load, so nothing is gained: the user should just set up the directory up to schema conformance (for mandatory files), then call load.

Armael commented 2 years ago

Note that the new mkdir should be reentrant as expected, i.e. when one calls mkdir in the init function of a parent mkdir.

Armael commented 2 years ago

A new commit adds a cache in Fs (the lower-level module) for successive reads: a read stores the file's contents in the cache, indexed by its path; a subsequent read at the same path loads the data from the cache instead of reading the disk; finally, calls to write invalidate the corresponding cache entry.
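The read/write/invalidate cycle can be sketched as follows. This is an illustration only: the global `Hashtbl`, the raw-string contents, and the `disk_reads` counter (there just to make cache hits observable) are assumptions, not the actual implementation in Fs.

```ocaml
(* Cache of file contents, indexed by path. *)
let cache : (string, string) Hashtbl.t = Hashtbl.create 16

(* Counts actual disk reads, only to make cache hits visible. *)
let disk_reads = ref 0

let read_from_disk file =
  incr disk_reads;
  let ic = open_in_bin file in
  Fun.protect ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* A read is served from the cache when possible, and fills it otherwise. *)
let read file =
  match Hashtbl.find_opt cache file with
  | Some contents -> contents
  | None ->
    let contents = read_from_disk file in
    Hashtbl.replace cache file contents;
    contents

(* A write goes to the disk and invalidates the cache entry for that path. *)
let write file contents =
  let oc = open_out_bin file in
  Fun.protect ~finally:(fun () -> close_out oc)
    (fun () -> output_string oc contents);
  Hashtbl.remove cache file
```

Invalidating on write (rather than updating the cached value) keeps the sketch simple: the next read repopulates the entry from disk.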

This goes hand in hand with a new API for recreating a directory (in a way that atomically preserves integrity of the database wrt the schema). This was previously done "by hand" in marracheck.ml without a dedicated API, but this would now be unsafe wrt the cache (one needs to invalidate the cache entries related to the directory being recreated).

Armael commented 2 years ago

For consistency I also added a remove function (that similarly invalidates cache entries), which is not used currently.
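Recreating or removing a directory means dropping every cache entry located under it, not just one path. A possible sketch of that invalidation step, assuming a cache keyed by path strings; `is_under` and `invalidate_under` are hypothetical names, and the filesystem operations and the atomicity aspect are left out:

```ocaml
(* Cache of file contents, indexed by path (same shape as a per-file cache). *)
let cache : (string, string) Hashtbl.t = Hashtbl.create 16

(* [is_under ~dir path] holds if [path] is [dir] itself or lies below it.
   The separator is appended so that "db/ab" is not considered under "db/a". *)
let is_under ~dir path =
  let prefix = dir ^ Filename.dir_sep in
  let n = String.length prefix in
  path = dir || (String.length path >= n && String.sub path 0 n = prefix)

(* Drop all cache entries related to [dir]. *)
let invalidate_under dir =
  Hashtbl.filter_map_inplace
    (fun path contents -> if is_under ~dir path then None else Some contents)
    cache
```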

Armael / marracheck

State module: new API #8