Armael / marracheck

Gotta check them all!
10 stars 3 forks source link

State module: new API #8

Closed Armael closed 2 years ago

Armael commented 2 years ago

I recently realized that there is (was) a performance issue with marracheck, with it being a bottleneck when installing small packages. (marracheck would spend its time at 100% cpu instead of being I/O bound)

The culprit was tracked down (using perf + a flamegraph generating script, thanks https://github.com/ocaml-bench/notes/blob/master/profiling_notes.md) to be too-naive serialization code: when adding a package report for a single package (in a file which was designed to be easily appended to), we also re-dump the entire cover state, solution, etc, and spend a lot of CPU time in *.to_json functions.

Part of the issue comes from the fact that the current API of state.ml is not great and makes it easy to implement inefficient code from a serialization point of view.

This in turn motivated me to rewrite state.ml, with the following goals in mind:

The new state.ml (and fs.ml and data.ml) (this PR) implements the following strategy:

I'm opening this as a PR to possibly allow for comments wrt the APIs in fs.mli and state.mli ; I spent a bit of time on those trying to find a good balance between phantom types-based trickery and usability, but there may be some possible further improvements...

cc @gasche

Armael commented 2 years ago

As you note, the new API makes it more tricky to initialize (sub)trees. I pushed a new commit that improves the situation wrt mkdir. Previously mkdir would be possibly "unsafe": it would simply create the directory, which could break schema conformance if the newly created directory is required to contain some files (one would consequently take care of creating these files just after calling mkdir). The new API is that mkdir takes an optional init function as argument: after creating the directory, it will call init, and after that, it will check conformance wrt the schema, optionally creating files from default values if needed (as is done by load).

I pondered whether load could similarly take an init function as argument, for the case where one wants to load a db that might need additional initialization steps to make it schema-conformant. But in fact this would simply amount to running init before the current implementation of load, so nothing is gained: the user should just setup the directory up to schema conformance (for mandatory files), then call load.

Armael commented 2 years ago

Note that the new mkdir should be reentrant as expected, where one calls mkdir in the init function of a parent mkdir.

Armael commented 2 years ago

A new commit adds a cache in Fs (the lower level module) for successive reads: a read stores the files' contents in the cache indexed by its path; a successive read at the same path will load data from the cache instead of reading the disk; finally, calls to write invalidate the cache.

This goes hand in hand with a new API for recreating a directory (in a way that atomically preserves integrity of the database wrt the schema). This was previously done "by hand" in marracheck.ml without a dedicated API, but this would now be unsafe wrt the cache (one needs to invalidate the cache entries related to the directory being recreated).

Armael commented 2 years ago

For consistency I also added a remove function (that similarly invalidates cache entries), which is not used currently.