Implement a file-based database for simulation results

nelimee commented 2 months ago

Is your feature request related to a problem? Please describe.

The goal of our initiative is to generate graphs such as

In the above graph, each point is:

one quantum circuit, generated by tqec for a given value of k, and that can be represented as a stim file,
a large number of simulations performed by stim.

One problem is that Stim simulations are not free, and computing one point from the above graph can take minutes to hours of computational time.

Currently, we have no clever way of storing such data, meaning that the stim simulations have to be re-done each time we want to generate a new graph.

Describe the solution you'd like

We should have a database-like way of storing simulation data. There are multiple requirements:

we should be able to retrieve easily already existing results,
data should be written on disk,
we should be able to add new results to existing ones (typically, start a simulation with 1000 shots to see the overall look of the plot and check that there is no mistake, and, once obvious mistakes have been corrected, launch 999000 more shots to reduce the error bars),
we should be able to remove existing results, but this should be hard to do (i.e., be wary of accidental data loss)

Note that simulation results might be quite heavy in terms of memory, so an optimised storage would be a plus.
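
As a rough illustration of the four requirements above, here is a minimal sketch of a folder-based store. Everything in it is hypothetical (the `ResultStore` class, its method names, and the one-JSON-file-per-key layout are assumptions, not existing tqec code), and it only accumulates shot and error counts rather than raw samples. Keys are assumed to be filesystem-safe strings.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


class ResultStore:
    """Hypothetical on-disk store: one JSON file per experiment key."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, key: str) -> Path:
        return self.root / f"{key}.json"

    def get(self, key: str) -> Optional[dict]:
        """Retrieve existing results, or None if the key is unknown."""
        path = self._path(key)
        if not path.exists():
            return None
        return json.loads(path.read_text())

    def add(self, key: str, shots: int, errors: int) -> dict:
        """Accumulate new shots on top of whatever is already stored."""
        current = self.get(key) or {"shots": 0, "errors": 0}
        current["shots"] += shots
        current["errors"] += errors
        self._path(key).write_text(json.dumps(current))
        return current

    def remove(self, key: str, confirm: str) -> None:
        """Delete results only if the caller repeats the key explicitly."""
        if confirm != key:
            raise ValueError("refusing to delete: confirmation does not match key")
        self._path(key).unlink()


# Usage: a quick 1000-shot run, later topped up with 999000 more shots.
with tempfile.TemporaryDirectory() as tmp:
    store = ResultStore(Path(tmp))
    store.add("example-key", shots=1_000, errors=12)
    total = store.add("example-key", shots=999_000, errors=9_000)
```

Requiring the key to be repeated in `remove` is one simple way to make deletion deliberate; storing raw detection events compactly would need a binary format instead of JSON.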

inmzhang commented 2 months ago

We can think about utilizing the existing sampling tool like sinter. But as far as I know, currently there is no API provided by sinter to store the intermediate sampled detectors/observables to files.

nelimee commented 2 months ago

We can think about utilizing the existing sampling tool like sinter. But as far as I know, currently there is no API provided by sinter to store the intermediate sampled detectors/observables to files.

Yep, the goal of this issue is not the generation (which will very likely be handled by sinter as you note) but rather the storage of generated results.

Also, even if sinter had the possibility to store to files, we would need to have a clear organisation to allow easy retrieval, modification and deletion, so in any case we will need at least helper methods to do that.

Note that this looks a lot like the work done by a database, which might be a path to the solution.

afowler commented 2 months ago

Craig: can you comment on how Stim/sinter simulation results can be systematically stored so that one could later gather additional data for a plot to improve its statistics or explore a wider range of code distances and error rates?

nelimee commented 2 months ago

Craig: can you comment on how Stim/sinter simulation results can be systematically stored so that one could later gather additional data for a plot to improve its statistics or explore a wider range of code distances and error rates?

Whenever I have a task like that, I really follow the database point of view:

I try to find a set of small data points that uniquely identify an "experiment" (in database terms, the primary key),
I try to store in the "experiment" (i.e., the value associated with the primary key) whatever I may need in the future.

In this specific case, I think that the primary key will be composed of:

an algorithmically generated (hash-like) key representing the experiment being benchmarked. For the moment, with the limited use-cases we explicitly target, I guess that we can compute such a hash (or a unique value if we really want to avoid any collision) by only considering:
- each block identifier ("xzx", "zxz", "xozh", ...),
- each block position (i.e., the position of its origin, that is uniquely defined for each block).
These can be directly obtained from the SketchUp file representing the computation and should be:
1. robust enough in the sense that if the computation does not change, the value should not change,
2. sensitive enough to avoid representing 2 different computations by the same value.
the value of k (determining the size of our logical qubits, and code distance),
the noise level. This might be tricky because of the floating-point representation, but there are ways around it that I think should be satisfactory for this use case, e.g., representing the noise level e = powerOfTenMantissa * 10**(-negativePowerOfTen) as a tuple (powerOfTenMantissa, negativePowerOfTen), where 0 <= powerOfTenMantissa <= 1 can be represented as a fraction.
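
To make the three-part key concrete, here is a minimal sketch. All names (`computation_hash`, `noise_key`, `primary_key`) are hypothetical, SHA-256 is just one possible choice of hash, and blocks are assumed to be given as `(identifier, (x, y, z))` pairs already extracted from the SketchUp file.

```python
import hashlib
from fractions import Fraction


def computation_hash(blocks):
    """Hash block identifiers and origins, independently of listing order."""
    canonical = sorted((kind, tuple(position)) for kind, position in blocks)
    payload = ";".join(f"{kind}@{x},{y},{z}" for kind, (x, y, z) in canonical)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def noise_key(mantissa: Fraction, negative_power_of_ten: int):
    """Exact representation of e = mantissa * 10**(-negative_power_of_ten)."""
    if not 0 <= mantissa <= 1:
        raise ValueError("mantissa must lie in [0, 1]")
    return (Fraction(mantissa), negative_power_of_ten)


def primary_key(blocks, k: int, mantissa: Fraction, negative_power_of_ten: int):
    """Assemble the (computation hash, k, noise level) key."""
    return (computation_hash(blocks), k, noise_key(mantissa, negative_power_of_ten))


# Usage: two blocks, k = 2, noise level 1/2 * 10**-3.
blocks = [("xzx", (0, 0, 0)), ("zxz", (0, 0, 1))]
key = primary_key(blocks, k=2, mantissa=Fraction(1, 2), negative_power_of_ten=3)
```

Sorting the blocks before hashing gives the robustness property (same computation, same value). One caveat with the closed interval: (1/10, 2) and (1, 3) denote the same noise level, so callers would need to agree on a single canonical form.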

The stored data will have to include the outputs of stim simulations (direct measurements or detection events, depending on what we need). I think some metadata could also be attached to each value, such as:

date of data generation,
library versions used to generate the data,
custom annotations/tags provided by the user (e.g., "confidential", "internal use only", "public") to be able to filter out some data,
...
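
A possible shape for that metadata, as a sketch only: the `ResultMetadata` class, its field names, and the helpers below are hypothetical. Library versions are looked up with the standard `importlib.metadata` module, falling back to a placeholder string when a package is not installed.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from importlib import metadata


@dataclass
class ResultMetadata:
    """Hypothetical metadata stored next to each simulation result."""

    generated_at: str
    library_versions: dict
    tags: list = field(default_factory=list)


def installed_version(package: str) -> str:
    """Return the installed version of a package, or a placeholder."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def filter_by_tag(records, tag: str):
    """Keep only the records carrying the given user tag."""
    return [record for record in records if tag in record.tags]


# Usage: describe a freshly generated result and serialise it to JSON.
meta = ResultMetadata(
    generated_at=datetime.now(timezone.utc).isoformat(),
    library_versions={name: installed_version(name) for name in ("stim", "sinter")},
    tags=["internal use only"],
)
serialised = json.dumps(asdict(meta))
```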

In terms of format, and because the main data we will store is binary anyway, I do not have any preferences and it can be anything (a real database, a file/folder-based storage, ...).

afowler commented 2 months ago

Sinter always hashes the circuit it was asked to simulate and the decoder it was asked to use, producing a cryptographically strong id. This id is stored alongside any statistics. When you merge multiple files, you match up statistics by this id when deciding whether or not to combine two entries into one entry.
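
The merge-by-id idea can be sketched in a few lines. This is a simplified stand-in, not sinter's own code or data types: the `Stat` class and `merge_stats` function are hypothetical, and only shot and error counts are tracked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stat:
    """Hypothetical, simplified statistics entry keyed by a strong id."""

    strong_id: str
    shots: int
    errors: int


def merge_stats(*groups):
    """Combine entries sharing a strong id; keep the others separate."""
    merged = {}
    for group in groups:
        for stat in group:
            previous = merged.get(stat.strong_id)
            if previous is None:
                merged[stat.strong_id] = stat
            else:
                merged[stat.strong_id] = Stat(
                    stat.strong_id,
                    previous.shots + stat.shots,
                    previous.errors + stat.errors,
                )
    return merged


# Usage: a first quick run, then a second file with more shots for "a".
first_file = [Stat("a", 1_000, 12), Stat("b", 1_000, 3)]
second_file = [Stat("a", 999_000, 9_000)]
combined = merge_stats(first_file, second_file)
```

Because the id is derived from the circuit and the decoder, entries are only combined when both are identical.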

I don't think "how to store stats" is particularly important to the goal of input-skeleton-output-circuit. That's later.
