Here is how to start experimenting with Quint and the model.
Install the quint tool by running npm i @informalsystems/quint -g (more details here). After installing the quint tool, run the REPL by typing quint in the terminal.
Then you can play with the REPL by trying some basic expressions, e.g.
2+3
val x = Set(1,3,4)
x.powerset()
x.map(i => i*2)
As a first step, examine the model operators from within the REPL. Example commands to try out after running quint within the formal_spec folder:
.load nmt.qnt
import basics.*
import nmt_helpers.*
import nmt.*
val c_leaves = [{ value: ("data", 0), namespaceId: 0 }, { value: ("data", 1), namespaceId: 2 }, { value: ("data", 2), namespaceId: 2 }, { value: ("data", 3), namespaceId: 4 }]
val c_tree = BuildTree(c_leaves)
val proofZero = CreateProofNamespace(0, c_tree)
CreateProofNamespace(2, c_tree)
val c_root = c_tree.hashes.get(1)
verifyInclusionProof(proofZero, c_root, 0, [{ value: ("data", 0), namespaceId: 0 }])
MerkleRangeHash(proofZero, [{ value: ("data", 0), namespaceId: 0 }])
verifyInclusionProof(proofZero, c_root, 0, [{ value: ("data", 0), namespaceId: 2 }])
import nmtProofVerification.*
// runs the initialization action
init
// runs one step of the model
step
// runs another step of the model
step
After getting acquainted with all the operators, you can simulate the model behavior by running

quint run --main=nmtProofVerification --max-samples=1 --max-steps=100 nmt.qnt --invariant=verificationAlwaysCorrect

This command will run the module nmtProofVerification, which iteratively generates data, creates a proof for a namespace, and then verifies it. It will run 100 steps of the model and will check that the property verificationAlwaysCorrect always holds. It will output the steps to the terminal. (For options, run quint run --help.)
quint run --main=nmtTest --max-samples=1 --max-steps=100 nmt.qnt --out-itf=ITF_files/out.itf.json

This command will simulate the module nmtTest, which non-deterministically corrupts the generated proof and outputs the simulation steps to the file ITF_files/out.itf.json, to be used in tests.
Once a test file is generated, it is read by simulation_test.go, a regular Go test. (The path to the JSON file to be tested needs to be provided; at the moment it is hardcoded in the code.)
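For orientation, here is a minimal sketch (in Go) of what such an adapter has to do when reading an ITF trace. It assumes the trace exposes a top-level "states" array of per-step variable assignments; the struct fields, file path, and replay logic are illustrative placeholders, not the actual simulation_test.go.

```go
package nmt_test

import (
	"encoding/json"
	"os"
	"testing"
)

// itfTrace mirrors the rough shape of an ITF trace file: a list of variable
// names and a list of states, each state mapping a variable name to its
// JSON-encoded value. Only the fields needed for this sketch are declared.
type itfTrace struct {
	Vars   []string                     `json:"vars"`
	States []map[string]json.RawMessage `json:"states"`
}

// TestFromITFSketch shows the general structure of an ITF-driven test:
// load the trace, then replay every state by translating it into calls
// against the Go implementation (omitted here).
func TestFromITFSketch(t *testing.T) {
	raw, err := os.ReadFile("ITF_files/out.itf.json") // path is hardcoded, as in the PR
	if err != nil {
		t.Skipf("trace file not available: %v", err)
	}
	var trace itfTrace
	if err := json.Unmarshal(raw, &trace); err != nil {
		t.Fatalf("cannot parse ITF trace: %v", err)
	}
	for i, state := range trace.States {
		// A real adapter would decode the model variables (leaves, proof,
		// namespace, ...) from state and call the nmt functions on them.
		_ = state
		t.Logf("replaying state %d of %d", i+1, len(trace.States))
	}
}
```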
Thank you for providing instructions on using Quint and running tests. I am currently reviewing the PR and will provide my comments as I go. However, I have an immediate question:
In the ITF_files directory, there are two JSON files, panicissue.itf.json and runTest.itf.json. Can you please explain how they were generated and how they differ from the output generated by the following command:

quint run --main=nmtTest --max-samples=1 --max-steps=100 nmt.qnt --out-itf=ITF_files/out.itf.json

Another question: what is the expected output when running TestFromITF using the panicIssue.itf.json file?
The runTest.itf.json file was created exactly by running the command that you cite (perhaps with more execution steps, not fully sure about that). Then, when I ran the test, I noticed it caused the panic. So I inspected which steps were the ones causing the panic and moved those into a separate file, panicIssue.itf.json. Now, for panicIssue.itf.json, I expect the test to panic (until the problem is fixed).
Thanks @ivan-gavran for addressing my previous comments. After examining the current Quint model in greater detail and reviewing some of the examples available in the Quint repository on GitHub, there are certain aspects that still require further clarification. I have listed these aspects below and would appreciate your insights on them.
It is a precise description of the expected behavior, and yet it resides on a higher level of abstraction than the code.
It allows for test generation.
How does the process of generating test cases in the Quint model compare to generating them programmatically in Go? Specifically, the Quint model requires us 1) to write the code to generate the test vectors e.g., creating a tree 2) to specify the corruption scenarios, and develop necessary code to produce corrupt inputs. Can you please discuss the advantages of using the Quint model for test vector generation, and how it compares to generating test vectors programmatically in Go?
In the nmtTest module of the nmt.qnt file, the sequences of steps are generated randomly. This means that depending on the value passed in the --max-samples option and the randomness used to pick those step sequences, we may or may not cover all the corruption cases in the output test cases. I am interested to know your perspective on the confidence we can have in the test vectors generated by the Quint model and the coverage it provides. Additionally, do you anticipate that Quint testing will uncover more bugs than traditional Go testing? How do you compare the tests generated by Quint with the existing tests in the nmt repository?
Having specifications written in Quint makes it possible to change tests/specs quickly
Hi @staheri14 , thanks for the questions and for digging deeper into this PR. I am answering your questions one by one below.
- Can you explain the level of abstraction provided by the Quint model? For instance, when looking at the CreateProofNamespace function (https://github.com/ivan-gavran/nmt/blob/c3cc6b7acba34c97a1a4d5e8fa4be1d355535c1e/formal_spec/nmt.qnt#L277 ), the code appears to be just as detailed and complex as the original Go version. It would be very helpful if you could clarify how the Quint model abstracts away the implementation details.
You are right that the model version and Go version of the computation here are at a similar level of abstraction and complexity. This is because NMT is more about computation than communication or interleaving of messages. My quoted comment was referring to general benefits of using models.
Where we do gain in readability for NMT is in the test cases: using the model, we treat test cases explicitly as data (whereas in the original Go code they are intertwined with test execution).
- How does the process of generating test cases in the Quint model compare to generating them programmatically in Go? Specifically, the Quint model requires us 1) to write the code to generate the test vectors e.g., creating a tree 2) to specify the corruption scenarios, and develop necessary code to produce corrupt inputs. Can you please discuss the advantages of using the Quint model for test vector generation, and how it compares to generating test vectors programmatically in Go?
The current test generation (as present in this PR) is the most similar to the existing fuzz_test.go.
With Quint, we could additionally generate particular high-level scenarios. (This is not present in the PR, sending a commit tomorrow to illustrate.) For instance: generate tests that fail, generate tests that fail for a particular reason, or generate tests that succeed. (In the currently existing tests, these cannot be generated programmatically, but are instead hand-written).
The same Quint model can thus be used for tests such as those in nmt_test.go (except that those are explicitly hand-written) and tests such as those in fuzz_test.go.
- In the nmtTest module of the nmt.qnt file, the sequences of steps are generated randomly. This means that depending on the value passed in the --max-samples option and the randomness used to pick those step sequences, we may or may not cover all the corruption cases in the output test cases. I am interested to know your perspective on the confidence we can have in the test vectors generated by the Quint model and the coverage it provides. Additionally, do you anticipate that Quint testing will uncover more bugs than traditional Go testing? How do you compare the tests generated by Quint with the existing tests in the nmt repository?
This touches on the three modes I was describing: a random simulation, symbolic simulation, and full verification.
Random simulation would probably work about as well as the current fuzz_test.go.
The other two modes (still to be added to Quint, by connecting it to the Apalache model checker) provide better coverage: they do not depend on sheer luck to stumble upon interesting specified(!) scenarios, but guarantee to find them if they exist.
I'd like to emphasise that Quint tests are really Go tests. The difference comes from separating the test-case generation and execution. In the "traditional testing approach", the test cases (such as this tabular sequence) are written by hand; in the "Quint testing", the test cases are generated by the tool. This makes it easier to generate a much larger number of test cases (depending on the state-space size, even exhausting all possible test cases up to a certain number of steps) and in that way making it more likely that a rare bug would be found.
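To make the difference concrete, here is a small Go sketch contrasting the two styles: a hand-written tabular case next to the same test loop fed from a generated JSON file. The type, field names, and file path are illustrative assumptions, not the actual nmt test code.

```go
package nmt_test

import (
	"encoding/json"
	"os"
	"testing"
)

// proofCase is an illustrative shape for one proof-verification scenario.
type proofCase struct {
	Name        string `json:"name"`
	NamespaceID byte   `json:"namespaceId"`
	ExpectFound bool   `json:"expectFound"`
}

// Traditional style: the table of cases is written by hand.
var handWrittenCases = []proofCase{
	{Name: "two leaves and found", NamespaceID: 1, ExpectFound: true},
	{Name: "three leaves and not found", NamespaceID: 3, ExpectFound: false},
}

// Quint style: the same execution loop, but the cases come from a file
// generated by the model instead of being written by hand.
func loadGeneratedCases(t *testing.T, path string) []proofCase {
	raw, err := os.ReadFile(path)
	if err != nil {
		t.Skipf("generated cases not available: %v", err)
	}
	var cases []proofCase
	if err := json.Unmarshal(raw, &cases); err != nil {
		t.Fatalf("cannot parse generated cases: %v", err)
	}
	return cases
}

func TestProveNamespaceTableSketch(t *testing.T) {
	cases := handWrittenCases
	// cases := loadGeneratedCases(t, "ITF_files/generated_cases.json")
	for _, tc := range cases {
		t.Run(tc.Name, func(t *testing.T) {
			// A real test would build the tree, prove tc.NamespaceID,
			// and compare the verification result with tc.ExpectFound.
			_ = tc
		})
	}
}
```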
Having specifications written in Quint makes it possible to change tests/specs quickly
- Could you provide an example to clarify this aspect? It seems that for every new feature, we must write code in both the Quint language and the original Go code. Would this dual coding process result in a quicker change process? Specifically, how does the use of the Quint language reduce the overall time and effort required to implement new features or changes to existing features in comparison to only using the Go code?
The point I wanted to make was that as the codebase evolves (e.g., parameters of a certain function are changed, or the logic changes making some test-cases behave differently etc.), we do not have to re-write a large number of manually written test cases. Instead, we could change the model and the go function which reads the generated test cases and executes them, and have again a huge number of test cases re-generated.
Finally, it is fair to say that Quint is a new technology, so best testing practices are still being established and evaluated. What I believe to be the most useful feature of Quint is that it takes care of generating tests through different algorithms, at the price of having to write a model + Go adapter. The overall effect should be that we can write tests much faster (since one never has to hand-write tabular tests and make sure not to forget a case).
Thanks @ivan-gavran for your thorough response!
Please see my followup questions below:
You are right that the model version and Go version of the computation here are at a similar level of abstraction and complexity. This is because NMT is more about computation than communication or interleaving of messages. My quoted comment was referring to general benefits of using models.
Where we do gain in readability for NMT is in the test cases: using the model, we treat test cases explicitly as data (whereas in the original Go code they are intertwined with test execution).
Thanks for clarifying your point. Based on your explanation, the Quint model for nmt simplifies and automates the process of generating test data, which has two primary benefits (echoing what you said essentially): 1) it allows for the identification of hidden corner cases that would otherwise be difficult to discover manually; and 2) it reduces the workload for reproducing test data and refactoring code when new features are added.
However, in the current state of the PR, the test data generation follows random simulation; hence, coverage-wise (referring to the first point above), the generated test data is similar to the data generated in fuzz_test.go, right?
I would also appreciate it if you could clarify whether Quint can be used as a specification for nmt; I would be curious to learn more about how that would work. For example, could the Quint model (compared to the current Go implementation) be leveraged to facilitate the implementation of nmt in a different language? My understanding is that it cannot replace a formal English specification, and we should not rely on Quint as a replacement for specifications. I would like to know your point of view on this.
How does the process of generating test cases in the Quint model compare to generating them programmatically in Go? Specifically, the Quint model requires us 1) to write the code to generate the test vectors e.g., creating a tree 2) to specify the corruption scenarios, and develop necessary code to produce corrupt inputs. Can you please discuss the advantages of using the Quint model for test vector generation, and how it compares to generating test vectors programmatically in Go?
The current test generation (as present in this PR) is the most similar to the existing fuzz_test.go. With Quint, we could additionally generate particular high-level scenarios. (This is not present in the PR, sending a commit tomorrow to illustrate.) For instance: generate tests that fail, generate tests that fail for a particular reason, or generate tests that succeed. (In the currently existing tests, these cannot be generated programmatically, but are instead hand-written). The same Quint model can thus be used for tests such as those in nmt_test.go (except that those are explicitly hand-written) and tests such as those in fuzz_test.go.
I'm interested to know how to generate test data that intentionally fails for specific reasons using Quint and how to incorporate that data into the tests of the nmt_test.go file. I assume that with your upcoming commit, this process will become clearer. My understanding is that, as each test in the nmt_test.go file requires data to be corrupted in a particular way, it seems that we need to add all those corruption scenarios to the Quint model under the nmt_test module. Then, using the Go adapter, we can parse the JSON file containing the test data, examine the type of corruption in each case, and call the corresponding test function from nmt_test.go (or alternatively make each test in nmt_test.go read the JSON file and find the data that is relevant to its test). Is this the intended workflow for using the resulting test data to replace the hardcoded test vectors in nmt_test.go?
Also, I think quint-generated test data are not meant for unit testing of the functions? For example, tests such as https://github.com/celestiaorg/nmt/blob/05017fc627d0cdb0aaba84293141f67010485648/hasher_test.go#L67 (although, I do not see any immediate reason that this would not be possible). Can you please elaborate on this?
In the nmtTest module of the nmt.qnt file, the sequences of steps are generated randomly. This means that depending on the value passed in the --max-samples option and the randomness used to pick those step sequences, we may or may not cover all the corruption cases in the output test cases. I am interested to know your perspective on the confidence we can have in the test vectors generated by the Quint model and the coverage it provides. Additionally, do you anticipate that Quint testing will uncover more bugs than traditional Go testing? How do you compare the tests generated by Quint with the existing tests in the nmt repository?
This touches on the three modes I was describing: a random simulation, symbolic simulation, and full verification. Random simulation would probably work about as well as the current fuzz_test.go. The other two modes (still to be added to Quint, by connecting it to the Apalache model checker) provide better coverage: they do not depend on sheer luck to stumble upon interesting specified(!) scenarios, but guarantee to find them if they exist.
The point I wanted to make was that as the codebase evolves (e.g., parameters of a certain function are changed, or the logic changes making some test-cases behave differently etc.), we do not have to re-write a large number of manually written test cases. Instead, we could change the model and the go function which reads the generated test cases and executes them, and have again a huge number of test cases re-generated.
I see your point, thanks for clarification! Currently, the test data generated in this PR is using the random simulation, is that correct? How can we use the other two modes?
Merging #163 (b6c7c58) into master (6854976) will not change coverage. The diff coverage is n/a.
@@ Coverage Diff @@
## master #163 +/- ##
=======================================
Coverage 95.39% 95.39%
=======================================
Files 5 5
Lines 565 565
=======================================
Hits 539 539
Misses 15 15
Partials 11 11
I like this discussion :)
[...] Based on your explanation, the Quint model for nmt simplifies and automates the process of generating test data, which has two primary benefits (echoing what you said essentially): 1) it allows for the identification of hidden corner cases that would otherwise be difficult to discover manually; and 2) it reduces the workload for reproducing test data and refactoring code when new features are added.
However, in the current state of the PR, the test data generation follows random simulation; hence, coverage-wise (referring to the first point above), the generated test data is similar to the data generated in fuzz_test.go, right?
That is correct!
I would also appreciate it if you could clarify whether Quint can be used as a specification for nmt; I would be curious to learn more about how that would work. For example, could the Quint model (compared to the current Go implementation) be leveraged to facilitate the implementation of nmt in a different language? My understanding is that it cannot replace a formal English specification, and we should not rely on Quint as a replacement for specifications. I would like to know your point of view on this.
I believe Quint could be used as a specification for nmt. The most promising approach would be to interleave English descriptions (for intuition) with Quint expressions (for precision).
For the NMT spec, the description of treating the ignoreMaxNamespace flag (here) caused some confusion for me as a reader. If that part were instead replaced by a Quint function, that would make it clear and would allow me to inspect the spec for different inputs.
One example of such an approach is an (unfinished) intro to the bank module, linking in particular descriptions of ViewKeeper's functions.
More generally, I see Quint code as a replacement for pseudocode - and now you have specs with pseudocode you can simulate, automatically reason about, or connect to test cases.
I'm interested to know how to generate test data that intentionally fails for specific reasons using Quint and how to incorporate that data into the tests of the nmt_test.go file. I assume that with your upcoming commit, this process will become clearer. My understanding is that, as each test in the nmt_test.go file requires data to be corrupted in a particular way, it seems that we need to add all those corruption scenarios to the Quint model under the nmt_test module. Then, using the Go adapter, we can parse the JSON file containing the test data, examine the type of corruption in each case, and call the corresponding test function from nmt_test.go (or alternatively make each test in nmt_test.go read the JSON file and find the data that is relevant to its test). Is this the intended workflow for using the resulting test data to replace the hardcoded test vectors in nmt_test.go? Also, I think quint-generated test data are not meant for unit testing of the functions? For example, tests such as https://github.com/celestiaorg/nmt/blob/05017fc627d0cdb0aaba84293141f67010485648/hasher_test.go#L67 (although, I do not see any immediate reason that this would not be possible). Can you please elaborate on this?
In the latest commit, I wrote the spec and the Go adapter that mimic the tests from TestNamespacedMerkleTree_ProveNamespace_Ranges_And_Verify.
You will notice that the way in which I generate different scenarios (following the scenarios from the original test) is by defining an invariant. For instance, for the test case described by
{
"three leaves and not found but with range", 2,
append(repeat(generateLeafData(2, 0, 1, []byte("_data")), 2), newNamespaceDataPair([]byte{1, 1}, []byte("_data"))),
[]byte{0, 1},
2, 3,
false,
},
the invariant is defined by
val fiveLeavesAndNotFoundButWithinRange =
  val leavesSet = listToSet(leaves_namespace_idx_v)
  and {
    length(leaves_namespace_idx_v) == 5,
    not(leavesSet.contains(namespace_v)),
    min(leavesSet) < namespace_v,
    max(leavesSet) > namespace_v
  }
The difference between the original test description and the Quint description is that the original description is simultaneously describing a scenario and technically achieving it. In the commit, the Quint part only describes what a test scenario should be about, while the technical details (e.g., byte representation) are left to the adapter, and exact data values are generated (and stored in a .json file, to be consumed by the adapter).
After negating the invariant, a test case is found as a violation of that (negated) invariant.
This example illustrates that one does not have to explicitly write desired corruption cases. What we do instead is describe the data by an invariant.
You are right that Quint is not meant primarily for unit tests, but it can be used for them, too (as the example illustrates).
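To make that division of labor tangible, the sketch below shows the kind of translation an adapter might perform from the model's abstract leaves to a byte representation; the struct, the one-byte namespace layout, and the function name are assumptions for illustration, not the actual adapter code.

```go
package nmt_test

// abstractLeaf mirrors the shape in which the Quint model describes a leaf:
// an integer namespace id and a symbolic value (cf. the
// { value: (...), namespaceId: ... } records used in the REPL session above).
type abstractLeaf struct {
	NamespaceID int
	Value       string
}

// toConcreteLeaf turns the abstract description into the byte layout the Go
// implementation expects: a namespace prefix followed by the raw data.
// The one-byte namespace size is an assumption made for this sketch.
func toConcreteLeaf(l abstractLeaf) []byte {
	namespace := []byte{byte(l.NamespaceID)}
	data := []byte(l.Value)
	return append(namespace, data...)
}
```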
The point I wanted to make was that as the codebase evolves (e.g., parameters of a certain function are changed, or the logic changes making some test-cases behave differently etc.), we do not have to re-write a large number of manually written test cases. Instead, we could change the model and the go function which reads the generated test cases and executes them, and have again a huge number of test cases re-generated.
I see your point, thanks for clarification! Currently, the test data generated in this PR is using the random simulation, is that correct? How can we use the other two modes?
Yes, the test data is generated using random simulation. The other two modes are in the coming soon stage: they depend on connecting Quint to the Apalache model checker (here is the tracking issue).
In the previous comment, I mentioned the new commit, which mimics tests from TestNamespacedMerkleTree_ProveNamespace_Ranges_And_Verify, but written using Quint. I'd like to add some more details about how that is done:
Unlike in nmt.qnt, here I do not generate all leaves at once, but generate one leaf per step. (An approach in which I would generate all at once is also possible. This touches on the @staheri14 comment on using Quint for unit tests.)
Where a test case in the .go file is given by a description and exact values, e.g.:
{
"two leaves and found", 1,
append(generateLeafData(1, 0, 1, []byte("_data")), generateLeafData(1, 1, 2, []byte("_data"))...),
[]byte{1},
1, 2,
true,
}
Quint defines it by:
val twoLeavesAndFound =
  and {
    length(leaves_namespace_idx_v) == 2,
    val leavesSet = listToSet(leaves_namespace_idx_v)
    leavesSet.contains(namespace_v)
  }
Finally, we negate the property and use the negated property as an invariant, whose violation is then a test case.
Concretely, to obtain the test case data, we can run quint run --max-samples=1000 --max-steps=10 tests.qnt --invariant=twoLeavesAndFoundTest. The resulting data file, scenario.json, is manually stitched together. (Generation of multiple test cases for multiple properties is in the pipeline.)

Yes, the test data is generated using random simulation. The other two modes are in the coming soon stage: they depend on connecting Quint to the Apalache model checker (https://github.com/informalsystems/quint/issues/12 is the tracking issue).
Would the current quint model for nmt be applicable for the other two simulation modes or would a different quint model be required for those modes?
The same model can be used for all three modes. The difference is only in how they are looking for an invariant violation (and, therefore, what guarantees they can provide):
For full verification, given a bound n, it will make sure that all possible combinations (of different actions and different values) are explored and that, if no violation is found, then there is no violation (of length up to n).

In this commit I corrected a mistake that was introduced while prettifying the spec, which resulted in the generation of empty test data.
I also added comments around Quint functions and did some Go linting. Marking this PR as ready now.
@ivan-gavran Thanks a lot for addressing the comments, they look good to me! The issue with gobencher is currently being looked into. As soon as it's resolved, I'll let you know. At that point, I'll kindly request you merge the master branch into your branch one last time. After that, we should be ready to merge your PR :)
@ivan-gavran Would you mind updating your branch with the master branch one last time? Then we are good to merge this PR (given that the CIs are fine, excluding the gobencher). Thanks!
Overview
This PR introduces a formal specification of the NMT proof/verification logic in Quint.
The benefits of having a Quint specification are threefold:
- It is a precise description of the expected behavior, and yet it resides on a higher level of abstraction than the code. verificationAlwaysCorrect is an example of a property against which a specification can be checked.
- It allows for test generation. The module nmtTest iteratively generates proofs and non-deterministically corrupts them. These generated test runs are exported in a json format (e.g., the file ITF_traces/runTest.itf.json). To be executed as a part of the standard suite, an adapter simulation_test.go is necessary. (The adapter iterates through the json-represented execution states and translates them into function calls.) The generation of the tests happens through simulation. In that sense, it is similar to the existing tests in fuzz_test.go, except that it also adds corruption of the data.
- Having specifications written in Quint makes it possible to change tests/specs quickly.

Current limitations:
- The treatment of the ignoreMaxNamespace option is not modelled. Modelling it correctly depends on the decision of what the desired behaviour is (issue #148).

Checklist