celestiaorg / nmt

Namespaced Merkle Tree

Added Quint specification #163

Closed ivan-gavran closed 1 year ago

ivan-gavran commented 1 year ago

Overview

This PR introduces a formal specification of the NMT proof/verification logic in Quint.

The benefits of having a Quint specification are threefold:

- It is a precise description of the expected behavior, and yet it resides on a higher level of abstraction than the code.
- It allows for test generation.
- Having specifications written in Quint makes it possible to change tests/specs quickly.

Current limitations:

Checklist

ivan-gavran commented 1 year ago

Here is how to start experimenting with Quint and the model.

Installation
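
The standard way to install the quint tool (per the Quint README; assumes Node.js and npm are available):

    npm i @informalsystems/quint -g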

Basic resources

REPL

After installing the quint tool, run the REPL by typing quint in the terminal. Then you can play with the REPL by trying some basic expressions, e.g.
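
A couple of generic expressions, independent of the NMT model (the >>> prompt and the outputs are what the REPL prints):

    >>> 2 + 3
    5
    >>> Set(1, 2, 3).map(x => x * 2)
    Set(2, 4, 6)
    >>> [1, 2, 3].length()
    3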

Inspecting the NMT model

As a first step, examine the model operators from within the REPL. Example commands to try out after running quint within the formal_spec folder:

.load nmt.qnt

import basics.*

import nmt_helpers.*

import nmt.*

val c_leaves = [{ value: ("data", 0), namespaceId: 0 }, { value: ("data", 1), namespaceId: 2 }, { value: ("data", 2), namespaceId: 2 }, { value: ("data", 3), namespaceId: 4 }]

val c_tree = BuildTree(c_leaves)

val proofZero = CreateProofNamespace(0, c_tree)

CreateProofNamespace(2, c_tree)

val c_root = c_tree.hashes.get(1)

verifyInclusionProof(proofZero, c_root, 0, [{ value: ("data", 0), namespaceId: 0 }])

MerkleRangeHash(proofZero, [{ value: ("data", 0), namespaceId: 0 }])

verifyInclusionProof(proofZero, c_root, 0, [{ value: ("data", 0), namespaceId: 2 }])

import nmtProofVerification.*

// runs the initialization action
init 

// runs one step of the model
step 

// runs another step of the model
step 

After getting acquainted with all the operators, you can simulate the model behavior by running
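a command along these lines (this is the invocation that also appears later in this discussion; adjust --max-samples and --max-steps as needed):

    quint run --main=nmtTest --max-samples=1 --max-steps=100 nmt.qnt --out-itf=ITF_files/out.itf.json

The --out-itf flag writes the generated trace to a JSON file in the ITF format, which is what the Go test below consumes.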

Running tests

Once a test file is generated, it is read by simulation_test.go, a regular Go test. (A path to the JSON file to be tested needs to be given; at the moment it is hardcoded in the code.)
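
For orientation, here is a minimal sketch of what such an ITF-driven Go test can look like. This is not the actual simulation_test.go; the trace struct and the replay loop are illustrative only.

    package nmt_test

    import (
        "encoding/json"
        "os"
        "testing"
    )

    // itfTrace captures the rough shape of an ITF JSON trace: a sequence of
    // states, each holding the model variables for one step (illustrative only).
    type itfTrace struct {
        States []map[string]json.RawMessage `json:"states"`
    }

    func TestFromITF(t *testing.T) {
        // Path to the generated trace; currently hardcoded, as noted above.
        raw, err := os.ReadFile("ITF_files/runTest.itf.json")
        if err != nil {
            t.Fatalf("reading ITF file: %v", err)
        }
        var trace itfTrace
        if err := json.Unmarshal(raw, &trace); err != nil {
            t.Fatalf("parsing ITF file: %v", err)
        }
        for _, state := range trace.States {
            // Decode the model variables of this step and replay them against
            // the Go implementation, failing on any mismatch (omitted here).
            _ = state
        }
    }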

staheri14 commented 1 year ago

Thank you for providing instructions on using Quint and running tests. I am currently reviewing the PR and will provide my comments as I go. However, I have an immediate question: In the ITF_files directory, there are two JSON files - panicissue.itf.json and runTest.itf.json. Can you please explain how they were generated and how they differ from the output generated by the following command:

quint run --main=nmtTest --max-samples=1 --max-steps=100 nmt.qnt --out-itf=ITF_files/out.itf.json

Another question: what is the expected output when running TestFromITF using the panicIssue.itf.json file?

ivan-gavran commented 1 year ago

Thank you for providing instructions on using Quint and running tests. I am currently reviewing the PR and will provide my comments as I go. However, I have an immediate question: In the ITF_files directory, there are two JSON files - panicissue.itf.json and runTest.itf.json. Can you please explain how they were generated and how they differ from the output generated by the following command:

quint run --main=nmtTest --max-samples=1 --max-steps=100 nmt.qnt --out-itf=ITF_files/out.itf.json

Another question: What is the expected output when running TestFromITF using the panicIssue.itf.json file.

The runTest.itf.json was created exactly by running the command that you cite (perhaps with more execution steps; I am not fully sure about that). Then, when I ran the test, I noticed it caused the panic. So I inspected which steps were the ones causing the panic and extracted them into a separate file, panicIssue.itf.json. Now, for panicIssue.itf.json I expect the test to panic (until the problem is fixed).

staheri14 commented 1 year ago

Thanks @ivan-gavran for addressing my previous comments. After examining the current Quint model in greater detail and reviewing some of the examples available in the Quint repository on GitHub, there are certain aspects that still require further clarification. I have listed these aspects below and would appreciate your insights on them.

It is a precise description of the expected behavior, and yet it resides on a higher level of abstraction than the code.

  1. Can you explain the level of abstraction provided by the Quint model? For instance, when looking at the CreateProofNamespace function (https://github.com/ivan-gavran/nmt/blob/c3cc6b7acba34c97a1a4d5e8fa4be1d355535c1e/formal_spec/nmt.qnt#L277), the code appears to be just as detailed and complex as the original Go version. It would be very helpful if you could clarify how the Quint model abstracts away the implementation details.

It allows for test generation.

  1. How does the process of generating test cases in the Quint model compare to generating them programmatically in Go? Specifically, the Quint model requires us 1) to write the code to generate the test vectors, e.g., creating a tree, and 2) to specify the corruption scenarios and develop the necessary code to produce corrupt inputs. Can you please discuss the advantages of using the Quint model for test vector generation, and how it compares to generating test vectors programmatically in Go?

  2. In the nmtTest module of the nmt.qnt file, the sequences of steps are generated randomly. This means that depending on the value passed in the --max-samples option and the randomness used to pick those step sequences, we may or may not cover all the corruption cases in the output test cases. I am interested to know your perspective on the confidence we can have in the test vectors generated by the Quint model and the coverage it provides. Additionally, do you anticipate that Quint testing will uncover more bugs than traditional Go testing? How do you compare the tests generated by Quint with the existing tests in the nmt repository?

Having specifications written in Quint makes it possible to change tests/specs quickly

  1. Could you provide an example to clarify this aspect? It seems that for every new feature, we must write code in both the Quint language and the original Go code. Would this dual coding process result in a quicker change process? Specifically, how does the use of the Quint language reduce the overall time and effort required to implement new features or changes to existing features in comparison to only using the Go code?

ivan-gavran commented 1 year ago

Hi @staheri14, thanks for the questions and for digging deeper into this PR. I am answering your questions one by one below.

  1. Can you explain the level of abstraction provided by the Quint model? For instance, when looking at the CreateProofNamespace function (https://github.com/ivan-gavran/nmt/blob/c3cc6b7acba34c97a1a4d5e8fa4be1d355535c1e/formal_spec/nmt.qnt#L277 ), the code appears to be just as detailed and complex as the original Go version. It would very helpful if you please clarify how the Quint model abstracts away the implementation details?

You are right that the model version and Go version of the computation here are at a similar level of abstraction and complexity. This is because NMT is more about computation than communication or interleaving of messages. My quoted comment was referring to general benefits of using models.

Where we do gain in readability for NMT is in the test cases: using the model, we treat test cases explicitly as data (whereas in the original Go code they are intertwined with test execution).

  1. How does the process of generating test cases in the Quint model compare to generating them programmatically in Go? Specifically, the Quint model requires us 1) to write the code to generate the test vectors e.g., creating a tree 2) to specify the corruption scenarios, and develop necessary code to produce corrupt inputs. Can you please discuss the advantages of using the Quint model for test vector generation, and how it compares to generating test vectors programmatically in Go?

The current test generation (as present in this PR) is most similar to the existing fuzz_test.go. With Quint, we could additionally generate particular high-level scenarios. (This is not present in the PR; I will send a commit tomorrow to illustrate.) For instance: generate tests that fail, generate tests that fail for a particular reason, or generate tests that succeed. (In the currently existing tests, these cannot be generated programmatically, but are instead hand-written.) The same Quint model can thus be used for tests such as those in nmt_test.go (except that those are explicitly hand-written) and tests such as those in fuzz_test.go.
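
As a rough illustration, such scenario descriptions could be expressed as Quint predicates along these lines (the names and state variables below are hypothetical, not taken from the model in this PR):

    // purely illustrative; verificationResult_v and corruption_v are made-up state variables
    val someVerificationSucceeds = verificationResult_v == true
    val failsBecauseOfCorruptedProof = and {
        verificationResult_v == false,
        corruption_v == "CorruptedProofRange"
    }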

  1. In the nmtTest module of the nmt.qnt file, the sequences of steps are generated randomly. This means that depending on the value passed in the --max-samples option and the randomness used to pick those step sequences, we may or may not cover all the corruption cases in the output test cases. I am interested to know your perspective on the confidence we can have in the test vectors generated by the Quint model and the coverage it provides. Additionally, do you anticipate that Quint testing will uncover more bugs than traditional Go testing? How do you compare the tests generated by Quint with the existing tests in the nmt repository?

This touches on the three modes I was describing: random simulation, symbolic simulation, and full verification. Random simulation would probably work about as well as the current fuzz_test.go. The other two modes (still to be added to Quint, by connecting it to the Apalache model checker) provide better coverage: instead of depending on sheer luck to stumble upon an interesting specified(!) scenario, they guarantee to find it if it exists.

I'd like to emphasise that Quint tests are really Go tests. The difference comes from separating test-case generation from execution. In the "traditional testing approach", the test cases (such as this tabular sequence) are written by hand; in the "Quint testing" approach, the test cases are generated by the tool. This makes it easier to generate a much larger number of test cases (depending on the state-space size, even exhausting all possible test cases up to a certain number of steps), and in that way makes it more likely that a rare bug would be found.

Having specifications written in Quint makes it possible to change tests/specs quickly

  1. Could you provide an example to clarify this aspect? It seems that for every new feature, we must write code in both the Quint language and the original Go code. Would this dual coding process result in a quicker change process? Specifically, how does the use of the Quint language reduce the overall time and effort required to implement new features or changes to existing features in comparison to only using the Go code?

The point I wanted to make was that as the codebase evolves (e.g., parameters of a certain function are changed, or the logic changes, making some test cases behave differently, etc.), we do not have to re-write a large number of manually written test cases. Instead, we change the model and the Go function which reads the generated test cases and executes them, and again have a huge number of test cases re-generated.


Finally, it is fair to say that Quint is a new technology, so best testing practices are still being established and evaluated. What I believe to be the most useful feature of Quint is that it takes care of generating tests through different algorithms, all at the same price of having to write a model + Go adapter. The overall effect should be that we can write tests much faster (since one never has to write tabular tests by hand and make sure not to forget a case).

staheri14 commented 1 year ago

Thanks @ivan-gavran for your thorough response!
Please see my follow-up questions below:

You are right that the model version and Go version of the computation here are at a similar level of abstraction and complexity. This is because NMT is more about computation than communication or interleaving of messages. My quoted comment was referring to general benefits of using models.

Where we do gain on readability of NMT are the test cases: using the model, we treat test-cases explicitly as data (whereas in the original Go code they are intertwined with test execution).

Thanks for clarifying your point. Based on your explanation, the Quint model for nmt simplifies and automates the process of generating test data, which has two primary benefits (echoing what you said essentially): 1) it allows for the identification of hidden corner cases that would otherwise be difficult to discover manually; and 2) it reduces the workload for reproducing test data and refactoring code when new features are added.

However, in the current state of the PR, the test data generation follows random simulation; hence, coverage-wise (referring to the first point above), the generated test data is similar to the data generated in fuzz_test.go, right?

I would also appreciate it if you could clarify whether Quint can be used as a specification for nmt; I would be curious to learn more about how that would work. For example, could the Quint model (compared to the current Go implementation) be leveraged to facilitate the implementation of nmt in a different language? My understanding is that it cannot replace a formal English specification, and that we should not rely on Quint as a replacement for specifications. I would like to know your point of view on this.

How does the process of generating test cases in the Quint model compare to generating them programmatically in Go? Specifically, the Quint model requires us 1) to write the code to generate the test vectors e.g., creating a tree 2) to specify the corruption scenarios, and develop necessary code to produce corrupt inputs. Can you please discuss the advantages of using the Quint model for test vector generation, and how it compares to generating test vectors programmatically in Go?

The current test generation (as present in this PR) is the most similar to the existing fuzz_test.go. With Quint, we could additionally generate particular high-level scenarios. (This is not present in the PR, sending a commit tomorrow to illustrate.) For instance: generate tests that fail, generate tests that fail for a particular reason, or generate tests that succeed. (In the currently existing tests, these cannot be generated programmatically, but are instead hand-written). The same Quint model can thus be used for tests such as those in nmt_test.go (except that those are explicitly hand-written) and tests such as those in fuzz_test.go.

I'm interested to know how to generate test data that intentionally fails for specific reasons using Quint, and how to incorporate that data into the tests of the nmt_test.go file. I assume that with your upcoming commit, this process will become clearer. My understanding is that, as each test in the nmt_test.go file requires data to be corrupted in a particular way, we need to add all those corruption scenarios to the Quint model under the nmt_test module. Then, using the Go adapter, we can parse the JSON file containing the test data, examine the type of corruption in each case, and call the corresponding test function from nmt_test.go (or alternatively make each test in nmt_test.go read the JSON file and find the data that is relevant to its test). Is this the intended workflow for using the resulting test data to replace the hardcoded test vectors in nmt_test.go?

Also, I think Quint-generated test data is not meant for unit testing of individual functions? For example, tests such as https://github.com/celestiaorg/nmt/blob/05017fc627d0cdb0aaba84293141f67010485648/hasher_test.go#L67 (although I do not see any immediate reason why this would not be possible). Can you please elaborate on this?

In the nmtTest module of the nmt.qnt file, the sequences of steps are generated randomly. This means that depending on the value passed in the --max-samples option and the randomness used to pick those step sequences, we may or may not cover all the corruption cases in the output test cases. I am interested to know your perspective on the confidence we can have in the test vectors generated by the Quint model and the coverage it provides. Additionally, do you anticipate that Quint testing will uncover more bugs than traditional Go testing? How do you compare the tests generated by Quint with the existing tests in the nmt repository?

This touches on the three modes I was describing: a random simulation, symbolic simulation, and full verification. Random simulation would probably work equally well as the current fuzz_test.go. The other two modes (still to be added to Quint, by connecting it to the Apalache model checker) provide better coverage, by not depending on sheer luck to stumble upon an interesting specified(!) scenarios, but guaranteeing to find them if they exist.

The point I wanted to make was that as the codebase evolves (e.g., parameters of a certain function are changed, or the logic changes making some test-cases behave differently etc.), we do not have to re-write a large number of manually written test cases. Instead, we could change the model and the go function which reads the generated test cases and executes them, and have again a huge number of test cases re-generated.

I see your point, thanks for the clarification! Currently, the test data generated in this PR uses random simulation, is that correct? How can we use the other two modes?

codecov[bot] commented 1 year ago

Codecov Report

Merging #163 (b6c7c58) into master (6854976) will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #163   +/-   ##
=======================================
  Coverage   95.39%   95.39%           
=======================================
  Files           5        5           
  Lines         565      565           
=======================================
  Hits          539      539           
  Misses         15       15           
  Partials       11       11           

ivan-gavran commented 1 year ago

I like this discussion :)

[...] Based on your explanation, the Quint model for nmt simplifies and automates the process of generating test data, which has two primary benefits (echoing what you said essentially): 1) it allows for the identification of hidden corner cases that would otherwise be difficult to discover manually; and 2) it reduces the workload for reproducing test data and refactoring code when new features are added.

However, as the current state of the PR, the test data generation follows a random simulation, hence, coverage-wise (referring to the first point above) the generated test data is similar to ones generated in the fuzz_test.go, right?

That is correct!

I would also appreciate it if you could clarify whether Quint can be used as a specification for nmt, I would be curious to learn more about how that would work. For example, could the quint model (compared to the current go implementation) be leveraged to facilitate the implementation of nmt in a different language? My understanding is that it cannot replace a formal english specification, and we should not rely on Quint as a replacement of specifications. Would like to know your point of view on this.

I believe Quint could be used as a specification for nmt. The most promising approach would be to interleave English descriptions (for intuition) with Quint expressions (for precision). For the NMT spec, the description of the treatment of the ignoreMaxNamespace flag (here) caused some confusion for me as a reader. If that part were instead expressed as a Quint function, it would be unambiguous, and I could inspect the spec on different inputs. One example of such an approach is an (unfinished) intro to the bank module - in particular, the descriptions of the ViewKeeper's functions.
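
To illustrate the style only (this is not the actual ignoreMaxNamespace rule, just an example of the approach): an English sentence states the intent, and a Quint definition next to it pins the meaning down precisely.

    // "An internal node covers the union of its children's namespace ranges."
    type NamespaceRange = { minNs: int, maxNs: int }

    pure def parentRange(left: NamespaceRange, right: NamespaceRange): NamespaceRange = {
        minNs: if (left.minNs < right.minNs) left.minNs else right.minNs,
        maxNs: if (left.maxNs > right.maxNs) left.maxNs else right.maxNs
    }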

More generally, I see Quint code as a replacement for pseudocode - and now you have specs with pseudocode you can simulate, automatically reason about, or connect to test cases.

I'm interested to know how to generate test data that intentionally fail for specific reasons using quint and how to incorporate that data into the tests of the nmt_test.go file. I assume that with your upcoming commit, this process will become clearer. My understanding is that, as each test in the nmt_test.go file requires data to be corrupted in a particular way, it seems that we need to add all those corruption scenarios to the quint model under the nmt_test module. Then, using the go adapter, we can parse the JSON file containing the test data, examine the type of corruption in each case, and call the corresponding test function from nmt_test.go (or alternatively make each test in the nmt_test.go read the JSON file and finds data that is relevant to its test). Is this the intended workflow for using the resulting test data to replace the hardcoded test vectors in nmt_test.go?

Also, I think quint-generated test data are not meant for unit testing of the functions? For example, tests such as

https://github.com/celestiaorg/nmt/blob/05017fc627d0cdb0aaba84293141f67010485648/hasher_test.go#L67

(although, I do not see any immediate reason that this would not be possible). Can you please elaborate on this?

In the latest commit, I wrote the spec and the Go adapter that mimic the tests from TestNamespacedMerkleTree_ProveNamespace_Ranges_And_Verify. You will notice that the way in which I generate different scenarios (following the scenarios from the original test) is by defining an invariant. For instance, for the test case described by

{
            "three leaves and not found but with range", 2,
            append(repeat(generateLeafData(2, 0, 1, []byte("_data")), 2), newNamespaceDataPair([]byte{1, 1}, []byte("_data"))),
            []byte{0, 1},
            2, 3,
            false,
        },

the invariant is defined by

     val fiveLeavesAndNotFoundButWithinRange = 
        val leavesSet = listToSet(leaves_namespace_idx_v)
        and{
            length(leaves_namespace_idx_v) == 5,            
            not(leavesSet.contains(namespace_v)),
            min(leavesSet) < namespace_v,
            max(leavesSet) > namespace_v
        }

The difference between the original test description and Quint description is that the original description is simultaneously describing a scenario and technically achieving it. In the commit, the Quint part only describes what a test scenario should be about, while the technical details (e.g., byte representation) are left to the adapter and exact data values are generated (and stored in a .json file, to be consumed by the adapter).

After negating the invariant, a test case is found as a violation of that (negated) invariant.
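
Sketching that step for the case above (the names follow the convention of the twoLeavesAndFoundTest example in a later comment, and the tests.qnt file name is assumed from there):

    // the negated scenario description, used as the invariant whose violation is the test case
    val fiveLeavesAndNotFoundButWithinRangeTest = not(fiveLeavesAndNotFoundButWithinRange)

which can then be searched for with

    quint run --max-samples=1000 --max-steps=10 tests.qnt --invariant=fiveLeavesAndNotFoundButWithinRangeTest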

This example illustrates that one does not have to explicitly write desired corruption cases. What we do instead is describe the data by an invariant.

You are right that Quint is not meant primarily for unit tests, but it can be used for them, too (as the example illustrates).

The point I wanted to make was that as the codebase evolves (e.g., parameters of a certain function are changed, or the logic changes making some test-cases behave differently etc.), we do not have to re-write a large number of manually written test cases. Instead, we could change the model and the go function which reads the generated test cases and executes them, and have again a huge number of test cases re-generated.

I see your point, thanks for clarification! Currently, the test data generated in this PR is using the random simulation, is that correct? How can we use the other two modes?

Yes, the test data is generated using random simulation. The other two modes are in the "coming soon" stage: they depend on connecting Quint to the Apalache model checker (here is the tracking issue).

ivan-gavran commented 1 year ago

In the previous comment, I mentioned the new commit, which mimics the tests from TestNamespacedMerkleTree_ProveNamespace_Ranges_And_Verify, but written using Quint. I'd like to add some more details about how that is done:

For instance, where the original test describes the scenario "two leaves and found" through concrete data, Quint defines it by

val twoLeavesAndFound = 
        and{
            length(leaves_namespace_idx_v) == 2,
            val leavesSet = listToSet(leaves_namespace_idx_v)
            leavesSet.contains(namespace_v)
        }

Finally, we negate the property and use the negated property as an invariant, whose violation is then a test case. Concretely, to obtain the test case data, we can run

    quint run --max-samples=1000 --max-steps=10 tests.qnt --invariant=twoLeavesAndFoundTest

staheri14 commented 1 year ago

Yes, the test data is generated using random simulation. The other two modes are in the coming soon stage: they depend on connecting Quint to the Apalache model checker (https://github.com/informalsystems/quint/issues/12 is the tracking issue)

Would the current Quint model for nmt be applicable to the other two simulation modes, or would a different Quint model be required for those modes?

ivan-gavran commented 1 year ago

Would the current quint model for nmt be applicable for the other two simulation modes or would a different quint model be required for those modes?

The same model can be used for all three modes. The difference is only in how they look for an invariant violation (and, therefore, what guarantees they can provide):

ivan-gavran commented 1 year ago

In this commit I corrected a mistake that was introduced while prettifying the spec and that resulted in the generation of empty test data.

I also added comments around the Quint functions and did some Go linting. I am now marking this PR as ready.

staheri14 commented 1 year ago

@ivan-gavran Thanks a lot for addressing the comments; they look good to me! The issue with gobencher is currently being looked into. As soon as it's resolved, I'll let you know, and at that point I'll kindly ask you to merge the master branch into your branch one last time. After that, we should be ready to merge your PR :)

staheri14 commented 1 year ago

@ivan-gavran Would you mind updating your branch with the master branch one last time? Then we are good to merge this PR (given that the CI checks are fine, excluding gobencher). Thanks!