Neptune-Crypto / neptune-core

anonymous peer-to-peer cash
Apache License 2.0

flaky test: `allow_multiple_inputs_and_outputs_in_block` #100

Closed. aszepieniec closed this issue 5 months ago.

aszepieniec commented 5 months ago

The test `allow_multiple_inputs_and_outputs_in_block` in `archival_state.rs` fails some of the time.

To decrease the probability of failure:

To increase the probability of failure:

Commit c239e362683126149b3e25d360e485eb8a03a4c7 turns the failing `unwrap` into an `unwrap_or_else` with a `panic!` so as to display debug information.
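For context, here is a minimal self-contained sketch of that kind of change. The `TargetChunks` struct below is a hypothetical stand-in for the real type (whose dictionary holds chunks and their MMR membership proofs); running this against an empty dictionary reproduces the shape of the panic message quoted in the reports below:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the real target-chunks type; the actual value
// type in neptune-core differs.
#[derive(Clone, Debug)]
struct TargetChunks {
    dictionary: HashMap<u64, Vec<u32>>,
}

fn main() {
    let mut new_target_chunks = TargetChunks {
        dictionary: HashMap::new(), // empty, as in the failing runs
    };
    let chunk_index: u64 = 0;

    // Before: a bare `.unwrap()` that panicked with no context.
    // After: clone first, so the panic message can print the dictionary
    // without conflicting with the mutable borrow taken by `get_mut`.
    let new_target_chunks_clone = new_target_chunks.clone();
    let relevant_chunk = new_target_chunks
        .dictionary
        .get_mut(&chunk_index)
        .unwrap_or_else(|| {
            panic!(
                "Can't get chunk index {chunk_index} from dictionary! dictionary: {:?}",
                new_target_chunks_clone.dictionary
            )
        });
    relevant_chunk.push(0);
}
```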

I'm not sure how to fix this bug. I decided to shelve this bug hunt for the time being and devote attention to things that I can do. Feel free to take it off my plate.

aszepieniec commented 5 months ago

To increase the probability of failure, also try: `cargo test allow_multiple_inputs_and_outputs_in_block -- --nocapture`.

Sword-Smith commented 5 months ago

I got this error when running allocate_sufficient_input_funds_test as well. Test behavior was also flaky on my machine.

This happened while my machine was running a node.

thread 'models::state::wallet::wallet_tests::allocate_sufficient_input_funds_test' panicked at src/util_types/mutator_set/mutator_set_kernel.rs:191:21:
Can't get chunk index 2 from dictionary! dictionary: {}
stack backtrace:
   0: rust_begin_unwind
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/std/src/panicking.rs:645:5
   1: core::panicking::panic_fmt
             at /rustc/82e1608dfa6e0b5569232559e3d385fea5a93112/library/core/src/panicking.rs:72:14
   2: neptune_core::util_types::mutator_set::mutator_set_kernel::MutatorSetKernel<H,M>::remove_helper::{{closure}}
             at ./src/util_types/mutator_set/mutator_set_kernel.rs:191:21
dan-da commented 5 months ago

Okay, so the panic is occurring in mutator_set_kernel.rs, in the remove_helper() method:

thread 'models::state::archival_state::archival_state_tests::allow_multiple_inputs_and_outputs_in_block' panicked at src/util_types/mutator_set/mutator_set_kernel.rs:195:21:
Can't get chunk index 0 from dictionary! dictionary: {}

which is this call to get_mut():

            let relevant_chunk = new_target_chunks
                .dictionary
                .get_mut(&chunk_index)

I did a little basic debugging. I added these println! statements to remove_helper():

        println!("_debug_: new_target_chunks.dict keys: {:?}", new_target_chunks.dictionary.keys());
        println!("_debug_: chunkindices_to_indices_dict keys: {:?}", chunkindices_to_indices_dict.keys());
        println!("_debug_: batch_index: {}", batch_index);

I logged both a successful test run, and a run that panics. The final output for the failing test run is:

_debug_: new_target_chunks.dict keys: []
_debug_: chunkindices_to_indices_dict keys: [253, 243, 24, 249, 68, 214, 192, 100, 181, 87, 36, 185, 121, 173, 177, 50, 52, 150, 128, 86, 119, 80, 143, 0, 206, 175, 26, 226, 92, 106, 1, 225, 204, 42, 111, 197, 187, 221, 136, 33, 37, 235, 144]
_debug_: batch_index: 1

Look carefully and we spot the 0 entry between 143 and 206. This is the cause of the crash. So either 0 should not be present here, or remove_helper() has a bug and should handle the zero.

I verified that a 0 never appears during a successful test run:

$ grep _debug_ /tmp/test.out | grep ", 0," | wc -l
0

$ grep _debug_ /tmp/test2.out | grep ", 0," | wc -l
1

(test.out is the logfile of the successful run; test2.out is the run that panicked)

Something else that jumped out at me is that new_target_chunks.dictionary is always empty; this holds throughout both test runs.

The 0 chunk_index key is a problem because of this code in remove_helper():

            if chunk_index >= batch_index {
                ...

                continue;
            }

batch_index is 1, so for every chunk_index except 0 this condition is true and we continue, skipping the dictionary lookup. When chunk_index is 0, the condition is false, and we proceed to:

            let relevant_chunk = new_target_chunks
                .dictionary
                .get_mut(&chunk_index)

However, as we have seen, new_target_chunks.dictionary is always empty. Thus, a panic.
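To make the failure path concrete, here is a minimal standalone sketch of the control flow just described, using hypothetical stand-in types rather than the real mutator set code:

```rust
use std::collections::HashMap;

fn main() {
    let batch_index: u64 = 1;
    // A few of the chunk indices observed in the failing run, including 0.
    let chunkindices: Vec<u64> = vec![253, 243, 143, 0, 206, 1];
    // Empty, as observed in both test runs.
    let dictionary: HashMap<u64, Vec<u32>> = HashMap::new();

    for chunk_index in chunkindices {
        if chunk_index >= batch_index {
            // Active part of the Bloom filter: no dictionary lookup needed.
            continue;
        }
        // Only chunk_index == 0 reaches this point, and the lookup fails.
        assert!(dictionary.get(&chunk_index).is_none());
        println!("chunk index {chunk_index} not in the (empty) dictionary => panic in the real code");
    }
}
```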

I have no working knowledge of these data structures and surrounding code, so it is unclear to me where to go from here. I am suspicious of:

aszepieniec commented 5 months ago

I can answer some of these questions.

dan-da commented 5 months ago

@aszepieniec, let's consider this code block together:

    for (chunk_index, indices) in chunkindices_to_indices_dict {
        if chunk_index >= batch_index {
            // index is in the active part, so insert it in the active part of the Bloom filter
            for index in indices {
                let relative_index = (index - active_window_start) as u32;
                self.swbf_active.insert(relative_index);
            }

            continue;
        }

        // If chunk index is not in the active part, insert the index into the relevant chunk
        let new_target_chunks_clone = new_target_chunks.clone();
        let relevant_chunk = new_target_chunks
            .dictionary
            .get_mut(&chunk_index)

along with this set of input chunk_index values:

[253, 243, 24, 249, 68, 214, 192, 100, 181, 87, 36, 185, 121, 173, 177, 50, 52, 150, 128, 86, 119, 80, 143, 0, 206, 175, 26, 226, 92, 106, 1, 225, 204, 42, 111, 197, 187, 221, 136, 33, 37, 235, 144]

Given that batch_index is 1, this check:

if chunk_index >= batch_index

will cause the loop to continue for every entry in that set except 0.

That seems odd/suspicious to me. Does it seem correct to you? If so, why?

Meanwhile, new_target_chunks.dictionary is empty, and we call get_mut() on it, simply expecting the 0 index to be there.

Does that seem correct? If so, why?

> The batch_index tracks which epoch an item was added to the mutator set in. Right now the BATCH_SIZE is 8, so every 8 additions the batch_index increments by 1.
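Read literally (a hypothetical sketch; the exact derivation in the codebase may differ), that rule makes the batch index the addition count divided by BATCH_SIZE:

```rust
// Hypothetical illustration of the quoted rule: with BATCH_SIZE = 8, the
// batch index increments once per 8 additions, i.e. floor division.
const BATCH_SIZE: u64 = 8;

fn batch_index_after(additions: u64) -> u64 {
    additions / BATCH_SIZE
}

fn main() {
    assert_eq!(batch_index_after(7), 0); // still in the first batch
    assert_eq!(batch_index_after(8), 1); // matches batch_index = 1 in the failing run
    assert_eq!(batch_index_after(20), 2);
}
```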

OK, well, given the present code, all three of these statements can never be simultaneously true:

- some chunk_index == 0 is present in chunkindices_to_indices_dict
- batch_size > 0
- new_target_chunks.dictionary is empty

If in fact, those 3 things can be simultaneously true, then the above code is in error.

So, do you have an idea which of the three statements is in error? Or, alternatively, is the code in error?

aszepieniec commented 5 months ago

> will cause the loop to continue for every entry in that set except 0.
>
> That seems odd/suspicious to me. Does it seem correct to you? If so, why?

This seems correct to me. All the chunk indices larger than 0 point into the active window, so there is no need to record MMR-membership-proofs for SWBF-indices with those chunk-indices.

> If in fact, those 3 things can be simultaneously true, then the above code is in error.

(I assume you meant batch_index > 0 rather than batch_size > 0.)

The first two bullet points are already mutually incompatible: if some chunk_index == 0 is present, then batch_index == 0. More generally, I think `batch_index <= min(chunk_index)` always holds.

What baffles me is how this test can be flaky in the first place. If it were an error isolated to the mutator set, then we should be able to reduce it to a minimal, deterministic failing test case. I was under the impression that the test's failure was a consequence of having multiple interacting threads, but maybe that impression is simply false.

aszepieniec commented 5 months ago

> The first two bullet points are already mutually incompatible: if some chunk_index == 0 is present, then batch_index == 0. More generally, I think `batch_index <= min(chunk_index)` always holds.

Actually, this is misleading. The variable batch_index in this scope denotes the mutator set's current batch index, not the batch index of the record being removed. (We don't know this latter batch index, at least not exactly; that's the whole point of using a mutator set in the first place.)

With that in mind, there is no contradiction at all in batch_index being larger than chunk_index. The problem seems to be that the target chunks dictionary is empty when it shouldn't be.
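A small standalone sketch of that distinction, with hypothetical numbers (BATCH_SIZE = 8 as quoted earlier): the set's current batch index can exceed an old record's chunk indices without any contradiction, and the condition that actually fails here is that the dictionary should cover every inactive chunk index.

```rust
use std::collections::HashMap;

fn main() {
    const BATCH_SIZE: u64 = 8;

    // Hypothetical numbers: 16 additions have happened since the record
    // was created, so the set's *current* batch index is 2.
    let current_batch_index = 16 / BATCH_SIZE;

    // An old record can legitimately carry chunk indices below the set's
    // current batch index, e.g. chunk 0 from the very first batch.
    let record_chunk_indices = [0u64, 1];

    // The invariant this thread converges on: every chunk index in the
    // inactive part (below the current batch index) must have an entry in
    // the target-chunks dictionary before removal can proceed.
    let dictionary: HashMap<u64, Vec<u32>> = HashMap::new(); // empty, as in the bug
    for &chunk_index in &record_chunk_indices {
        if chunk_index < current_batch_index {
            assert!(
                dictionary.contains_key(&chunk_index),
                "target chunks dictionary is missing inactive chunk {chunk_index}"
            );
        }
    }
}
```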