Syndica / sig

a Solana validator client implementation written in Zig
https://syndica.io/sig
Apache License 2.0
219 stars 34 forks source link

gossip: Random test failure in gossip test "test filtering values works" #157

Open InKryption opened 5 months ago

InKryption commented 5 months ago

Description

At random, this line in the referenced test will fail:

https://github.com/Syndica/sig/blob/a5d1f34d8a93f0d101e58e2a5aa2a8a5126a632f/src/gossip/pull_response.zig#L143

I currently do not have a stacktrace demonstrating this behavior, because I do not know the circumstances under which this failure occurs.

How to Reproduce the Bug

run zig build test -Dfilter="gossip.pull_response: test filtering values works" and pray

Additional Context

It should be noted that the function being tested, named filterSignedGossipDatas, currently instantiates and uses a PRNG seeded with std.time.milliTimeStamp(), which could explain its seemingly random and unreproducible behavior.

What would be ideal is capturing a stack trace of this occurring, as well as the exact or approximate timestamp at which this failure occurs, to diagnose what random values can cause such a failure, and identify what circumstances they lead to which must be guarded against.

At the time of writing I plan on pushing a change which would help in the above goals, without needing to actively search for this mythical seed.

InKryption commented 5 months ago

Sorry, I'd like to keep this open until we actually have it solved. The code for helping us solve it is in place, it just needs to randomly occur at some point, or the code needs to fundamentally change such that we can guarantee it won't.

0xNineteen commented 5 months ago

just hit it - The failing seed is: '1718031575764'

InKryption commented 5 months ago

Can confirm that using '1718031575764' as the seed consistently reproduces the failure state.

dnut commented 5 months ago

another one: (first in list below)

EDIT: i'll keep adding them to this list as i see them:

InKryption commented 5 months ago

Can't reproduce the behavior using '1718158590375'.

InKryption commented 5 months ago

I just noticed there's another non-deterministic element, the filter parameter: https://github.com/Syndica/sig/blob/488fb37f00d2fd46a1dbb92d315b04d8f53fe903/src/gossip/pull_response.zig#L136

generated here: https://github.com/Syndica/sig/blob/488fb37f00d2fd46a1dbb92d315b04d8f53fe903/src/gossip/pull_response.zig#L102-L110

is random as well, because of this chain of code:

which means there's another source of RNG also based on std.time.milliTimestamp.

I think we should do an audit on this, and overall replace all uses of std.time.milliTimestamp as a seed that aren't obviously at the highest level of code, with a std.Random interface parameter, or a pointer to a concrete PRNG implementation if specific random behavior is desired/required.

0xNineteen commented 5 months ago

yeah pulling out the random seeds to deterministic would be good - we can create a ticket - kind of low priority right now

https://github.com/orgs/Syndica/projects/2/views/1?pane=issue&itemId=67168458

InKryption commented 4 months ago

New failing seed caught in CI: https://github.com/Syndica/sig/actions/runs/9811929838/job/27094994026?pr=189 Seed number is '1720202203195'.

InKryption commented 4 months ago

https://github.com/Syndica/sig/actions/runs/9897406056/job/27341819918#step:4:1

We got another one: '1720726415143'

dnut commented 2 months ago

This is still a problem. I saw a failure on the latest commit for seed 1725458185024. See my comment above for a list I'll keep updating as I see more failures.