Open InKryption opened 5 months ago
Sorry, I'd like to keep this open until we actually have it solved. The code for helping us solve it is in place, it just needs to randomly occur at some point, or the code needs to fundamentally change such that we can guarantee it won't.
just hit it - The failing seed is: '1718031575764'
Can confirm that using '1718031575764' as the seed consistently reproduces the failure state.
another one: (first in list below)
EDIT: i'll keep adding them to this list as i see them:
Can't reproduce the behavior using '1718158590375'.
I just noticed there's another non-deterministic element, the filter parameter: https://github.com/Syndica/sig/blob/488fb37f00d2fd46a1dbb92d315b04d8f53fe903/src/gossip/pull_response.zig#L136
generated here: https://github.com/Syndica/sig/blob/488fb37f00d2fd46a1dbb92d315b04d8f53fe903/src/gossip/pull_response.zig#L102-L110
is random as well, because of this chain of code:
buildGossipPullFilters
calls: https://github.com/Syndica/sig/blob/94f2212ae8ccc9d99ff1249b1b1bfa63880d9784/src/gossip/pull_request.zig#L37GossipPullFilterSet.init
calls: https://github.com/Syndica/sig/blob/94f2212ae8ccc9d99ff1249b1b1bfa63880d9784/src/gossip/pull_request.zig#L102-L107Bloom.random
does this: https://github.com/Syndica/sig/blob/94f2212ae8ccc9d99ff1249b1b1bfa63880d9784/src/bloom/bloom.zig#L94-L101which means there's another source of RNG also based on std.time.milliTimestamp
.
I think we should do an audit on this, and overall replace all uses of std.time.milliTimestamp
as a seed that aren't obviously at the highest level of code, with a std.Random
interface parameter, or a pointer to a concrete PRNG implementation if specific random behavior is desired/required.
yeah pulling out the random seeds to deterministic would be good - we can create a ticket - kind of low priority right now
https://github.com/orgs/Syndica/projects/2/views/1?pane=issue&itemId=67168458
New failing seed caught in CI: https://github.com/Syndica/sig/actions/runs/9811929838/job/27094994026?pr=189 Seed number is '1720202203195'.
https://github.com/Syndica/sig/actions/runs/9897406056/job/27341819918#step:4:1
We got another one: '1720726415143'
This is still a problem. I saw a failure on the latest commit for seed 1725458185024. See my comment above for a list I'll keep updating as I see more failures.
Description
At random, this line in the referenced test will fail:
https://github.com/Syndica/sig/blob/a5d1f34d8a93f0d101e58e2a5aa2a8a5126a632f/src/gossip/pull_response.zig#L143
I currently do not have a stacktrace demonstrating this behavior, because I do not know the circumstances under which this failure occurs.
How to Reproduce the Bug
run
zig build test -Dfilter="gossip.pull_response: test filtering values works"
and prayAdditional Context
It should be noted that the function being tested, named
filterSignedGossipDatas
, currently instantiates and uses a PRNG seeded withstd.time.milliTimeStamp()
, which could explain its seemingly random and unreproducible behavior.What would be ideal is capturing a stack trace of this occurring, as well as the exact or approximate timestamp at which this failure occurs, to diagnose what random values can cause such a failure, and identify what circumstances they lead to which must be guarded against.
At the time of writing I plan on pushing a change which would help in the above goals, without needing to actively search for this mythical seed.