Open masih opened 4 days ago
We should set Tracer in pubsub and log UndeliverableMessage
; such messages end up in that tracer call.
With buffer sizes bumped to 512 both for internal buffer size and pubsub subscriber buffer size seem to have fixed the issue at network scale of 40%.
The true fix is to make the processing loop faster, or internally buffer messages in a more sophisticated way inside GPBFT (and indeally hook it up to validation/internal state to drop messages early if they are not processed yet , were valid when buffered, and became useless)
There is a flurry of these messages every time we re-bootstrap a passive testing network.
I believe the cause could also be that the gpbft runner gets paused when manifest changes and resumes again close to bootstrap of the new manifest. The runner is where the consume loop of topic subscription lives. Meaning, when paused no messages from the previous network topic gets consumed, and hence the "queue full" error.
If on archioz we also only observe this error during rebootstrap then I believe this is a non-issue. Though will not hurt to bump the limits anyway.
Observed in passive testing on mainnet for
filecoin/16
network:This indicates that the processing loop is taking too long to process received messages.
ida
metrics at the corresponding time interval in which this occured: https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org:80&from=1732789053994&to=1732789928949&tab=transform