filecoin-project / go-f3

Golang implementation of Fast Finality for Filecoin (F3)
Apache License 2.0

Investigate `subscriber too slow` errors in pubsub #759

Open masih opened 4 days ago

masih commented 4 days ago

Observed in passive testing on mainnet for the filecoin/16 network:

{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}
{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}
{"level":"info","ts":"2024-11-28T21:17:37.595+1100","logger":"pubsub","caller":"go-libp2p-pubsub@v0.11.0/pubsub.go:981","msg":"Can't deliver message to subscription for topic /f3/granite/0.0.2/filecoin/16; subscriber too slow"}

This indicates that the processing loop is taking too long to process received messages. Metrics from the ida instance for the time interval in which this occurred: https://grafana.f3.eng.filoz.org/d/edsu1k5s7gtfkb/f3-passive-testing?orgId=1&var-network=mainnet&var-instance=ida.f3.eng.filoz.org:80&from=1732789053994&to=1732789928949&tab=transform

masih commented 4 days ago

We should set a Tracer in pubsub and log UndeliverableMessage; such messages end up in that tracer call.
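As a rough sketch of what that could look like (assuming go-libp2p-pubsub's RawTracer interface; the `undeliverableLogger` type, logger name, and `newTracedPubSub` helper are illustrative, not go-f3 code):

```go
package example

import (
	"context"

	logging "github.com/ipfs/go-log/v2"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/core/protocol"
)

var log = logging.Logger("f3/pubsub-tracer")

// undeliverableLogger is a pubsub.RawTracer that only logs messages the
// router fails to deliver to a subscription (the "subscriber too slow" path);
// every other trace callback is a no-op.
type undeliverableLogger struct{}

func (undeliverableLogger) UndeliverableMessage(msg *pubsub.Message) {
	log.Warnw("undeliverable pubsub message", "topic", msg.GetTopic(), "from", msg.GetFrom())
}

// Remaining RawTracer callbacks are intentionally no-ops.
func (undeliverableLogger) AddPeer(peer.ID, protocol.ID)          {}
func (undeliverableLogger) RemovePeer(peer.ID)                    {}
func (undeliverableLogger) Join(string)                           {}
func (undeliverableLogger) Leave(string)                          {}
func (undeliverableLogger) Graft(peer.ID, string)                 {}
func (undeliverableLogger) Prune(peer.ID, string)                 {}
func (undeliverableLogger) ValidateMessage(*pubsub.Message)       {}
func (undeliverableLogger) DeliverMessage(*pubsub.Message)        {}
func (undeliverableLogger) RejectMessage(*pubsub.Message, string) {}
func (undeliverableLogger) DuplicateMessage(*pubsub.Message)      {}
func (undeliverableLogger) ThrottlePeer(peer.ID)                  {}
func (undeliverableLogger) RecvRPC(*pubsub.RPC)                   {}
func (undeliverableLogger) SendRPC(*pubsub.RPC, peer.ID)          {}
func (undeliverableLogger) DropRPC(*pubsub.RPC, peer.ID)          {}

// newTracedPubSub shows where the tracer would be wired in; go-f3's actual
// pubsub construction and options live elsewhere.
func newTracedPubSub(ctx context.Context, h host.Host) (*pubsub.PubSub, error) {
	return pubsub.NewGossipSub(ctx, h, pubsub.WithRawTracer(undeliverableLogger{}))
}
```

Keeping the other callbacks as no-ops keeps the tracing overhead negligible while still surfacing the undeliverable messages we care about here.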

masih commented 4 days ago

The buffer size of 20 here may be too small for large networks.
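For reference, the subscriber-side channel capacity in go-libp2p-pubsub can be raised per subscription via the WithBufferSize SubOpt. A minimal sketch, assuming an already-joined topic handle (the helper name and the 512 value are illustrative, and this is not necessarily the exact code location the "20" above refers to):

```go
package example

import (
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

// subscribeWithLargerBuffer is a hypothetical helper: it subscribes to an
// already-joined topic with a larger output channel, giving the consume loop
// more slack before the router starts reporting "subscriber too slow".
func subscribeWithLargerBuffer(topic *pubsub.Topic) (*pubsub.Subscription, error) {
	// 512 mirrors the value tried in the comment below; it is not a tuned
	// recommendation for any particular network size.
	return topic.Subscribe(pubsub.WithBufferSize(512))
}
```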

masih commented 4 days ago

Bumping the buffer sizes to 512, both for the internal buffer and the pubsub subscriber buffer, seems to have fixed the issue at a network scale of 40%.

The true fix is to make the processing loop faster, or to buffer messages internally in a more sophisticated way inside GPBFT (and ideally hook that buffer up to validation/internal state so that messages which have not been processed yet, were valid when buffered, but have since become useless can be dropped early).
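One possible shape for such a smarter internal buffer, purely as an illustrative sketch: the `gpbftMessage`, `messageBuffer`, and staleness check below are hypothetical, not go-f3 types, and dropping by instance number is just one example of "became useless".

```go
package example

// gpbftMessage stands in for a validated GPBFT message; only the field used
// for the staleness check is shown.
type gpbftMessage struct {
	Instance uint64 // GPBFT instance the message belongs to.
}

// messageBuffer sits between the pubsub consume loop and the GPBFT
// processing loop so that a slow processor never blocks the subscription.
type messageBuffer struct {
	ch chan gpbftMessage
}

func newMessageBuffer(size int) *messageBuffer {
	return &messageBuffer{ch: make(chan gpbftMessage, size)}
}

// enqueue never blocks the consume loop: when the buffer is full it evicts
// the oldest buffered message to make room for the newest one.
func (b *messageBuffer) enqueue(msg gpbftMessage) {
	for {
		select {
		case b.ch <- msg:
			return
		default:
			// Buffer full: drop the oldest message and retry the send.
			select {
			case <-b.ch:
			default:
			}
		}
	}
}

// dequeue hands messages to the processing loop, skipping messages that were
// valid when buffered but target an instance that has since been decided.
func (b *messageBuffer) dequeue(currentInstance uint64) (gpbftMessage, bool) {
	for {
		select {
		case msg := <-b.ch:
			if msg.Instance < currentInstance {
				continue // stale; drop early without full processing
			}
			return msg, true
		default:
			return gpbftMessage{}, false
		}
	}
}
```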

masih commented 5 hours ago

There is a flurry of these messages every time we re-bootstrap a passive testing network.

I believe the cause could also be that the gpbft runner gets paused when the manifest changes and resumes again close to the bootstrap of the new manifest. The runner is where the consume loop of the topic subscription lives. This means that, while paused, no messages from the previous network topic get consumed, hence the "queue full" error.

If on archioz we also only observe this error during re-bootstrap, then I believe this is a non-issue. Though it will not hurt to bump the limits anyway.