Initial passive testing is happy but not next ones started with 10m succession

masih commented 2 days ago

Critical question: Why is it that the first test in the morning always seems to work well, while successive tests do not run as well?

Looking at the pubsub settings we forked over from Lotus, there are... a lot of questionable decisions that seem to be rooted in pre-F3 filecoin network behaviour (e.g. this).

I wonder whether the change of network between passive tests causes some loss of mesh or unfair peer scoring, such that the gossipsub mesh becomes ineffective to the point where messages simply do not propagate fast enough. Take invalid message scoring, for example: when networks change, it is inevitable that some messages arrive from the previous network and are considered invalid. We also observe a spike in invalid message errors in the validation flow, documented here, at the initial instance.

So...

Could the ineffective gossipsub be, at least to some extent, the result of the change in network during passive testing?
Are there parameters set in pubsub that unfairly reduce ranking or negatively impact the mesh by penalising what passive testing does (change of topic, resubscription, dropping messages between networks)?
Could the current pubsub settings, even within a single passive testing network, impact peer ranking as instances progress, e.g. due to a high rate of validation ignores?
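
To make the invalid-message concern concrete, here is a minimal sketch of how a gossipsub v1.1 style invalid-delivery penalty behaves: the penalty is a negative weight times the square of a per-topic counter, and the counter shrinks by a decay factor on every decay tick until it drops below a cut-off and is reset to zero. The function names and all numeric values are hypothetical illustrations, not the settings used by Lotus or go-f3.

```python
def invalid_delivery_penalty(counter: float, weight: float) -> float:
    """Penalty contribution: a (negative) weight times the squared counter."""
    return weight * counter ** 2


def decay_counter(counter: float, decay: float, decay_to_zero: float) -> float:
    """Apply one decay tick and snap values below the cut-off to zero."""
    counter *= decay
    return 0.0 if counter < decay_to_zero else counter


def ticks_until_cleared(initial: float, decay: float, decay_to_zero: float) -> int:
    """Count decay ticks until a burst of invalid messages is fully forgotten."""
    if not 0.0 < decay < 1.0:
        raise ValueError("decay must be strictly between 0 and 1")
    if decay_to_zero <= 0.0:
        raise ValueError("decay_to_zero must be positive")
    counter, ticks = initial, 0
    while counter > 0.0:
        counter = decay_counter(counter, decay, decay_to_zero)
        ticks += 1
    return ticks


if __name__ == "__main__":
    # Hypothetical values: 10 invalid messages arriving right after a network change.
    print(invalid_delivery_penalty(10, weight=-1000.0))
    print(ticks_until_cleared(10, decay=0.5, decay_to_zero=0.01))
```

Because the penalty is quadratic, a short burst of stale messages from the previous network can outweigh a long period of good behaviour until the counter has decayed, which is why the decay settings matter for tests started in quick succession.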

masih commented 2 days ago

And it looks like Lotus (and by extension Observer, F3, etc.) retains negative scores for 6 hours. This is set at the top-level pubsub configuration, so I assume it affects the whole pubsub instance, i.e. all topics for its lifetime.

rjan90 commented 2 days ago

Anecdotally I see a lot of PeerIDs with the exact same really high negative score:

lotus net scores
12D3KooWBPyrDyrTRchikR56W21cW3dQ5YRDeAgCZvPjw7jopfuU, -1795600.000000
12D3KooWBNh4V7JeEvYLKvSbGeMMMFJyB3vavEyEipqNYaZh9cNS, -1795600.000000
12D3KooWBNMVxsBq4T5T8qX8E1FWhfyVULDJ56a3mE1m6r3bEJ8f, -1795600.000000
12D3KooWAy4R5DgHcAuP7Z6CJyesQXkNPfoBFShMtdMtg1z3dhWS, -1795600.000000
12D3KooWAmPdJJcrNQ9qL4Dtj229kJ2VngPtrEmz6fd7duc6N8Q4, -1795600.000000
12D3KooWAewsJcXcVoEhCwfvD7zWwCPae8WtVvcL8nvy84HdNivL, -1795600.000000
12D3KooWAY9Vq9wzqRjzaoKheXPDVf9YCf1GpQ32V4mtjtxAaHPW, -1795600.000000
12D3KooWAPsAXsxBpuRJbjiX7cFzNsA8A1UZe8ikWsbgxZ7DDu5Y, -1795600.000000
12D3KooWAEZaEAwxco3Coho2c4KESS5Q868NYhXzSHAXdvwomYAt, -1795600.000000

A total of 139 on my node with the exact same negative score, out of a total of:

lotus net scores | wc -l
2045
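
Counts like these could be reproduced with a small script. The sketch below assumes the output has the `<peer ID>, <score>` line shape shown above; the function names are hypothetical and not part of Lotus.

```python
from collections import Counter


def parse_scores(output: str) -> dict[str, float]:
    """Parse '<peer ID>, <score>' lines into a mapping of peer ID to score."""
    scores = {}
    for line in output.splitlines():
        peer_id, sep, score = line.strip().rpartition(",")
        if not sep or not peer_id:
            continue  # blank line or not in the expected shape
        try:
            scores[peer_id.strip()] = float(score)
        except ValueError:
            continue  # skip lines whose score is not a number
    return scores


def summarize_negative(scores: dict[str, float]) -> tuple[int, list[tuple[float, int]]]:
    """Return the number of negatively scored peers and their scores by frequency."""
    negative = [s for s in scores.values() if s < 0]
    return len(negative), Counter(negative).most_common()


if __name__ == "__main__":
    import sys

    # Reads previously saved output from standard input.
    total_negative, by_score = summarize_negative(parse_scores(sys.stdin.read()))
    print("peers with negative score:", total_negative)
    for score, count in by_score:
        print(f"{score:.6f}: {count} peers")
```

Grouping by exact score makes clusters of peers that share one penalty value stand out immediately.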

Total number of PeerIDs that have negative scores is:

rjan90 commented 2 days ago

For clarity, I also grepped for the ones that subscribe to F3. Most have a score of 0, with some occasional negative ones, but none with the high negative score shown above ^^

{"ID":"12D3KooW9sCwBYPVGr9T7A5DMzk8qF4wdGtTGSREK7kMLdJDBLR6","Score":{"Score":0,"Topics":{"/f3/granite/0.0.2/filecoin/21":{"TimeInMesh":0,"FirstMessageDeliveries":0,"MeshMessageDeliveries":0,"InvalidMessageDeliveries":0}},"AppSpecificScore":0,"IPColocationFactor":0,"BehaviourPenalty":0}}
{"ID":"12D3KooW9rUCW2eEmbZsGarEBzdh7RwqZXzVhm5yW4GHpM4PxGLV","Score":{"Score":0,"Topics":{"/f3/granite/0.0.2/filecoin/21":{"TimeInMesh":0,"FirstMessageDeliveries":0,"MeshMessageDeliveries":0,"InvalidMessageDeliveries":0}},"AppSpecificScore":0,"IPColocationFactor":0,"BehaviourPenalty":0}}
{"ID":"12D3KooW9qsRsJmXkgYuyJnZNDwpB75Lhs1dw6myiNFDTLgwgbQA","Score":{"Score":0,"Topics":{"/f3/granite/0.0.2/filecoin/21":{"TimeInMesh":0,"FirstMessageDeliveries":0,"MeshMessageDeliveries":0,"InvalidMessageDeli

And the ones with extremely high negative scores are due to IPColocationFactor

{"ID":"12D3KooWBNMVxsBq4T5T8qX8E1FWhfyVULDJ56a3mE1m6r3bEJ8f","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAy4R5DgHcAuP7Z6CJyesQXkNPfoBFShMtdMtg1z3dhWS","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAmPdJJcrNQ9qL4Dtj229kJ2VngPtrEmz6fd7duc6N8Q4","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAewsJcXcVoEhCwfvD7zWwCPae8WtVvcL8nvy84HdNivL","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAY9Vq9wzqRjzaoKheXPDVf9YCf1GpQ32V4mtjtxAaHPW","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAPsAXsxBpuRJbjiX7cFzNsA8A1UZe8ikWsbgxZ7DDu5Y","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWAEZaEAwxco3Coho2c4KESS5Q868NYhXzSHAXdvwomYAt","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
{"ID":"12D3KooWA4kybwSTq57KfMxJ4unVPbFTXTxRpe1S5HcKALRvu2FY","Score":{"Score":-1876900,"Topics":null,"AppSpecificScore":0,"IPColocationFactor":18769,"BehaviourPenalty":0}}
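
The numbers above fit the gossipsub IP colocation rule, where the factor is the square of the number of peers on one IP beyond a threshold, and the score contribution is that factor times a negative weight. In the sketch below, the weight of -100 is inferred from the snapshot (-1876900 / 18769), while the threshold of 5 is an assumption that has not been checked against the running configuration. Function names are hypothetical.

```python
def ip_colocation_factor(peers_on_ip: int, threshold: int) -> int:
    """Square of the surplus of peers sharing an IP; zero at or below the threshold."""
    surplus = peers_on_ip - threshold
    return surplus ** 2 if surplus > 0 else 0


def ip_colocation_score(peers_on_ip: int, threshold: int, weight: float) -> float:
    """Score contribution: a (negative) weight times the colocation factor."""
    return weight * ip_colocation_factor(peers_on_ip, threshold)


if __name__ == "__main__":
    WEIGHT = -100.0   # inferred from the snapshot: -1876900 / 18769
    THRESHOLD = 5     # assumption, not confirmed from the node's settings

    # 139 peers sharing an IP would give the score seen in the earlier listing.
    print(ip_colocation_score(139, THRESHOLD, WEIGHT))
    # 142 peers sharing an IP would give the factor 18769 seen in the snapshot.
    print(ip_colocation_factor(142, THRESHOLD), ip_colocation_score(142, THRESHOLD, WEIGHT))
```

Under these assumptions, the earlier score of -1795600 corresponds to 139 peers behind one IP, which matches the 139 PeerIDs reported with that exact score, and the later -1876900 corresponds to 142. Since `Topics` is null for these peers, the penalty appears to come from address sharing rather than from anything they did on the F3 topic.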

Another test after a prolonged pause should be run to rule out peer scores, but it does not seem that PeerIDs get negatively scored.

filecoin-project / go-f3

Initial passive testing is happy but not next ones started with 10m succession #765