filecoin-project / lily

capturing on-chain state for the filecoin network

Verify the accuracy of visor data in the analysis schema #372

Open olizilla opened 3 years ago

olizilla commented 3 years ago

Description

How would we verify that a complete extraction of chain data via visor would allow us to run aggregation queries that gave the same answer as the filecoin state tree at a given epoch? Is that even a sensible goal for the analysis db? Should users be able to run financial reports on the analysis db and expect to get the same answer as if they had queried a lotus node?

Is our goal reasonably high-precision (for graphing trends) or perfect convergence (for financial reports)? We should figure out what is possible and make a decision on it to help inform where to put our efforts.

Either way, it would be useful for us to have a verification process that allows us to put a figure on how accurate the collected data is. We could show some aggregations' distance from the chain state, or similar. Other ideas are needed here, and this issue is intended to drive discussion at this stage.

Acceptance criteria

Where to begin

Discuss it!

placer14 commented 3 years ago

Is that even a sensible goal for the analysis db?

I would like visor, as configured for the analysis environment, to be primarily a transform function over on-chain state that produces data for a specific schema. While parts of the schema do return answers as though they came from a live lotus node, there are some exceptions which break this assumption. (We have schemas such as miner_sector_events which are an enumeration of state changes; a user must understand that design to derive the latest-state answer a live node would provide, as opposed to the earlier-state answers which also reside in the dataset.)
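To make the latest-state point concrete, here is a minimal sketch using an in-memory SQLite table standing in for miner_sector_events. The table layout, column names, and event strings are illustrative assumptions, not visor's actual schema; the point is only that the node-like "current state" must be derived by taking the most recent event per sector.

```python
import sqlite3

# Hypothetical miniature version of miner_sector_events; real column
# names and event values in visor's schema may differ.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE miner_sector_events (sector_id INTEGER, height INTEGER, event TEXT);
INSERT INTO miner_sector_events VALUES
  (7, 100, 'PRECOMMIT_ADDED'),
  (7, 150, 'SECTOR_ADDED'),
  (9, 120, 'PRECOMMIT_ADDED');
""")

# The latest event per sector approximates the answer a live node
# would give; earlier-state rows remain in the dataset alongside it.
latest = db.execute("""
    SELECT sector_id, event FROM miner_sector_events e
    WHERE height = (SELECT MAX(height) FROM miner_sector_events
                    WHERE sector_id = e.sector_id)
    ORDER BY sector_id
""").fetchall()
print(latest)  # [(7, 'SECTOR_ADDED'), (9, 'PRECOMMIT_ADDED')]
```

A naive aggregation over the raw table would double-count sector 7; the correlated subquery is what recovers the latest-state view.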

More, there are some semantics that certain APIs assume which may cause them to deviate from the actual data in the DB. (Such as aggregations over time which have been averaged, causing loss of signal but remaining suitable for operational spot-checks in lieu of a handy Sentinel DB to query.) These semantics could easily change, which may cause overhead for any tests we write.

Should users be able to run financial reports on the analysis db and expect to get the same answer as if they had queried a lotus node? ... Is our goal reasonably high-precision (for graphing trends) or perfect convergence (for financial reports)?

Visor data has been actively applied in both use cases, so I believe we should default to the highest precision that is feasible for us. (Note: no infeasibility has yet arisen for visor that forced precision to be sacrificed, AFAIK. I mention feasibility only to relax the requirement of "strictness of precision at all costs"; the drone schema does have some exceptions, for example in tracking mempool state.) I think our data may have areas which are not "financially perfect" but are "graph suitable", so the length we go to in verifying accuracy should be decided case by case, per schema.

Either way it would be useful for us to have a verification process that allows us to put a figure on how accurate the collected data is. We could show some aggregations' distance from the chain state or similar. Other ideas are needed here, and this issue is intended to drive discussion at this stage.

I support this idea for the majority of visor data. Limitations might exist for schemas that have no reciprocal node API endpoint to verify against. I see part of Sentinel's value proposition as "confidence in the data collected"; having a report that validates accuracy would go a long way to support that.

Some tables can only be meaningfully aggregated over if they have full extractions from genesis, and we should identify which those are.

Stating explicitly what you're implying here: there are properties that certain schemas must intrinsically demonstrate, independent of a lotus node, and these should be included in our checks. As mentioned above, that a "complete" extraction from genesis is contained within the set. But also that the analysis DB has no chain forks in the set. Or that receipts are 1:1 with their message counterparts. Or that uniqueness constraints have been maintained.
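Two of these intrinsic checks can be sketched as simple SQL queries over a toy in-memory database. The messages/receipts tables and their columns below are illustrative assumptions rather than visor's actual schema; the invariants tested (receipts 1:1 with messages, no gaps in extracted heights) are the ones named above.

```python
import sqlite3

# Hypothetical miniature schema standing in for the analysis DB;
# real table/column names in visor's schema may differ.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (cid TEXT PRIMARY KEY, height INTEGER);
CREATE TABLE receipts (message_cid TEXT, height INTEGER);
INSERT INTO messages VALUES ('m1', 1), ('m2', 2);
INSERT INTO receipts VALUES ('m1', 1), ('m2', 2);
""")

# Check 1: every message has exactly one receipt (1:1 invariant).
orphans = db.execute("""
    SELECT m.cid FROM messages m
    LEFT JOIN receipts r ON r.message_cid = m.cid
    GROUP BY m.cid
    HAVING COUNT(r.message_cid) != 1
""").fetchall()

# Check 2: no gaps between extracted heights (completeness toward
# a "complete" extraction; a real check would also anchor at genesis).
heights = [h for (h,) in db.execute(
    "SELECT DISTINCT height FROM messages ORDER BY height")]
gaps = [h for a, b in zip(heights, heights[1:]) for h in range(a + 1, b)]

print(orphans, gaps)  # both empty => invariants hold on this toy data
```

Checks like these need no node at all, which is what makes them attractive for schemas lacking a reciprocal API endpoint.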

I would like to see a small enumeration of proposed checks and how they might be validated, for more concreteness around this discussion. Diversity in validation approaches is welcome, to get a bigger picture of which components might be involved in the design. For example, the checks above might be validated with a simple SQL query. More complex scenarios may involve a CAR file as the extraction source for visor, which is then started as a backend with lotus, whose API is consumed for verification against the resulting schema.
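For the node-backed scenario, the comparison step could reduce to a small accuracy report between the DB's answer and the node's answer. Everything here is a hedged sketch: the two input values are stubbed (in practice one would come from a SQL aggregation and the other from a lotus API call), and the 1% "graph suitable" threshold is purely illustrative.

```python
# Hypothetical comparison harness: db_value would come from an
# aggregation over the analysis DB, node_value from a lotus API
# query at the same epoch; both are stubbed with example numbers.
def accuracy_report(db_value: int, node_value: int) -> dict:
    """Report absolute and relative distance between the two answers."""
    abs_err = abs(db_value - node_value)
    rel_err = abs_err / node_value if node_value else float("inf")
    return {
        "abs_error": abs_err,
        "rel_error": rel_err,
        "financially_exact": abs_err == 0,   # perfect convergence
        "graph_suitable": rel_err < 0.01,    # 1% threshold is illustrative
    }

report = accuracy_report(db_value=998_500, node_value=1_000_000)
print(report)
```

Running such a report per schema would put a concrete figure on the "high precision vs. perfect convergence" question raised in the issue description.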

A final thought: consider the vector spike that @frrist has started in #370. Its purpose is benchmarking, but I think smoke tests could emerge from it that would also benefit from generated accuracy reports.

frrist commented 3 years ago

@iand I'm going to leave this assigned to you and assume it will be covered by the data publishing task you are undertaking.