data-preservation-programs / RetrievalBot

A scalable framework to perform retrieval testing over Filecoin network

RB data quality - outlier data point detection #19

Open xmcai2016 opened 1 year ago

xmcai2016 commented 1 year ago

Now that we have a few more Reputation Bots on the horizon, I suggest we implement an outlier detection mechanism. The goal is to weed out untrustworthy data, whether it comes from intentional abuse or unintentional mistakes. Outlier data points should be filtered out when we output collective Retrieval Bot data to dashboards, the GitHub bot, and any other downstream consumer. We can initially define an outlier as a data point that deviates by more than 10% from the median data point for the same sp_id, as measured by different Retrieval Bot instances. I'm open to suggestions on alternative definitions, and the definition can be tweaked on the fly as we gather more empirical data. We should also keep track of the source of each outlier, which will help us root-cause any skews in the collected data.
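
To make the proposed definition concrete, here is a rough Python sketch of the filter. It is only an illustration under assumptions, not the RetrievalBot implementation: the `Measurement` record, the `bot_id` field, the `value` metric, and both function names are hypothetical, and "10% deviated" is read as a relative deviation from the per-sp_id median that exceeds 0.10.

```python
from collections import defaultdict
from statistics import median
from typing import NamedTuple


class Measurement(NamedTuple):
    sp_id: str    # storage provider the data point is about
    bot_id: str   # hypothetical identifier of the reporting bot instance
    value: float  # hypothetical metric, e.g. a retrieval success rate


def split_outliers(measurements, threshold=0.10):
    """Split data points into (kept, outliers) per sp_id.

    Assumption: a point is an outlier when its relative deviation from
    the median of all points for the same sp_id exceeds `threshold`.
    """
    by_sp = defaultdict(list)
    for m in measurements:
        by_sp[m.sp_id].append(m)

    kept, outliers = [], []
    for group in by_sp.values():
        med = median(m.value for m in group)
        for m in group:
            if med == 0:
                # Relative deviation is undefined for a zero median, so
                # treat any non-zero value as an outlier (an assumption).
                is_outlier = m.value != 0
            else:
                is_outlier = abs(m.value - med) / abs(med) > threshold
            (outliers if is_outlier else kept).append(m)
    return kept, outliers


def outliers_by_source(outliers):
    """Count outliers per reporting bot, to help root-cause skews."""
    counts = defaultdict(int)
    for m in outliers:
        counts[m.bot_id] += 1
    return dict(counts)


# Made-up sample data for illustration only.
sample = [
    Measurement("f01", "bot-a", 0.90),
    Measurement("f01", "bot-b", 0.92),
    Measurement("f01", "bot-c", 0.50),
    Measurement("f02", "bot-a", 0.80),
    Measurement("f02", "bot-c", 0.80),
]
kept, outliers = split_outliers(sample)
```

With this made-up sample, only the `bot-c` point for `f01` is flagged, since it sits far from the median of 0.90. One caveat with a median-based rule: with only two bots reporting on an sp_id, the median is the midpoint of the two values, so the rule cannot tell which of the two is wrong. A minimum number of reporting instances per sp_id may be worth adding to the definition.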