filecoin-station / spark-stats

API exposing public statistics about Spark

Expose per-round measurement count #214

Closed · juliangruber closed this issue 2 months ago

juliangruber commented 2 months ago

Required for https://github.com/filecoin-station/spark-api/pull/385

In addition to measurements/daily, add measurements/round/:n or measurements/per-round/:n.
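
A rough sketch of what the handler behind such a route could look like; this is only an illustration, and the table and column names below are hypothetical, not the actual spark-stats schema:

```js
// Hypothetical handler for GET /measurements/round/:n, assuming a
// node-postgres pool (`pgPool`). All table/column names are made up
// for illustration.
const getMeasurementsPerRound = async (pgPool, roundIndex) => {
  const { rows } = await pgPool.query(
    'SELECT measurement_count FROM measurements_per_round WHERE round_index = $1',
    [roundIndex]
  )
  if (rows.length === 0) return null
  return { round: roundIndex, measurements: Number(rows[0].measurement_count) }
}
```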

bajtos commented 2 months ago

We are storing daily values to keep the database size reasonable and the aggregate queries fast.

I propose to introduce a new table for these per-round measurement counters.

Questions to consider:

- Do we need/want to keep historical per-round measurement values for https://github.com/filecoin-station/spark-api/pull/385? If not, then we can keep only the data for the last X rounds, e.g. 72 rounds (one day).
- Which measurements do you want to count - accepted measurements only, measurements in majority, or all measurements?

You will need to start in spark-evaluate: create a DB schema migration script to add the new table, and modify lib/public-stats.js to populate it.
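
For illustration, a minimal sketch of what those two pieces could look like; every name here (the table, its columns, `pgClient`, the function names) is an assumption, not the actual spark-evaluate code:

```js
// Hypothetical migration adding the per-round counter table.
export const createMeasurementsPerRoundTable = async (pgClient) => {
  await pgClient.query(`
    CREATE TABLE measurements_per_round (
      round_index BIGINT PRIMARY KEY,
      measurement_count BIGINT NOT NULL DEFAULT 0
    )
  `)
}

// Hypothetical snippet for lib/public-stats.js: record how many
// measurements were evaluated in the given round.
export const updatePerRoundMeasurementCount = async (pgClient, roundIndex, count) => {
  await pgClient.query(`
    INSERT INTO measurements_per_round (round_index, measurement_count)
    VALUES ($1, $2)
    ON CONFLICT (round_index)
    DO UPDATE SET measurement_count =
      measurements_per_round.measurement_count + EXCLUDED.measurement_count
  `, [roundIndex, count])
}
```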

juliangruber commented 2 months ago

> Do we need/want to keep historical per-round measurement values for https://github.com/filecoin-station/spark-api/pull/385? If not, then we can keep only the data for the last X rounds, e.g. 72 rounds (one day).

Great point! In the current algorithm, we only ever need the previous round. From that perspective, if we can make the API only expose stats for the previous round, then we can make the consumer even simpler: It doesn't need to know the index of the previous round.

To me this is a win on storage and ease of use 👍
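
If we only keep recent rounds, pruning can be a single statement run after each round is recorded; a sketch against the hypothetical table above:

```js
// Hypothetical retention step: keep only the most recent 72 rounds
// (roughly one day), deleting everything older.
export const pruneOldRounds = async (pgClient) => {
  await pgClient.query(`
    DELETE FROM measurements_per_round
    WHERE round_index < (
      SELECT MAX(round_index) - 72 FROM measurements_per_round
    )
  `)
}
```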

> Which measurements do you want to count - accepted measurements only, measurements in majority, or all measurements?

I was wondering about this as well! Honest nodes will be more likely to respect the task count set out by spark-api, so: majority measurements > accepted measurements > all measurements. I think majority measurements is the best one. Wdyt?

bajtos commented 2 months ago

On second thought, I am not very happy about tracking measurements-per-round in spark-stats. It requires keeping three repos in sync (spark-api, spark-stats, spark-evaluate).

Have you considered the following alternative?

Instead of going through spark-publish -> spark-evaluate -> spark-stats -> spark-api, we can keep everything inside the spark-api repository & database. We don't even need to expose this in a public API.

Of course, this works only if we want to use "total measurements".

> From that perspective, if we can make the API only expose stats for the previous round, then we can make the consumer even simpler: It doesn't need to know the index of the previous round.

I think this is going to be problematic: it takes a while for the measurements to go through the smart contract and spark-evaluate. Depending on when exactly your task-limit algorithm executes, it will see data from round N-1 or N-2. I am concerned this will make troubleshooting difficult. It will also complicate caching.

> Which measurements do you want to count - accepted measurements only, measurements in majority, or all measurements?

> I was wondering about this as well! Honest nodes will be more likely to respect the task count set out by spark-api, so: majority measurements > accepted measurements > all measurements. I think majority measurements is the best one. Wdyt?

IMO, this depends on what we are trying to achieve.

My understanding is that we want to limit the load created by the Station network. In that light, we must limit the total number of measurements performed.

juliangruber commented 2 months ago

> My understanding is that we want to limit the load created by the Station network. In that light, we must limit the total number of measurements performed. So instead of going through spark-publish -> spark-evaluate -> spark-stats -> spark-api, we can keep everything inside the spark-api repository & database. We don't even need to expose this in a public API.

> Of course, this works only if we want to use "total measurements".

Right! Ok, total measurements it is.

Cool, I'll implement this internally in spark-api :)
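
For the record, a minimal sketch of the kind of internal counting this could be, assuming spark-api already stores each incoming measurement in Postgres; every name here is an assumption, not the actual spark-api code:

```js
// Hypothetical: bump a per-round counter in the same transaction that
// stores the measurement, so the count and the data can never diverge.
export const recordMeasurement = async (pgClient, roundIndex, payload) => {
  try {
    await pgClient.query('BEGIN')
    await pgClient.query(
      'INSERT INTO measurements (round_index, payload) VALUES ($1, $2)',
      [roundIndex, payload]
    )
    await pgClient.query(`
      INSERT INTO measurement_counts (round_index, total)
      VALUES ($1, 1)
      ON CONFLICT (round_index)
      DO UPDATE SET total = measurement_counts.total + 1
    `, [roundIndex])
    await pgClient.query('COMMIT')
  } catch (err) {
    await pgClient.query('ROLLBACK')
    throw err
  }
}
```

The task-limit algorithm can then read the counter for the previous round straight from spark-api's own database, with no public API involved.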