Discussion: expectations for data replication - Githubissues

filecoin-project / Allocator-Governance

Discussion: expectations for data replication #86

Open willscott opened 1 month ago

willscott commented 1 month ago

In doing our allocator review of proposed clients, e.g. https://github.com/fidlabs/Open-Data-Pathway/issues/43#issuecomment-2226795779 - we see datasets being proposed which already seem to be well replicated on the Filecoin network.

It would be great to develop a set of norms, as a Filecoin program, for the level of replication at which the marginal value to the network of additional copies falls below what the Fil+ goals call for.

My intuition is that there are a couple lines of argument:

- We can talk about locations: if data is not yet replicated in a way that allows low-latency access in a specific region, that may warrant additional replication.
- We can talk about data prep quality: if data has previously been uploaded as a bulk data set but without provenance, a subsequent upload holds additional value that would not have been captured before, provided it includes a justification of the semantic transformation (e.g. individual data items / files becoming individually accessible) and the uploaded data is verifiable (e.g. a downloaded piece can be transformed back and confirmed by an external party to be part of the original data set).

However, I don't think the 100th replication of a data set is, by itself, doing much for the network. I am opening this discussion so that we can, as a community, agree on a higher bar for allocator due diligence in this regard.

lyjmry commented 1 month ago

I have already explained: I mean not only the latest data but also historical data, because SP sectors will expire, copies will be lost, and SPs will close retrieval. There will also be long-term updates with future data.

lyjmry commented 1 month ago

You can carefully check all the LDNs and their data usage range. Are all 50 LDNs replicating the same copy? Are they all used by one company? Are SP retrievals for them working normally?

willscott commented 1 month ago

I am not arguing that this specific upload is going to fall below the current quality bar. I am instead arguing that the quality bar is too low, and that we should, as a community, be pushing for a higher level of value accrual in Filecoin deals across the board.

There are 50 clients * 5 replicas = ~250 copies / SPs already holding the bulk of this data. Arguing that copies will be lost or that SPs will close retrieval does not give me any particular confidence that this set of uploads will be any better than the others. What are we gaining with these additional uploads?

lyjmry commented 1 month ago

I think you didn't read the instructions carefully: my purpose is to track the newly generated data and to supplement the copies of the old data.

willscott commented 1 month ago

(That conversation is for the issue in Open-Data-Pathway. This discussion is meant to be about broader allocator policy, not the specific application)

lyjmry commented 1 month ago

> (That conversation is for the issue in Open-Data-Pathway. This discussion is meant to be about broader allocator policy, not the specific application)

Thanks for the clarification. Fil+ will get better and better.

lyjmry commented 1 month ago

If you still have any concerns about https://github.com/fidlabs/Open-Data-Pathway/issues/43, please raise them there and I will do my best to address them.