Open kevzak opened 3 months ago
Data Prep Review and Call to Action
Background: Over the past 5 months, the FIDL Open Data Pathway has overseen 10-15 applicants looking to onboard a public dataset (LINK).
We've found that the data preparation portion of the questioning on the GitHub application is limited in providing sufficient information about how data is being transformed and made available for retrieval.
Process Change Tested: In an effort to ensure the proper data preparation and retrieval expectations of the Fil+ program are being met, we've worked with each of our applicants to better understand their projects.
Below are some of the questions we've implemented as part of our client diligence to better validate the data preparation details:
What specifically are the datasets from the (enter your weblink) site that you are committing to store? All of it, or a specified enumeration?
What is the transformation between the files available for download and what will be stored on Filecoin?
When we sample your deals, how will we be able to confirm that the data has come from the dataset?
How is the data transformed into deals for Filecoin?
When a deal is sampled for verification, how will we be able to confirm that it is part of this dataset? (How is it chunked into CAR files?)
Given a 32GB payload, what steps can an independent entity take to confirm it comes from the relevant upstream dataset?
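To make the last two questions concrete, here is a minimal sketch of one way an independent party could run such a check, assuming the sampled CAR has already been unpacked into plain files with existing CAR tooling and that the upstream dataset publishes a conventional `SHA256SUMS`-style checksum list. The filenames, directory layout, and checksum format here are illustrative assumptions, not part of the proposal.

```python
import hashlib
import sys
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so multi-GB payload files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def load_upstream_checksums(sums_file: Path) -> dict:
    """Parse a 'SHA256SUMS'-style file published by the upstream dataset: '<hex digest>  <path>' per line."""
    checksums = {}
    for line in sums_file.read_text().splitlines():
        if line.strip():
            digest, name = line.split(maxsplit=1)
            checksums[name.strip().lstrip("*")] = digest.lower()
    return checksums


def verify_extracted_payload(extracted_dir: Path, sums_file: Path) -> bool:
    """Check that every file unpacked from the sampled deal matches an entry in the upstream checksum list."""
    upstream = load_upstream_checksums(sums_file)
    all_match = True
    for item in sorted(p for p in extracted_dir.rglob("*") if p.is_file()):
        rel = item.relative_to(extracted_dir).as_posix()
        if upstream.get(rel) == sha256_of(item):
            print(f"MATCH     {rel}")
        else:
            print(f"MISMATCH  {rel} (absent from upstream list, or digest differs)")
            all_match = False
    return all_match


if __name__ == "__main__":
    # Usage: python verify_payload.py <dir extracted from the sampled CAR> <SHA256SUMS from the dataset site>
    sys.exit(0 if verify_extracted_payload(Path(sys.argv[1]), Path(sys.argv[2])) else 1)
```

If the data preparer instead publishes their own manifest, the same comparison applies; the point of the questions is simply that some public artifact exists that ties each deal back to named, checksummed upstream files.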
We have successfully seen datasets prepared for Filecoin, ranging from the Internet Archive, to Wikipedia, to the chain itself. You can either store the metadata structure above the individual file chunks, as is done by e.g. web3.storage, or you can have a separate, well-advertised metadata layer via e.g. a website.
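As a hedged illustration of the second option (a separate, well-advertised metadata layer), a data preparer could publish something as simple as a JSON manifest tying each advertised piece back to the source files packed into it. All field names and values below are hypothetical, not an existing Fil+ or web3.storage format.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(dataset: str, source_url: str, pieces: dict) -> dict:
    """pieces maps a piece CID (the CommP advertised in the deal) to the source files packed into that CAR."""
    entries = []
    for piece_cid, files in pieces.items():
        for f in files:
            entries.append({
                "piece_cid": piece_cid,                                # deal-level identifier to look up on chain
                "path": f.as_posix(),                                  # where the file sits in the prepared layout
                "size": f.stat().st_size,
                "sha256": hashlib.sha256(f.read_bytes()).hexdigest(),  # stream this for multi-GB files in practice
            })
    return {"dataset": dataset, "source": source_url, "files": entries}


if __name__ == "__main__":
    # Tiny self-contained demo with throwaway files; a real run would point at the prepared dataset.
    demo = Path("demo_files")
    demo.mkdir(exist_ok=True)
    part_a = demo / "part-000.txt"
    part_a.write_text("hello")
    part_b = demo / "part-001.txt"
    part_b.write_text("world")

    manifest = build_manifest(
        "example-open-dataset",
        "https://example.org/dataset/",                        # placeholder for the real upstream site
        {"baga6ea4seaq...examplepiececid": [part_a, part_b]},  # truncated, hypothetical piece CID
    )
    Path("manifest.json").write_text(json.dumps(manifest, indent=2))
    print(json.dumps(manifest, indent=2))
```

Whether the metadata lives above the file chunks in the DAG itself or in a sidecar like this, what matters for review is that a sampled deal can be mapped back to named, checksummed upstream files.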
We want to see how a client could make use of this dataset. Can you share details?
1) This could be a client script that iterates over / processes the data. 2) This could be a webpage that allows browsing / identifying specific data within the dataset. 3) This could be identifying clients who make use of this data.
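A minimal sketch of option 1, assuming a manifest like the one above is published and that each entry also carries a retrievable per-file content CID; the `root_cid` field and the gateway URL are placeholders chosen for this example, not an agreed-upon format.

```python
import json
from pathlib import Path
from urllib.request import urlopen

# Any public IPFS/Filecoin HTTP gateway, or the storage provider's own retrieval endpoint.
GATEWAY = "https://ipfs.io/ipfs/"


def iter_dataset(manifest_path: Path):
    """Yield (path, raw bytes) for every file listed in the published manifest."""
    manifest = json.loads(manifest_path.read_text())
    for entry in manifest["files"]:
        with urlopen(GATEWAY + entry["root_cid"]) as response:  # 'root_cid' is a hypothetical manifest field
            yield entry["path"], response.read()


if __name__ == "__main__":
    # Example processing pass: report the size of each retrieved file to show the data is actually usable.
    for path, blob in iter_dataset(Path("manifest.json")):
        print(f"{path}: {len(blob)} bytes")
```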
Call to Action: We are asking the community and governance team for feedback on this line of questioning toward public open dataset onboarding.
If agreed, we would like to propose editing the current GitHub application template to include these questions as required questions for all public open dataset allocators.
Allocators and clients would then be held to this level of data prep detail expectation moving forward.
Looks good!
Hi @kevzak, this is a valid recommendation, and we will use these questions as a reference in our subsequent reviews.
Hi Kevin @kevzak, I think your proposal is very good. I have been participating in the Filecoin project for 4 years, and I have been thinking about how stored data can actually be used.
First, there is the issue of retrieving data. Because there is no contractual agreement and we rely only on program administration, this cannot last long. We know data retrieval has been a big issue in the community. I think only by designing a reward mechanism can we ensure miners' retrieval rates; if there are rewards, miners will be happy to support retrieval.
Second, there is the issue of using data. We should store data that people can actually use. When applying, clients must write down how the data will be used, what benefits it has, and other characteristics, so that users who need it can make use of it.
Third, there is the issue of data authenticity. This requires self-demonstration or community inspection. As you said, we sample the data and check that it comes from the claimed dataset. I will also conduct data authenticity checks later.
As an allocator, it is also important to balance ease of client data onboarding against effective growth of SPs' power. Both overly stringent and overly lenient approaches can be detrimental to the development of the Filecoin ecosystem. Entrusting this decision-making to the allocator instead reflects the important role the allocator plays.
My understanding is that the intention of expanding Fil+ to 'public, open data sets' was to provide an additional option when a direct paying client couldn't easily be found - that value could instead be provided to the Filecoin ecosystem by adding open datasets that could later be computed on or worked with by downstream clients.
In order for that value to be realized, I agree with @kevzak that it's important for the data preparation around such public open data sets to also be public and explicit. A future client can't get that value if it doesn't know how to interact with the data that's in Filecoin.
The main reason I can imagine that it could be seen as detrimental to share such scripts / data preparation processes is that they are seen as a cost the preparer has invested in, and the preparer doesn't want others to be able to freeload. I think with https://github.com/filecoin-project/Allocator-Governance/issues/86 this is less of a problem, because an investment in a new data preparation process is needed for each dataset anyway.
There is no reliable retrieval method available, even now. Boost is still considered "experimental" by its own devs. It is very fragile to work with, with constant indexing errors.
It's premature to talk about retrievals without proper tooling support. Retrieval sampling should not be one of the evaluation criteria to begin with, until a rock-solid solution is available.
Agree with @kernelogic. All of the above problems come back to retrieval. The current retrieval criteria are based on the Spark team's tool, but at the moment it only provides a retrieval rate, and I am sure there are still problems to be worked out with that retrieval-rate metric.
As for retrieving complete data, can that only be handled manually at the moment, or is there already a path in development to view the audited data through other interfaces?
Such an operation may create a more complex workload for both clients and SPs. The vast majority of SPs are not in a very good position in the current Filecoin ecosystem; should we propose simplifying these paperwork processes instead?
Feels like it will cause more hassle for clients and SPs and will make storing data more complicated.
Without the proper tools, these questions could lead to a more complicated operation and validation process, increasing the workload for all parties involved, including SPs, clients, allocators, and even the governance team.