
Modification: Request for "Per Dataset" Applications in the Filecoin+ Program #832

Closed: herrehesse closed this issue 3 months ago

herrehesse commented 1 year ago

Issue Description

Some applicants in the Filecoin+ program are submitting merged requests for large-scale DataCap, which could reduce transparency and increase the risk of abuse.

Impact

Dear Filecoin+ Community,

I am writing to request that all applicants in the Filecoin+ program submit "per dataset" applications instead of merged requests. Recently, some applicants have been filing large-scale DataCap requests (LDNs) that combine several smaller sets of data. While this is not necessarily an issue in and of itself, it makes the program less transparent and increases the risk of misconduct and abuse.

To keep the Filecoin+ program as transparent and trustworthy as possible, I urge all applicants to submit a separate application for each dataset they wish to store on the network. This will help to prevent abuse and ensure that the community can easily verify the legitimacy of each request.

I understand that this may require some additional effort on the part of applicants, but I believe it is necessary to maintain the integrity of the program.

Issues

Possible Solutions

laurarenpanda commented 1 year ago

As I mentioned in my application, instead of submitting an individual application for each dataset, it would be better to let filplus-checker support a multi-LDN DataCap and CID Checker report. We have tried to generate a multi-LDN report using the code from filplus-checker with filplus.info's database, and it worked well.

This function could benefit not only LDNs with large public datasets but also LDNs with enterprise data or E-FIL+ programs. There is still a 5 PiB limitation on each LDN application, so having multiple LDNs for one program is unavoidable. And as RG mentioned in Slack, clients do NOT need to use the same address for multiple LDNs of the same dataset, so multiple LDNs will use different addresses to receive DataCap and run into the problem you mentioned above.

Why don't we improve the current filplus-checker bot to solve this potential issue? Tools are always better than manual handling.
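
For illustration only, here is a minimal sketch of the kind of aggregation a multi-LDN report implies, assuming a hypothetical deal table keyed by client address; the real filplus-checker code, schema, and report format may differ.

```ts
// Hypothetical sketch: aggregate verified-deal stats across the client
// addresses of several LDNs belonging to one program, so a single
// checker-style report covers all of them. Field names are assumptions.
interface DealRow {
  clientAddress: string;  // address used by one of the program's LDNs
  providerId: string;     // storage provider (SP) ID
  pieceSizeBytes: number; // size of the sealed piece
}

function buildMultiLdnReport(deals: DealRow[], ldnAddresses: string[]) {
  const addressSet = new Set(ldnAddresses);
  const perProvider = new Map<string, number>();
  let totalBytes = 0;

  for (const deal of deals) {
    if (!addressSet.has(deal.clientAddress)) continue; // skip unrelated clients
    totalBytes += deal.pieceSizeBytes;
    perProvider.set(
      deal.providerId,
      (perProvider.get(deal.providerId) ?? 0) + deal.pieceSizeBytes,
    );
  }

  // Same shape as a per-LDN report, but spanning every address of the program.
  return {
    totalBytes,
    providerCount: perProvider.size,
    perProvider: Object.fromEntries(perProvider),
  };
}
```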

laurarenpanda commented 1 year ago

Sorry for missing this important info in the README. The Checker Bot already supports multi-LDN checker reports.

(Screenshot: 2023-02-23 10:13:28)
kernelogic commented 1 year ago

My opinion: if a particular client provides better indexing/browsing/searching than the bot, it should be acceptable to have merged requests for a more streamlined onboarding experience. For example: Slingshot Evergreen, Slingshot V3, my Singularity browser, Cabrina's browser, FileDrive's browser...

There are also actual benefits: some public datasets are quite small (< 10T) and no one will want to onboard them on their own. Having a merged LDN gives them a chance to be onboarded.
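
As a rough, purely illustrative calculation (the dataset count and sizes below are assumptions, not taken from any real application), bundling many such small datasets still fits easily inside a single LDN's 5 PiB cap:

```ts
// Rough illustration only: a bundle of many small public datasets stays
// comfortably inside one LDN's 5 PiB cap (all figures are assumptions).
const datasetCount = 100;
const avgDatasetTiB = 10;                       // "small" datasets as above
const bundleTiB = datasetCount * avgDatasetTiB; // 1000 TiB
const bundlePiB = bundleTiB / 1024;             // ~0.98 PiB, well under 5 PiB
console.log({ bundleTiB, bundlePiB });
```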

Carohere commented 1 year ago

Agree with @herrehesse. There is a high chance that multiple merged datasets will cause problems in both notary due diligence (DD) and later allocation tracking. We need to reach a consensus on applying quotas based on the size of each individual dataset.

kernelogic commented 1 year ago

Before consensus is reached, and looking at the currently existing LDNs, I would say we should allow it until consensus says otherwise :D

laurarenpanda commented 1 year ago

Before consensus is reached, and looking at the currently existing LDNs, I would say we should allow it until consensus says otherwise :D

Couldn't agree more. There are LDNs in the program that follow all Fil+ rules, and now we have the Checker Bot to track deal flow. Before any rule change, we should allow this kind of LDN to keep working.

Carohere commented 1 year ago

I don't think so. After the Checker Bot went live, we found that the community was facing massive amounts of DataCap abuse. The bot only performs compliance checks once allocation begins; it doesn't and cannot block all violations. KYC for datasets, applicants, and applications still relies mostly on notaries. I can't think of any reason why we should allow multiple, scattered datasets to request excessive amounts of DataCap. Separate applications for individual datasets would clearly enhance transparency.

Carohere commented 1 year ago

By submitting merged requests, applicants could potentially receive more DataCap than they would if they submitted individual requests. This could lead to an unfair distribution of resources within the Filecoin+ program. Without the ability to easily verify each individual dataset, it may be more difficult for the community to track and identify potential abuse of the Filecoin+ program.

This is exactly the reason why we should do so.

laurarenpanda commented 1 year ago

Separate applications for individual datasets could achieve what you want. But even if we do the whole KYB process as thoroughly as possible, we couldn't find DataCap abuse before we saw the Checker report. For now, the best method is to allow an LDN its first-week DataCap, check the report, and cut losses in time.
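
A minimal sketch of that "first-week tranche, then check the report" idea, with invented thresholds and report fields rather than the actual LDN bot rules:

```ts
// Hypothetical tranche gate: only support the next DataCap allocation if the
// checker report for the previous tranche looks healthy. The report shape and
// thresholds below are illustrative assumptions, not the real bot's logic.
interface CheckerReport {
  duplicateDataPercent: number;   // share of duplicated CIDs
  providerCount: number;          // distinct SPs that sealed the data
  retrievalSuccessPercent: number;
}

function shouldGrantNextTranche(report: CheckerReport): boolean {
  if (report.duplicateDataPercent > 20) return false;    // likely padding/abuse
  if (report.providerCount < 3) return false;            // too concentrated
  if (report.retrievalSuccessPercent < 50) return false; // data not retrievable
  return true;
}

// Example: cut losses after the first-week tranche if the report fails.
const firstWeek: CheckerReport = {
  duplicateDataPercent: 35,
  providerCount: 2,
  retrievalSuccessPercent: 10,
};
console.log(shouldGrantNextTranche(firstWeek)); // false -> stop further DataCap
```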

Carohere commented 1 year ago

For now, the best method is to allow an LDN its first-week DataCap, check the report, and cut losses in time.

@laurarenpanda a very simple question: do you want to stop scammers at 1PiB or 100TiB?

laurarenpanda commented 1 year ago

I think you're missing the fundamental goal of FIL+. Another question from a different angle: do you want to onboard valuable data onto Filecoin at 1 PiB or 100 TiB per day?

IMO, we also need a reputation system for Clients to accumulate credibility and allow Clients with long-term good behavior to apply for more DC.
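
A minimal sketch of what such a client reputation score might look like; the weights, fields, and cap formula below are invented for discussion, not an existing Fil+ mechanism:

```ts
// Hypothetical client reputation: past LDNs with clean checker reports raise
// the score, flagged ones lower it; the score then scales the next request.
interface PastLdn {
  datacapUsedTiB: number;
  flaggedByChecker: boolean; // e.g. duplicate data or CID-sharing findings
}

function reputationScore(history: PastLdn[]): number {
  let score = 0;
  for (const ldn of history) {
    score += ldn.flaggedByChecker ? -2 : 1; // invented weights
  }
  return Math.max(0, score);
}

// Illustrative policy: cap the next allocation by accumulated reputation.
function maxNextRequestTiB(history: PastLdn[]): number {
  const baseTiB = 100; // first-time clients start small
  return baseTiB * (1 + reputationScore(history));
}
```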

herrehesse commented 1 year ago

@Carohere

"I can't think of any reason why we should allow multiple, scattered datasets to request excessive amounts of Dacap."

I completely agree with your point. Enforcing separate applications for individual datasets will enhance transparency and prevent the merging of datacap requests, which can lead to datacap abuse. It's crucial to maintain a fair and balanced allocation of datacap and prevent any misuse of the system.

Carohere commented 1 year ago

I think you're missing the fundamental goal of FIL+. Another question from a different angle: do you want to onboard valuable data onto Filecoin at 1 PiB or 100 TiB per day?

I would choose 100 TiB, because real data has no problem getting support from notaries; step by step, it will eventually be onboarded to the network. But AFAIK, sealed DC will not be revoked at the moment, even if the application is found to be fake after the first or subsequent rounds.

I fully understand your concerns about efficiency, though. But let's remember that 100 TiB is nothing small. It was once the community guideline, which I tend to think is a fair range. https://filecoinproject.slack.com/archives/C01DLAPKDGX/p1661450182420319

dkkapur commented 1 year ago

This makes a lot of sense in one very specific case: individual datasets being cherry-picked by a single data onboarder that is not the data owner. I support this change for that particular case. We have a lot of noise with open datasets right now, and this would help keep things cleaner in the future.

Every time there is an aggregation or an intermediary "service" in between, whether an aggregator (like web3.storage or Estuary), an individual, or a dedicated program with funding like Slingshot, the Venus accelerator, etc., this will become unnecessarily complex. However, holding these (and the above) to a high bar as well, i.e., asking why you are picking what you're picking and what you are doing uniquely, is still useful and relevant (but maybe a separate topic).

laurarenpanda commented 1 year ago

@dkkapur Your point is quite valid: we need to consider different situations. There is a key question I'd like your advice on. Take FileDrive Datasets as an example: it aggregates a number of public datasets from sources like AWS and applies for DataCap allocation as a single use case. In this situation, is acting as an aggregation reasonable and acceptable?

herrehesse commented 1 year ago

@laurarenpanda No, you should apply for all of them separately. You are not the data owner, nor an onboarding tool like Estuary.