filecoin-project / notary-governance


v5 Notary Allocator Application: Open Public Dataset Pathway #996

Closed kevzak closed 5 months ago

kevzak commented 8 months ago

v5 Notary Allocator Application

To apply to be an allocator, organizations will submit one application for each proposed pathway to DataCap. If you will be designing multiple specific pathways, you will need to submit multiple applications.

Please complete the following steps:

1. Fill out the information below and create a new GitHub Issue

  1. Notary Allocator Pathway Name (This can be your name, or the name of your pathway/program. For example "E-Fil+"): Open Public Dataset Pathway
  2. Organization Name: Filecoin Incentive Design Lab (FIDL)
  3. On-chain address for Allocator (Provide a NEW unique address. During ratification, you will need to initialize this address on-chain): f1v24knjbqv5p6qrmfjj5xmlaoddzqnon2oxkzkyq
  4. Country of Operation (Where your organization is legally based): United States
  5. Region of Operation (What region will you serve?): All regions
  6. Type of Allocator, diligence process (Automated/programmatic, Market-based, or Manual (human-in-the-loop at some phase)): Manual
  7. DataCap requested for allocator for 12 months of activity (This should be an estimate of overall expected activity. Estimate the total amount of DataCap you will be distributing to clients in 12 months, in TiB or PiB): 100 PiB

2. Access allocator application (download to save answers)

Click link below to access a Google doc version of the allocator application that can be used to save your answers if you are not prepared to fully submit the application in Step 3. https://docs.google.com/document/d/1-Ze8bo7ZlIJe8qX0YSFNPTka4CMprqoNB1D6V7WJJjo/copy

3. Submit allocation application

Click link below to access the full allocator questionnaire and officially submit your answers: https://airtable.com/appvyE0VHcgpAkt4Z/shrQxaAIsD693e1ns

Note: Sections of your responses WILL BE posted back into the GitHub issue tracking your application. The final section (Additional Disclosures) will NOT be posted to GitHub, and will be maintained by the Filecoin Foundation. Application information for notaries not accepted and ratified in this round will be deleted.

Kevin-FF-USA commented 7 months ago

Wanted to let you know this application has been received. Once you complete and submit the included Airtable form, the public answers will be posted in a thread below. If you have any questions, please let me know.

ghost commented 7 months ago

Basic Information

1. Notary Allocator Pathway Name: Public Open Dataset Pathway

2. Organization: Data Preservation Institute

3. On Chain Address for Allocator: f1v24knjbqv5p6qrmfjj5xmlaoddzqnon2oxkzkyq

4. Country of Operation: United States

5. Region(s) of operation: South America, North America, Oceania, Europe, Greater China, Asia minus GCR, Africa, Japan

6. Type of Allocator: Manual

7. DataCap requested for allocator for 12 months of activity: Our estimate comes from the following information and assumptions. Over the last 8 weeks (Oct-Dec 2023) in the LDN pathway, an average of 13 PiB was onboarded per week, with approximately 15 applicants accounting for 80% of the weekly onboarding. Taking those 15 applicants from last week as a sample, we estimate that 10% will meet Fil+ guidelines under the new AC Bot compliance checks. We hope to coach or encourage 20-30% of those clients to change onboarding habits, meet standards, and use this allocator (onboarding with SPs across geopolitical regions being the main change required). Additionally, we estimate interest from another 10-20% of new clients/SPs. So, of the current 13 PiB per week, we estimate supporting 50% of the weekly onboarders: 0.5 * 13 PiB = 6.5 PiB per week average onboarding. Based on these calculations, across 12 months we will likely request 330 PiB as a conservative estimate. We'd like to revisit each quarter to review demand and potentially re-estimate needs.
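The arithmetic behind this estimate can be checked directly. A minimal sketch using only the figures stated above (13 PiB/week baseline, a 50% supported share, 52 weeks):

```python
# Back-of-the-envelope DataCap estimate, using only the figures stated above.
WEEKLY_ONBOARDING_PIB = 13   # average weekly onboarding in the LDN pathway (Oct-Dec 2023)
SUPPORTED_SHARE = 0.5        # fraction of weekly onboarders expected to use this allocator
WEEKS_PER_YEAR = 52

weekly_estimate = WEEKLY_ONBOARDING_PIB * SUPPORTED_SHARE   # 6.5 PiB/week
annual_estimate = weekly_estimate * WEEKS_PER_YEAR          # 338.0 PiB/year

print(f"{weekly_estimate} PiB/week -> {annual_estimate} PiB/year")
# -> 6.5 PiB/week -> 338.0 PiB/year; rounded down to the 330 PiB conservative request
```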

8. Is your allocator providing a unique, new, or diverse pathway to DataCap? How does this allocator differentiate itself from other applicants, new or existing?: Our allocator is providing a pathway for any open public datasets to be onboarded to Filecoin. This pathway is only for datasets that are made readily retrievable on the network and can be regularly verified (through the use of manual or automated verification that includes retrieving data from various SPs over the course of the DataCap allocation timeframe). Enterprise (private/encrypted) dataset use cases will be guided to use a different pathway. This pathway framework is based on the LDN (Large Data Notary) pathway that existed before 2024.

9. As a member in the Filecoin Community, I acknowledge that I must adhere to the Community Code of Conduct, as well other End User License Agreements for accessing various tools and services, such as GitHub and Slack.: Acknowledge

Client Diligence

10. Who are your target clients?: Enterprise Data Clients, Small-scale developers or data owners, Individuals learning about Filecoin, Other (specified above)

11. Describe in as much detail as possible how you will perform due diligence on clients. If you are proposing an automated pathway, what diligence mechanism will you use to determine client eligibility?: We will manually vet all client applicants upfront to confirm who they are, what data they are onboarding, how they will prepare the data, and which SPs will be involved in onboarding copies of the data. Clients will be required to apply using the following GitHub application form: (LINK SOON), which contains questions on the client role, data preparation, financing, dataset details, and storage provider distribution plan. All responses will be reviewed and assessed against the Fil+ guidelines for open data storage. Specifically:

- Client: confirmation of the applicant and their connection to the dataset.
- Data Preparation: who is preparing the data and what tool(s) are used.
- Retrievability: data must be readily retrievable on the network and regularly verifiable (through manual or automated verification that includes retrieving data from various SPs over the course of the DataCap allocation timeframe).
- Distribution: onboarding spread across entities and geopolitical locations.

The application responses will be made public in the GitHub repo, and all communication between the client and the allocator team will take place in comments on the application.

GitHub ID: Due diligence will also involve confirming usage associated with the client's GitHub ID, enabling applicants to add a layer of trust and ultimately use one GitHub ID to build a reputation as a good actor over time.

New User Check: The first checks completed on each application by the allocator: Is this a completely new GitHub ID (less than 2 months old)? Is this the first time this GitHub ID has applied for DataCap in this or another pathway? If yes to either, applicants will have a maximum DataCap allowance for their first application.

Client Check (KYC): Additionally, all applicants will be asked if they are willing to complete a free KYC check to confirm they are a human user (the process, completed via a third-party app, is explained in detail in #13 below). Clients can also choose another method to prove the applicant's identity (which must first be vetted by the allocator team and made public for transparency). If they decline the check, they will be significantly limited in the maximum amount of DataCap they can request unless they can provide other forms of client identification.

Future areas of development and POCs, i.e. other forms of KYC to be considered: Can you provide a set of information with sufficient proofs? Are you a client with a known prior reputation in the community? Are you a client with a known non-Filecoin brand (like CERN, Microsoft, Disney, etc.)? Do you have a sponsor within the Filecoin community with a strong reputation, ideally part of FF or PL? Can you provide reference letters from at least 3 existing notaries who will take responsibility if a client they support shows behavior deemed abusive? Additionally, in the future we would like to consider small-scale automation using quantifiable diligence metrics such as GitHub ID KYC and history, and staking.

12. Please specify how many questions you’ll ask, and provide a brief overview of the questions.: As mentioned above, we will manually vet all client applicants upfront to confirm who they are, what data they are onboarding, how they will prepare the data and which SPs will be involved in onboarding copies of the data. Our application has 23 questions and will be the same template as the current LDN application. See link: https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/new/choose

13. Will you use a 3rd-party Know Your Client (KYC) service?: Yes, all applicants will be presented with the option to complete a free third-party KYC check. We have integrated a third-party KYC app, Togggle.io, into our application form. Togggle works across over 190 countries to validate identification (KYC). See more here: https://www.togggle.io/ Togggle solves the problem of validating a unique human user behind each GitHub ID without exposing user information. With our solution design, users are asked to validate their ID and liveness (KYC), but their submitted ID information is then encrypted, stored in a decentralized manner across servers, and never shared publicly in our GitHub repo. Members of the allocator team also do not have direct access to the client information. The only public record is a list of GitHub IDs that have passed the KYC check, which can be found here: https://filplus.storage/api/get-kyc-users Once a user is verified, their GitHub account receives a 'KYC verification' label that we will use as a layer of trust on the account. To date, we have invested $9,000 in this integration, and KYC has been completed by 75 users. The KYC check cost is currently covered by the allocator team; however, in the future we may transition to charging for new checks ($3-5 each). We are also open to using other forms of KYC to support client use cases: clients can submit ideas or products they are willing to test/pay for, and the allocator team will review them.
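As a sketch of how a reviewer or bot might consume the public KYC list mentioned above, the snippet below checks whether a GitHub ID appears in the list served by that endpoint. The URL is the one given in the answer; the response shape (a JSON array of ID strings) is an assumption, and the membership check is separated out so it can be exercised without a network call.

```python
import json
from urllib.request import urlopen

# URL from the answer above; the JSON shape (array of GitHub ID strings) is an assumption.
KYC_LIST_URL = "https://filplus.storage/api/get-kyc-users"

def fetch_kyc_ids(url: str = KYC_LIST_URL) -> list[str]:
    """Fetch the public list of GitHub IDs that have passed KYC."""
    with urlopen(url) as resp:
        return json.load(resp)

def has_passed_kyc(github_id: str, kyc_ids: list[str]) -> bool:
    """Case-insensitive membership check against the published KYC list."""
    return github_id.lower() in {i.lower() for i in kyc_ids}
```

Usage would look like `has_passed_kyc("some-applicant", fetch_kyc_ids())`; keeping the fetch and the check separate makes the logic testable offline.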

14. Can any client apply to your pathway, or will you be closed to only your own internal clients? (eg: bizdev or self-referral): Any client can apply to our pathway and they can discover and learn more about us on filplus.storage. We also hope to have links and marketing on other sites, such as filecoin docs

15. How do you plan to track the rate at which DataCap is being distributed to your clients?: Datacapstats.io will be connected to our GitHub repo and will track all DataCap distribution information in real time, as well as monitoring chain messages from our notary address. We are also creating new dashboard specs to help showcase the health of our allocator pathway, such as better snapshot metrics on the number of clients approved, time to DataCap, bot health, and more.

Data Diligence

16. As an operating entity in the Filecoin Community, you are required to follow all local & regional regulations relating to any data, digital and otherwise. This may include PII and data deletion requirements, as well as the storing, transmit: Acknowledge

17. What type(s) of data would be applicable for your pathway?: Public Open Commercial/Enterprise, Public Open Dataset (Research/Non-Profit)

18. How will you verify a client’s data ownership? Will you use 3rd-party KYB (know your business) services to verify enterprise clients?: We will facilitate two options for the applicable types of data: public open and public commercial datasets. If it is a Public Open Dataset: We will review the dataset web links and samples included in the application. We will ask the client to confirm that they have the right to store the dataset. We will check data ownership rights, looking for various open licensing standards such as those listed here: https://standards.theodi.org/introduction/what-are-open-standards-for-data/

If it is a Public Open commercially owned dataset, we will ask: Are you the data owner, or are you applying on behalf of the data owner? They can prove this by completing a business (KYB) check using a third-party integration, https://efilplus.synaps.me/signup ($100 per check). This option has been in use within Fil+ for one year; over 20 clients have attempted KYB and 10 have successfully completed the check. We’ve invested $3,000 in this integration to date. If this option doesn’t work, clients can suggest other KYB third-party apps; as an allocator, we are willing to vet and consider approving other third parties that meet our due diligence requirements. Alternatively, the client, the data owner, and a member of the allocator team can hold a virtual meeting to review the dataset, confirm ownership (proof of employment, employer signoff, sharing the business license), and validate that storage of the data by the client/applicant is approved and a contract is in place. If requested by a client, we will use non-disclosure agreements to collect the required information on clients and data owners while maintaining their privacy.

19. How will you ensure the data meets local & regional legal requirements?: The client application will include a question asking the applicant to confirm they are legally able to represent and store the data in question. This includes asking clients to attest that they are familiar with the local & regional requirements that apply to themselves and any SPs they intend to transact with.

20. What types of data preparation will you support or require?: There is no specific or single data prep tool required. The expectation is that data is properly packed, indexed, and retrievable. We will promote and encourage usage of data prep tooling built by Protocol Labs teams and network partners, for example Singularity and web3.storage. If a data preparer is not using a known tool, they can fully describe the preparation process in their application; it will be reviewed at subsequent allocation checks to validate that the tooling meets expectations.

21. What tools or methodology will you use to sample and verify the data aligns with your pathway?: Our pathway allows any type of public open dataset to be onboarded. We ask the client to submit web links and data samples, and we will check that the dataset does not include any offensive or illegal content, for example: sexually explicit content; images of child sexual abuse; footage of real or simulated violence, criminal activity, or accidents from video clips, games, or films; content that advocates a terrorist act; content instructing or promoting crime or violence; content promoting racism and hate speech. We may sample portions of data before initial allocations, as well as perform ongoing data sampling. This may mean SPs are required to store a hot copy to meet client demands; this is still an ongoing discussion, and we are investigating tools and costs. Additionally, after data is stored, we are investing in more automated tooling that can help retrieve, sample, and investigate client data (we may use Spark, for example).

Data Distribution

22. How many replicas will you require to meet programmatic requirements for distribution?: 2+

23. What geographic or regional distribution will you require?: Current Fil+ guidelines call for three locations. Because we are asking for 2+ replicas, we will ask clients to include at least two physical locations, each in a separate geopolitical region. We ask clients to list their SP partners in the application and will check for 2 in different geopolitical regions. If the check fails, we will ask the client to update their application with more information about their storage plan until the guidelines are met.

24. How many Storage Provider owner/operators will you require to meet programmatic requirements for distribution?: 2+

25. Do you require equal percentage distribution for your clients to their chosen SPs? Will you require preliminary SP distribution plans from the client before allocating any DataCap?: Yes, clients will need to manage SP distribution plans and ensure distribution stays equal (if only 2 SPs), within the following guidelines: one storage provider miner ID cannot store more than one copy; a storage provider owner/operator should not be storing more than 20% duplicate data. Clients are required to submit their SPs upfront. If a client's plan differs from these guidelines, they will need to clearly map their distribution plan upfront. All information is collected in our application process and stored in GitHub.
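The distribution rules in answers 22-25 are mechanical enough to express as a simple check. The sketch below is an illustration only: the `SPPlan` structure and field names are hypothetical, but the thresholds (2+ replicas, 2+ owner/operators, 2+ geopolitical regions, one copy per miner ID, 20% duplicate-data cap) are those stated in the answers above.

```python
from dataclasses import dataclass

@dataclass
class SPPlan:
    miner_id: str         # storage provider miner ID
    operator: str         # owner/operator entity
    region: str           # geopolitical region
    copies: int           # replicas this miner ID will store
    duplicate_pct: float  # share of duplicate data held by this operator

def check_distribution(plans: list[SPPlan]) -> list[str]:
    """Return a list of guideline violations (empty list = compliant).
    Thresholds come from the application answers; the data model is illustrative."""
    problems = []
    if len({p.operator for p in plans}) < 2:
        problems.append("fewer than 2 distinct owner/operators")
    if len({p.region for p in plans}) < 2:
        problems.append("fewer than 2 geopolitical regions")
    if sum(p.copies for p in plans) < 2:
        problems.append("fewer than 2 replicas planned")
    for p in plans:
        if p.copies > 1:
            problems.append(f"{p.miner_id}: one miner ID may store only one copy")
        if p.duplicate_pct > 0.20:
            problems.append(f"{p.miner_id}: duplicate data above 20%")
    return problems
```

A plan with two operators in two regions, one copy each, passes; a single miner ID holding both copies fails several rules at once.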

26. What tooling will you use to verify client deal-making distribution?: We will use the CID Checker bot developed by the Protocol Labs team. Link to main repo: https://github.com/data-preservation-programs/filplus-checker-assets/tree/main The CID Checker bot reviews on-chain information and looks at: storage provider distribution; deal data replication; deal data shared with other clients. The CID bot is part of the larger AC Bot (Aggregate and Compliance Bot) and will automatically run against all applications on a weekly basis. We will follow guidance from the AC Bot when deciding whether to approve or deny subsequent allocations.

27. How will clients meet SP distribution requirements?: Our allocator pathway prioritizes clients who present information and make clear, provable claims about their plan for distributed storage across multiple storage provider owner/operator entities and locations, ensuring compliance with the Fil+ guidelines. To enable client success with this process, we will market vetted SPs through a marketplace tool we will create (GitHub page coming soon), where storage providers can complete KYC/KYB upfront to confirm who they are (entity), their miner IDs, and their locations; afterwards, only the SP miner ID and location information will be available for clients to search and match against their requirements. Initially, onboarding and vetting SPs will be a manual review process completed by the team; however, we are also investigating the use and cost of network monitoring tooling that would provide additional information about SP IP locations and could be automated to check and validate locations. If a client does not intend to use SPs from the vetted SP marketplace or a vetted Protocol Labs network tool (example: SPADE), then they will be required to provide additional KYB on the SPs they will use to onboard data in order to get additional allocations approved. Examples include: business license, proof of datacenter address.

28. As an allocator, do you support clients that engage in deal-making with SPs utilizing a VPN?: Utilization of VPN is an acceptable practice. However, information about SP entities and locations distribution will be required regardless of VPN usage.

DataCap Allocation Strategy

29. Will you use standardized DataCap allocations to clients?: No, client specific

30. Allocation Tranche Schedule to clients: Each application will have its GitHub ID assessed to confirm whether it is a new GitHub ID (less than 2 months old) or a first-time user of the allocator. If so, it will follow the first-time allocation schedule below. First Time User Allocation Schedule: Did you complete the third-party KYC check or another form of KYC? If yes, the client becomes eligible to receive up to 50 TiB of DataCap. If no, the maximum they may receive at any time is 10 TiB of DataCap.

For users utilizing a GitHub ID older than 2 months who have successfully onboarded public open datasets in the LDN pathway (before 2024):

Trusted User Allocation Schedule: If a user has successfully onboarded a dataset using the first-time allocation schedule OR is a trusted GitHub ID user, AND has completed the third-party KYC check or another form of KYC, they become eligible to apply for up to 5 PiB of DataCap. If they have onboarded a dataset successfully in the past but did not complete (or chose not to complete) any form of KYC check, the maximum they may apply for at any time is 10 TiB of DataCap. *Note: if first-time applicants file multiple applications at the same time, only after one is completed will the count be included and increased allocation sizes become available. The allocation schedule for trusted users is: 1st allocation 5%, 2nd allocation 15%, 3rd allocation 30%, 4th allocation 50%.

After successful onboarding as a trusted GitHub ID, users then become eligible to apply for 5PiB+ as needed to meet their demand.
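The caps and tranche percentages above can be summarized in a short sketch. The function names are hypothetical, but the numbers (50 TiB / 10 TiB first-time caps, 5 PiB trusted cap, 5/15/30/50% tranches) are those stated in the schedule:

```python
# Tranche percentages for trusted users, 1st through 4th allocation, per the schedule above.
TRANCHE_PCTS = [0.05, 0.15, 0.30, 0.50]

def max_request_tib(is_first_time: bool, kyc_done: bool) -> int:
    """Maximum DataCap (TiB) a client may request, per the caps stated above."""
    if is_first_time:
        return 50 if kyc_done else 10
    return 5 * 1024 if kyc_done else 10   # trusted users: up to 5 PiB with KYC, else 10 TiB

def tranche_sizes(total_tib: float) -> list[float]:
    """Split an approved total into the four tranches (5/15/30/50%)."""
    return [round(total_tib * p, 2) for p in TRANCHE_PCTS]
```

For example, a 100 TiB approval would be released as 5, 15, 30, and 50 TiB tranches, summing to the full amount.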

31. Will you use programmatic or software based allocations?: Yes, standardized and software based

32. What tooling will you use to construct messages and send allocations to clients?: We will use existing notary registry tooling at https://filplus.fil.org/#/

33. Describe the process for granting additional DataCap to previously verified clients.: When a client uses more than 75% of their prior DataCap allocation, a request for the next tranche is automatically kicked off (via the subsequent allocation bot). We will set an SLA (Service Level Agreement) to keep up with allocation review and comment on bot messages within 3 days; this could change depending on demand and the number of applications received. Two other things to note about granting DataCap: We will set an expiration date of 3 months on allocated DataCap. The allocation bot already has a built-in stale check that closes applications after 14 days of being idle; that bot will remain in effect, but clients can comment before 14 days to keep an application open or, if it is closed, request that it be reopened as needed. However, from the allocation date we will measure 3 months, and if the allocation has not been used (open or closed status), the application will be closed and the remaining DataCap removed. The expectation when the full amount of DataCap is allocated is that the client has completely finished onboarding their dataset and replicas. If a client closes the application early, they will be asked why. If a client abandons the application and becomes non-responsive, their GitHub ID will be flagged. Checks can also be requested by the allocator team to confirm completion of dataset storage across all replica sites.
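The refresh trigger and expiry policy in this answer reduce to two small predicates. A sketch (function names are illustrative; the 75% threshold and the roughly 3-month expiry, approximated as 90 days, come from the answer above):

```python
from datetime import date, timedelta

REFRESH_THRESHOLD = 0.75     # next tranche kicks off above 75% usage
EXPIRY = timedelta(days=90)  # ~3-month expiry on allocated DataCap (90-day approximation)

def needs_next_tranche(used_tib: float, allocated_tib: float) -> bool:
    """True once a client has used more than 75% of the prior allocation."""
    return allocated_tib > 0 and used_tib / allocated_tib > REFRESH_THRESHOLD

def allocation_expired(allocated_on: date, today: date) -> bool:
    """True once an unused allocation passes the expiry window."""
    return today - allocated_on > EXPIRY
```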

34. Describe in as much detail as possible the tools used for: • client discoverability & applications • due diligence & investigation • bookkeeping • on-chain message construction • client deal-making behavior • tracking overall allocator health • disput:
• due diligence & investigation: notary registry and GitHub
• bookkeeping: JSON and GitHub
• on-chain message construction: TBD
• client deal-making behavior: Datacapstats.io
• tracking overall allocator health: Datacapstats.io
• dispute discussion & resolution: Google Form and Zoom/Slack
• community updates & comms: notary governance call and Slack

Tools and Bookkeeping

35. Will you use open-source tooling from the Fil+ team?: As the team that developed most of the open-source tooling used today for this pathway, we will continue to utilize these tools and iterate as necessary.

36. Where will you keep your records for bookkeeping? How will you maintain transparency in your allocation decisions?: Public: - In the GitHub applications, KYC check approvals are automatically linked from the Togggle.io database to a list at https://filplus.storage/api/get-kyc-users which is linked to the GitHub application repo. No personal information is shared from Togggle to GitHub. - For KYB checks, we will provide manual updates in the comments regarding clients' completion of required application due diligence checks. - For SP entity and location verification, only the miner ID, entity name, and location will be shared in comments. - Overall, any comments made by the allocator team will not include personal information such as client names or emails, so as not to open users up to potential spamming.

Private: - KYC personal information is kept in a third-party (Togggle.io) database. A record of the GitHub users that have completed KYC is automatically pulled from the Togggle database via API to https://filplus.storage/api/get-kyc-users ; no personal information is shared from Togggle to GitHub. Anyone can see the list of IDs that have completed a check successfully. - KYB personal and business information is kept in a third-party (Synaps.io) database. Only the allocator team can log in to a dashboard and confirm completion of the KYB checks per application. In the future, we may set up an automatic API call from Synaps directly to GitHub, similar to the KYC process, to keep all information private and pass only a completion message to GitHub. - For video conference due diligence calls with a client and/or data owner, we will keep a digital record of the call, who participated, and any key notes in a document available only to members of the allocator team. This information will be stored in a team drive for up to 2 years.

- If a community member requests proof that a KYC, KYB, SP verification, or video-based client/business due diligence check took place, we will proactively provide KYC/KYB and allocator drive folder logins to the Filecoin Foundation team so they can conduct audits as needed to confirm the information.

Risk Mitigation, Auditing, Compliance

37. Describe your proposed compliance check mechanisms for your own clients.: After each allocation, we will manually review the applicant's on-chain deal-making activity to confirm compliance, relying mostly on the AC Bot, which runs weekly, to identify non-compliance in deal making, distribution, and retrievals; that information will drive action on applications. The bot will be set up to automatically close applications after several allocations if thresholds are not met. If at any point clients are caught providing fake or misleading information about themselves or their SP partners, we will close any open applications, record the GitHub user IDs and miner IDs involved, and block them from future participation in the allocator. We'll track and audit DataCap distribution by looking at usage across our dashboards, watching for anomalies in onboarding rates or other trends that might signal abusive behavior. Regarding new-client tolerance, we've set up processes to limit the DataCap available to new applicants and new GitHub IDs, especially on their first application, plus a KYC process that lets clients add a layer of trust and access more DataCap. After a successful onboarding, clients using the same GitHub user ID become eligible for more DataCap on subsequent applications.

38. Describe your process for handling disputes. Highlight response times, transparency, and accountability mechanisms.: For disputes between our allocator and client, hereby termed appeal(s), we will source the appeals through the Open Data Allocator Appeals Form where all our clients can submit an appeal and someone on the team will address it with a 14 day SLA. We would like to respect the privacy of the client and do not plan to host a public resolution process. For disputes raised by community members/non-clients about our allocation approach and strategy, we will comply with the public dispute tracker that is being built by the Filecoin Foundation Governance team. We can commit an SLA for such disputes to be 21 days.

39. Detail how you will announce updates to tooling, pathway guidelines, parameters, and process alterations.: We’ll transparently present updates to tooling, guidelines, parameters and process alterations before they happen. We’ll document all proposed changes in an issue in our repo and share in designated slack channels and also bring to community governance calls as needed to present and receive feedback before any changes are made.

40. How long will you allow the community to provide feedback before implementing changes?: We’ll allow any feedback for 1-2 weeks prior to implementing proposed changes. Community members can submit comments on the proposed issues in our repo. Depending on the weight and impact of a proposed change on the community, we will review all comments and feedback and decide if a soft consensus is needed and request community members to weigh in.

41. Regarding security, how will you structure and secure the on-chain notary address? If you will utilize a multisig, how will it be structured? Who will have administrative & signatory rights?: We will utilize a multisig: 2 people from the entity will hold separate Ledger hardware wallets and will be signers for each allocation.

42. Will you deploy smart contracts for program or policy procedures? If so, how will you track and fund them?: Not at this time, perhaps with future iterations we will introduce this feature

Monetization

43. Outline your monetization models for the services you provide as a notary allocator pathway.: Currently there is no plan to monetize our allocator. We are funded in the near term, but the strategy and monetization plan could change in the future.

44. Describe your organization's structure, such as the legal entity and other business & market ventures.: Delaware Corporation: Filecoin Data Preservation Foundation, 1111B S Governors Ave #7426, Dover, DE 19904

45. Where will accounting for fees be maintained?: N/A

Past Experience, Affiliations, Reputation

46. If you've received DataCap allocation privileges before, please link to prior notary applications.: N/A

47. How are you connected to the Filecoin ecosystem? Describe your (or your organization's) Filecoin relationships, investments, or ownership.: Members of our entity were previously members of the Data Programs team in Protocol Labs before nucleation. That included supporting the governance team as operational and tooling resources to run the LDN and E-FIl+ pathways.

48. How are you estimating your client demand and pathway usage? Do you have existing clients and an onboarding funnel?: As mentioned in question 7, we estimate a need for ~330 PiB of DataCap in 2024. Our estimate comes from the following assumptions and information. *Our main assumption is that many existing clients from the LDN pathway will migrate to our pathway. Over the last 8 weeks (Oct-Dec 2023) in the LDN pathway, an average of 13 PiB was onboarded per week, with approximately 15 applicants accounting for 80% of the weekly onboarding. Taking those 15 applicants as a sample, we estimate that 10% will meet Fil+ guidelines under the new AC Bot compliance checks. We hope to coach or encourage 20-30% of those clients to change onboarding habits, meet standards, and use this allocator (onboarding across geopolitical regions being the main change required). Additionally, we estimate interest from another 10-20% of new clients/SPs. So, of the current 13 PiB per week, we estimate supporting 50% of weekly onboarders: 0.5 * 13 PiB = 6.5 PiB per week average onboarding.

galen-mcandrew commented 5 months ago

Datacap Request for Allocator

Address

f2gf2at3tv5mtv7cbkq6lh3ptgzz4aju5fweciyaq

Datacap Allocated

5PiB

filplus-bot commented 5 months ago

The request has been signed by a new Root Key Holder

Message sent to Filecoin Network

bafy2bzacea3a3pt2lo3hnxm3suwbcgn7jmomkuuy4vxulfn5zqnl3c7ksepte

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea3a3pt2lo3hnxm3suwbcgn7jmomkuuy4vxulfn5zqnl3c7ksepte