filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Protocol Labs - Slingshot v3 #478

Closed dkkapur closed 1 year ago

dkkapur commented 2 years ago

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

The Slingshot program started as a collaborative-competitive community program to accelerate the storage of real, valuable open data on the Filecoin network, rewarding participants along the way. Slingshot v3 is the next major iteration of the program, intended to store open datasets from a variety of sources and to apply the learnings from Slingshot Restore and Evergreen to emphasize the permanence of the data onboarded.

Protocol Labs has supported the Slingshot competition and will be the primary owner of the Slingshot deal engine. Protocol Labs is an open-source R&D lab founded in 2014. Software developed by PL includes IPFS, Filecoin (the Lotus implementation), libp2p, drand, IPLD, and more.

What is the primary source of funding for this project?

PL will be funding the development and operations of the Slingshot program and deal engine.

What other projects/ecosystem stakeholders is this project associated with?

The main stakeholders for this project include participants in the program (data preparers, storage providers), data owners that are interested in having their open datasets mirrored onto the decentralized web, and the Filecoin Foundation / FFDW.

Use-case details

Describe the data being stored onto Filecoin

Each of the Slingshot datasets comes from a reputable organization whose licensing allows for public storage of the data. For the first phase of v3, we will replicate datasets from the Registry of Open Data on AWS, Common Crawl, and the Sloan Digital Sky Survey.

The full list of datasets that were eligible in past phases of Slingshot can be found [here](https://github.com/filecoin-project/slingshot/blob/master/datasets.md). The [Slingshot Data Explorer](https://slingshot.filecoin.io/explore) also lists datasets stored in past phases.

Where was the data in this dataset sourced from?

The original data was sourced directly from the organizations that own it or from mirrors they support (e.g., the Registry of Open Data on AWS - https://registry.opendata.aws/).

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

- Foldingathome COVID-19 dataset - https://registry.opendata.aws/foldingathome-covid19/ 
- Common Crawl datasets - https://commoncrawl.org/the-data/get-started/

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Yes, these are licensed for public access.

What is the expected retrieval frequency for this data?

We will be sampling retrievability weekly. We also expect some demand from researchers and developers for this data in the coming years. 

To keep stable mirrors of this dataset on the network, future replication will require cross-SP retrievals in order to onboard additional replicas with different SP organizations and in different datacenters / regions across the world.

Therefore, we expect the data to be retrieved somewhat frequently, 1-2 times per week.

For how long do you plan to keep this dataset stored on Filecoin?

Ideally forever! With takeaways from evergreen.filecoin.io and upcoming tools like the FVM, we'd like to build a long-term mirror of these datasets on the network.

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

The Slingshot deal engine plans to distribute up to 10 replicas for each piece of data prepared by Data Preparers, with the following distribution requirements for replicas:

- No more than one replica per city / datacenter
- No more than three replicas per country
- No more than four replicas per continent

How will you be distributing your data to storage providers? Is there an offline data transfer process?

Data Preparers (DPs) are responsible for getting CAR files to SPs for deal making. This will primarily happen through off-network but over-the-wire data transfer. In some cases, we expect DPs to use a completely offline or on-network data transfer process as well. 

SPs can also obtain relevant pieces through retrievals on the Filecoin network. 

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

We plan for SPs interested in participating to fill out a Slingshot v3 specific [onboarding form](https://docs.google.com/forms/d/e/1FAIpQLSfhyDUF3KS8QjiwOYw1V-FqiyTHgkDCJRQ1snCNnso__5rblA/viewform?usp=sf_link) and go through a lightweight KYC process before they are eligible to receive deals from the deal engine. This includes checking location and miner IDs, and confirming their ability to serve retrievals as required by the program.

Over time, the retrieval sampling we do will also be made available through the Filecoin Reputation System, and retrieval success rate will be used as a measure of reliable performance to enable future dealmaking from the deal engine. SPs that do not serve retrievals as required by the program will not be eligible for future deals.
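As a rough illustration of the gating described above, the sketch below aggregates sampled retrieval attempts into a per-SP success rate and filters for eligibility. The function names, the sample format, and the 80% threshold are all hypothetical, assumed for illustration; they are not the actual reputation system or deal engine API.

```python
from collections import defaultdict

# Hypothetical eligibility threshold; the program's actual requirement may differ.
MIN_SUCCESS_RATE = 0.8

def success_rates(samples):
    """Aggregate (sp_id, succeeded) retrieval samples into per-SP success rates."""
    hits = defaultdict(int)
    total = defaultdict(int)
    for sp_id, ok in samples:
        total[sp_id] += 1
        if ok:
            hits[sp_id] += 1
    return {sp: hits[sp] / total[sp] for sp in total}

def eligible(samples):
    """SPs whose sampled retrieval success rate meets the threshold."""
    return {sp for sp, rate in success_rates(samples).items()
            if rate >= MIN_SUCCESS_RATE}
```

For example, an SP that served every sampled retrieval stays eligible, while one that failed half of its samples would be excluded from future deals until its rate recovers.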

How will you be distributing deals across storage providers?

The deal engine will be available as an API that registered SPs can call to request deal proposals for eligible piece CIDs they would like to store in deals on chain. The goal is to distribute replicas across SP operators and regions, using the following heuristics:

- Max 1 replica per datacenter / city
- Max 3 replicas per Storage Provider organization
- Max 3 replicas per country
- Max 4 replicas per continent
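The heuristics above can be sketched as a simple placement check. This is an illustrative sketch only: the `Replica` type, the `can_place` function, and the cap of 10 total replicas (from the distribution plan earlier in this application) are assumptions, not the deal engine's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Replica:
    sp_org: str       # Storage Provider organization
    datacenter: str   # datacenter / city
    country: str
    continent: str

# Limits as stated in the application (10-replica cap from the distribution plan).
MAX_REPLICAS = 10
MAX_PER_DATACENTER = 1
MAX_PER_SP_ORG = 3
MAX_PER_COUNTRY = 3
MAX_PER_CONTINENT = 4

def can_place(existing: list, candidate: Replica) -> bool:
    """True if adding `candidate` to `existing` replicas keeps every limit intact."""
    if len(existing) >= MAX_REPLICAS:
        return False
    dcs = Counter(r.datacenter for r in existing)
    orgs = Counter(r.sp_org for r in existing)
    countries = Counter(r.country for r in existing)
    continents = Counter(r.continent for r in existing)
    return (dcs[candidate.datacenter] < MAX_PER_DATACENTER
            and orgs[candidate.sp_org] < MAX_PER_SP_ORG
            and countries[candidate.country] < MAX_PER_COUNTRY
            and continents[candidate.continent] < MAX_PER_CONTINENT)
```

For instance, a second replica proposed for the same datacenter is rejected, while one in a new datacenter in a different country passes all four checks.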

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

We expect to launch the deal engine in the second week of July and begin making deals immediately. The program will rely on decentralized data preparation through DPs. If you are interested in participating, please watch this space; I'll share updates on next steps!
large-datacap-requests[bot] commented 2 years ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!
large-datacap-requests[bot] commented 2 years ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

galen-mcandrew commented 2 years ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

750TiB

Client address

f1tefalattwqvw22kstrryh7yoh2blexkl3gs32py

large-datacap-requests[bot] commented 2 years ago

DataCap Allocation requested

Multisig Notary address

f01858410

Client address

f1tefalattwqvw22kstrryh7yoh2blexkl3gs32py

DataCap allocation requested

256TiB

liyunzhi-666 commented 2 years ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecfjo7ghxeulzhrdd53vfx7sxawsxcxw7a3elckakmrh6vaovfwqe

Address

f1tefalattwqvw22kstrryh7yoh2blexkl3gs32py

Datacap Allocated

256.00TiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecfjo7ghxeulzhrdd53vfx7sxawsxcxw7a3elckakmrh6vaovfwqe

kernelogic commented 2 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedpzti4fmfzlljx2zn3eldqdcfuyv7vydfsbuo25m27o5ibih5wde

Address

f1tefalattwqvw22kstrryh7yoh2blexkl3gs32py

Datacap Allocated

256.00TiB

Signer Address

f1yjhnsoga2ccnepb7t3p3ov5fzom3syhsuinxexa

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedpzti4fmfzlljx2zn3eldqdcfuyv7vydfsbuo25m27o5ibih5wde

Sunnyiscoming commented 1 year ago

Are there any problems with using datacap?

xmcai2016 commented 1 year ago

@Sunnyiscoming we were blocked on a dependency service but are getting unblocked. Will start using datacap shortly.

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 14 days, so for now it is being closed. Please feel free to contact the Fil+ Gov team to re-open the application if it is still being processed. Thank you!