filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Speedium - NIH NCBI Sequence Read Archive [2 / 27] #1511

Closed herrehesse closed 1 year ago

herrehesse commented 1 year ago

Data Owner Name

NIH - National Institutes of Health

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://www.nih.gov/

Social Media

https://www.facebook.com/nih.gov/

Total amount of DataCap being requested

5PiB

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1mgnwoczfj25foxn4555wvwyak6rsynzy7z73azy

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Since its launch, the Filecoin network has become an important player in the decentralised storage space, offering a secure and transparent alternative to traditional data storage solutions.

As Speedium / DCENT, we have been storing real and valuable datasets on the Filecoin network since Slingshot 2.6 and have been actively developing tooling to improve the process. We are always on the lookout for new and useful client data to onboard.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

NIH NCBI Sequence Read Archive (SRA) on AWS
The Sequence Read Archive (SRA), produced by the [National Center for Biotechnology Information (NCBI)](https://www.ncbi.nlm.nih.gov/) at the [National Library of Medicine (NLM)](http://nlm.nih.gov/) at the [National Institutes of Health (NIH)](http://www.nih.gov/), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

IPFS, lotus, singularity, others/custom tool

If you answered "other/custom tool" in the previous question, enter the details here

Custom automated database to track on-chain data. A minimal sketch of what such a tracker could look like is below.
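For context, a hedged sketch of such a tracker (the schema and field names here are hypothetical, not the applicant's actual tool):

```python
# Hypothetical sketch of a deal-tracking database; the schema and
# field names are illustrative, not the applicant's actual tool.
import sqlite3

conn = sqlite3.connect("deal_tracker.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS deals (
        piece_cid   TEXT PRIMARY KEY,  -- CommP of the packed CAR file
        payload_cid TEXT NOT NULL,     -- root CID of the original data
        provider_id TEXT NOT NULL,     -- storage provider, e.g. 'f01234'
        deal_id     INTEGER,           -- on-chain deal ID once published
        piece_size  INTEGER NOT NULL   -- padded piece size in bytes
    )
""")
conn.commit()
```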

Please share a sample of the data

https://registry.opendata.aws/ncbi-sra/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America, South America, Europe, Australia (continent)

How will you be distributing your data to storage providers

HTTP or FTP server, IPFS, Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Slack, Big data exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

We do not know the exact providers in advance.

How do you plan to make deals to your storage providers

Boost client, Lotus client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

cryptowhizzard commented 1 year ago

Hi,

Hidde is taking over part of my work here. This one is valid for Speedium / DCENT as a follow-up to my previous requests.

raghavrmadya commented 1 year ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

1PiB

Client address

f1mgnwoczfj25foxn4555wvwyak6rsynzy7z73azy

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f01858410

Client address

f1mgnwoczfj25foxn4555wvwyak6rsynzy7z73azy

DataCap allocation requested

256TiB

Id

10ad0329-858e-4439-ac6c-16108cda3080

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report[^1]

There is no previous allocation for this issue.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

kernelogic commented 1 year ago

Public data. Willing to support.

laff2022 commented 1 year ago

Dear Applicant,

Due to the increased amount of erroneous/abusive Filecoin+ data recently, we feel compelled, on behalf of the entire community, to look more deeply into DataCap requests. This is to ensure that the overall value of the Filecoin network and the Filecoin+ program increases and that the program is not abused.

Please answer the questions below as comprehensively as possible.

Customer data

Could you demonstrate exactly how and to what extent customer contact occurred? We expect that onboarding a customer at the scale of an LDN would have been preceded by multiple email exchanges and perhaps several chat conversations. A single email with an agreement does not qualify here.

Did the customer specify the amount of data involved in this relevant correspondence?

Why does the customer in question want to use the Filecoin+ program?

If this is solely for acquiring DataCap, it is of course out of the question. The customer must have a legitimate reason for wanting to use the Filecoin+ program, which is intended to store useful and public datasets on the network.

Why is the customer data considered Filecoin+ eligible? (As an intermediate solution, Filecoin offers the FIL-E program or the glif.io website for business datasets that do not meet the requirements for a Filecoin+ dataset.)

Files and Processing

Could you please demonstrate to us how you envision processing and transporting the customer data in question to the preparation location?

Would you demonstrate to us that the customer, the preparer and the intended storage providers all have adequate bandwidth to process a set of this size?

Would you tell us how the data preparer prevents duplicates, in order to prevent DataCap abuse?

Hopefully you understand the caution the overall community exercises against onboarding the wrong data. We understand the increased need for Filecoin+; however, we must not allow the program to be misused. Everything depends on a valuable and useful network, so let's do our best to make this happen. Together.

herrehesse commented 1 year ago

We fully understand that extra due diligence is being done on large DataCap requests, given the large amount of erroneous data lately. The focus should remain on keeping fraudulent DataCap requests out.

Customer data

On behalf of Speedium Networks, we have participated in storing useful data since the beginning of the Slingshot program. We have stored multiple datasets, explicitly selected on behalf of the Protocol Labs team to further the value of the network. The datasets in question comprise several important scientific studies, such as cancer research, genome analyses, observations of the galaxy, weather reports, DNA databases and much more.

The dataset in question (NIH NCBI Sequence Read Archive) is publicly available and is stored by us on the Filecoin network in perpetuity, for the benefit of mankind. As a result, there is no customer contact to share.

We are requesting DataCap so that we can store these useful, public and open datasets, accessible to everyone. The current market for paying clients and corporate data (FIL-E) is still developing; by storing useful public data on the network, we can demonstrate the network's opportunities and usability to potential paying (FIL-E) customers.

The reason for using the Filecoin+ program is that these datasets lend themselves perfectly to Filecoin's mission, namely to "store humanity's most important information." We strongly believe that useful information advances the entire Filecoin ecosystem, so we encourage everyone to ensure that the data for which a request is made is actually useful to everyone.

If it is purely a business request for encrypted data, it should go through FIL-E. Scraped data from websites without consent, CAR files created by other data preparers without their consent, and unimportant data such as cooking courses, security footage or completely random images do not fulfill the mission of the Filecoin+ program and should not be accepted.

The exact size of each dataset is known in advance, and based on this size we can determine how much DataCap is needed for the number of replicas required. The exact volumes are visible in the AWS, Azure or Google Cloud buckets where the data is currently stored. A rough calculation is sketched below.
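For illustration only (the dataset size and replica count below are assumptions, not figures from this application), the arithmetic is simply size times replicas:

```python
# Assumed figures for illustration; not taken from the application.
dataset_size_tib = 500   # hypothetical raw dataset size
replicas = 10            # hypothetical number of copies across providers

datacap_needed_tib = dataset_size_tib * replicas
print(f"DataCap required: {datacap_needed_tib} TiB "
      f"(~{datacap_needed_tib / 1024:.2f} PiB)")  # 5000 TiB ≈ 4.88 PiB
```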

Files and Processing

We have established a BGP endpoint in Amsterdam, the Netherlands; from here we download, pack and distribute the above-mentioned datasets. Our bandwidth at this location is 80 Gbps, sufficient to process 0.5 PiB of raw data daily and send it to the selected storage providers (a quick sanity check follows below). We have provisioned 30 PiB of storage capacity and several machines for building CAR files, plus web servers to serve downloads.
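As a rough sanity check of those numbers (assuming the full 80 Gbps line rate with no protocol overhead):

```python
# Back-of-the-envelope check: can 80 Gbps sustain 0.5 PiB per day?
# Assumes full line rate with no protocol overhead.
link_gbps = 80
bytes_per_day = link_gbps / 8 * 1e9 * 86_400  # bytes moved in 24 h
pib = 2 ** 50

print(f"Theoretical maximum: {bytes_per_day / pib:.2f} PiB/day")  # ~0.77
# 0.5 PiB/day therefore needs roughly 65% sustained utilisation.
```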

We take our work as a packing provider for the Filecoin network extremely seriously and do our best to run demos with paying customers. We believe this is for the future benefit of the whole network.

Our download locations, such as AWS, Azure or Google Cloud buckets, tend to have bandwidth between 10 and 40 Gbps. The storage providers we select for data storage vary between 1 and 30 Gbps. We ensure that the larger datasets are distributed only to capable storage providers and monitor them hourly for reachability.

We work with Singularity, which is still under development, and sometimes encounter duplicate deals constructed by this tool. This is mainly because at our scale of 100-500 TiB of daily packing, files can occasionally be processed twice when the data changes in the buckets (for example, the modification date of a file). In addition, our selected storage providers review data before processing, and we recently started running regular duplicate checks on all of our available pools to keep duplicates to an absolute minimum (a sketch of such a check follows below).
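A minimal sketch of such a duplicate check (the file layout and the `piece_cid` key are assumptions, not Singularity's actual export format):

```python
# Hypothetical duplicate check across deal pools; the JSON layout and
# the 'piece_cid' key are assumptions, not Singularity's real format.
import json
from collections import Counter

def find_duplicate_pieces(deal_files):
    """Return piece CIDs that appear in more than one deal record."""
    counts = Counter()
    for path in deal_files:
        with open(path) as f:
            for deal in json.load(f):
                counts[deal["piece_cid"]] += 1
    return {cid: n for cid, n in counts.items() if n > 1}

# Example: duplicates = find_duplicate_pieces(["pool_a.json", "pool_b.json"])
```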

We hope to have sufficiently informed the community with this information. If there are any further questions, please let us know. We are in favor of full transparency and openness. Together we will move Filecoin forward and build the future of Web3.

cryptowhizzard commented 1 year ago

REQUEST MOVED TO: #1554

simonkim0515 commented 1 year ago

Closing due to new application created.

aggregation-and-compliance-bot[bot] commented 9 months ago
Client f062392 does not follow the datacap usage rules. More info here. This application has been failing the requirements for 7 days. Please take appropriate action to fix the following DataCap usage problems.

| Criteria | Threshold | Reason |
| --- | --- | --- |
| Percent of used DataCap stored with top provider | < 75 | The percent of data from the client that is stored with their top provider is 100%. This should be less than 75%. |
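For reference, a hedged sketch of the kind of check the bot applies (the input shape is assumed for illustration; the real checker reads on-chain deal data):

```python
# Sketch of the 75% rule: the share of a client's used DataCap held by
# its single largest provider must stay below 75%.
def top_provider_share(bytes_by_provider):
    total = sum(bytes_by_provider.values())
    return max(bytes_by_provider.values()) / total * 100

# With every byte at one provider the share is 100%, which fails the rule:
print(top_provider_share({"f01234": 256 * 2**40}))  # -> 100.0
```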