filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Speedium - 1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7 [Part 2 / 3] #1516

Closed · herrehesse closed this issue 1 year ago

herrehesse commented 1 year ago

Data Owner Name

International Genome (EMBL-EBI)

Data Owner Country/Region

United Kingdom

Data Owner Industry

Life Science / Healthcare

Website

https://www.internationalgenome.org/category/phase-3/ & https://www.ebi.ac.uk/

Social Media

https://www.facebook.com/EMBLEBI/
https://github.com/topics/1000genomes

Total amount of DataCap being requested

5PiB

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1klrehuwhzfw6boiiywdi7pikztsob4kkieivcwq

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Since its launch, the Filecoin network has become an important player in the decentralised storage space, offering a secure and transparent alternative to traditional data storage solutions.

We as Speedium / DCENT have been engaged with storing real and valuable datasets on the Filecoin network since Slingshot 2.6 and have been actively developing tools to improve the process. We are always on the lookout for new and useful client data to onboard.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

This dataset contains alignment files and short nucleotide, copy number, repeat expansion (STR) and structural variant call files from the 1000 Genomes Project Phase 3 dataset (n=3202) using Illumina DRAGEN v3.5.7b and v3.7.6 software.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

IPFS, lotus, singularity, others/custom tool

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

https://registry.opendata.aws/ilmn-dragen-1kgp/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America, South America, Europe, Australia (continent)

How will you be distributing your data to storage providers

HTTP or FTP server, IPFS, Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Slack, Big data exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

We do not know the exact providers in advance.

How do you plan to make deals to your storage providers

Boost client, Lotus client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

cryptowhizzard commented 1 year ago

Hi,

Hidde is taking over part of my work here. This application is valid for Speedium / DCENT as a follow-up to my previous requests.

raghavrmadya commented 1 year ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

1PiB

Client address

f1klrehuwhzfw6boiiywdi7pikztsob4kkieivcwq

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f01858410

Client address

f1klrehuwhzfw6boiiywdi7pikztsob4kkieivcwq

DataCap allocation requested

256TiB

Id

3aef8c09-d2c6-49fb-97b2-819117833080

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report[^1]

There is no previous allocation for this issue.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

NiwanDao commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceczl2qtkcu4g4cki6hy2ummttxc5as7wdv6blwp523obajajju73w

Address

f1klrehuwhzfw6boiiywdi7pikztsob4kkieivcwq

Datacap Allocated

256.00TiB

Signer Address

f1a2lia2cwwekeubwo4nppt4v4vebxs2frozarz3q

Id

3aef8c09-d2c6-49fb-97b2-819117833080

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceczl2qtkcu4g4cki6hy2ummttxc5as7wdv6blwp523obajajju73w

kernelogic commented 1 year ago

Public data. Willing to support.

kernelogic commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacealmwy3euhujs7rieg5monkv6nj4m4riryg7vi333ciyktwysmdjk

Address

f1klrehuwhzfw6boiiywdi7pikztsob4kkieivcwq

Datacap Allocated

256.00TiB

Signer Address

f1yjhnsoga2ccnepb7t3p3ov5fzom3syhsuinxexa

Id

3aef8c09-d2c6-49fb-97b2-819117833080

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacealmwy3euhujs7rieg5monkv6nj4m4riryg7vi333ciyktwysmdjk

yuyuhe123 commented 1 year ago

Why can you apply for so much DataCap (30 PiB) on your own, and on behalf of different organizations? Are you an administrator of FIL+?

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1516
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1514
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1513
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1512
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1511

laff2022 commented 1 year ago

Dear Applicant,

Due to the recent increase in erroneous Filecoin+ data, we feel compelled, on behalf of the entire community, to scrutinize DataCap requests more deeply. This is to ensure that the overall value of the Filecoin network and the Filecoin+ program increases and is not abused.

Please answer the questions below as comprehensively as possible.

Customer data

Could you demonstrate exactly how and to what extent customer contact occurred? We expect that onboarding a customer at the scale of an LDN would have been preceded by at least multiple emails, and perhaps several chat conversations. A single email with an agreement does not qualify here.

Did the customer specify the amount of data involved in this relevant correspondence?

Why does the customer in question want to use the Filecoin+ program?

Should this be solely for acquiring DataCap, it is of course out of the question. The customer must have a legitimate reason for wanting to use the Filecoin+ program, which is intended to store useful and public datasets on the network.

Why is the customer data considered Filecoin+ eligible? (As an intermediate solution Filecoin offers the FIL-E program or the glif.io website for business datasets that do not meet the requirements for a Filecoin+ dataset)

Files and Processing

Could you please demonstrate how you envision processing and transporting the customer data in question to a location for preparation? Can you show that the customer, the preparer and the intended storage providers all have adequate bandwidth to process a dataset of this size? How does the data preparer prevent duplicates, so as to avoid DataCap abuse?

Hopefully you understand the caution the overall community exercises around onboarding the wrong data. We understand the increased need for Filecoin+; however, we must not allow the program to be misused. Everything depends on a valuable and useful network, so let's do our best to make this happen. Together.

Joss-Hua commented 1 year ago

I remember the upper limit is 500 TiB/week. Why can this LDN exceed that limit? 0_0


herrehesse commented 1 year ago

We fully understand that extra due diligence is being done on large datacap requests given the large amounts of erroneous data lately. The focus should remain on keeping fraudulent datacap requests out.

Customer data

We have participated, on behalf of Speedium Networks, in storing useful data since the beginning of the Slingshot program. We have stored multiple datasets, explicitly selected on behalf of the Protocol Labs team, to further the value of the network. The datasets in question comprise several important scientific studies: cancer research, genome analyses, observations of the galaxy, weather reports, DNA databases and much more.

The corresponding dataset (1000 Genomes Phase 3 Reanalysis with DRAGEN 3.5 and 3.7) is publicly available and is stored through us on the Filecoin network in perpetuity, for the benefit of mankind. As a result, there is no customer contact to share.

We request DataCap in order to store useful, public and open datasets, accessible to everyone. The market for paying clients and corporate data (FIL-E) is still developing. By storing useful public data on the network, we can demonstrate the opportunities and usability of Filecoin to potential paying (FIL-E) customers.

The reason for using the Filecoin+ network is that these datasets lend themselves perfectly to Filecoin's mission. Namely, to "store humanity's most important information." We strongly believe that useful information advances the entire Filecoin ecosystem, so we encourage everyone to ensure that the data for which a request is made is actually useful to everyone.

If it is purely a business request for encrypted data, it should go through FIL-E. Website data scraped without consent, CAR files created by other data preparers without their consent, and unimportant data such as cooking courses, security footage or completely random images do not fulfill the mission of the Filecoin+ program and should not be accepted.

The exact size of each dataset is known in advance, and from that size we can determine how much DataCap is needed for the required number of replicas. The exact volumes are visible in the AWS, Azure or Google Cloud buckets where the data is currently stored.
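The sizing described above is straightforward arithmetic: the DataCap needed is the raw dataset size multiplied by the number of replicas to be stored. A minimal sketch (the figures are illustrative, not the actual bucket sizes):

```python
# Hypothetical DataCap sizing: required DataCap equals the raw
# dataset size times the number of replicas. Example numbers only.
def required_datacap_tib(dataset_tib: float, replicas: int) -> float:
    return dataset_tib * replicas

# e.g. a 500 TiB dataset stored with 10 replicas needs 5000 TiB of DataCap
print(required_datacap_tib(500, 10))  # 5000.0
```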

Files and Processing

We have set up a BGP endpoint in Amsterdam, the Netherlands, from which we download, pack and distribute the datasets mentioned above. Our bandwidth at this location is 80 Gbps, sufficient to process 0.5 PiB of raw data daily and send it to selected storage providers. We have provisioned 30 PiB of storage capacity and several machines for building CAR files, plus web servers to serve downloads.
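The 0.5 PiB/day figure is consistent with the stated link capacity; a back-of-the-envelope check, assuming a fully sustained 80 Gbps line rate:

```python
# Sanity check: can an 80 Gbps link move 0.5 PiB of raw data per day?
# Assumes sustained line rate with no protocol overhead (idealized).
GBPS = 80
bytes_per_day = GBPS * 1e9 / 8 * 86400   # bytes transferred per day
pib_per_day = bytes_per_day / 2**50      # convert to PiB (2^50 bytes)
print(round(pib_per_day, 2))             # ~0.77 PiB/day at line rate
```

So 80 Gbps gives roughly 0.77 PiB/day of headroom, comfortably above the claimed 0.5 PiB daily throughput.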

We take our work as a packing provider for the Filecoin network extremely seriously and do our best to run demos with paying customers. We believe this is for the future benefit of the whole network.

Our download locations such as AWS buckets, Azure or Google cloud tend to have bandwidth between 10 and 40 Gbps. The storage providers we select for data storage vary between 1 and 30 Gbps. We ensure that the larger data sets are distributed only to capable storage providers and monitor them hourly for reachability.

We work with Singularity, which is still under development, and sometimes suffer duplicate deals constructed by this tool. This is mainly because, at our packing scale of 100-500 TiB daily, files can be processed twice when the data in the buckets changes (for example, a file's date). In addition, our selected storage providers review data before processing, and we recently started regular duplicate checks across all of our available pools to keep duplicates to an absolute minimum.
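The duplicate problem described above (the same bytes re-packed because a timestamp or filename changed in the source bucket) can be avoided by keying on file contents rather than metadata. A minimal sketch of such a check, not the actual Singularity tooling:

```python
# Content-based duplicate check (illustrative sketch): hash file bytes
# so renames or touched timestamps do not cause the same data to be
# packed into a second deal.
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of a file's contents, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(paths):
    """Return (duplicate, original) pairs among the given files."""
    seen, dupes = {}, []
    for p in paths:
        digest = content_hash(p)
        if digest in seen:
            dupes.append((p, seen[digest]))  # same bytes, different name
        else:
            seen[digest] = p
    return dupes
```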

I hope this sufficiently informs the community. If there are any further questions, please let me know. We are in favor of full transparency and openness. Together we will move Filecoin forward and build the future of Web3.

herrehesse commented 1 year ago

@Joss-Hua - Good question! 500T weekly is more than enough.

chuangyuhudong commented 1 year ago

To be clear: you are a storage provider, but also a platform?

This seems very vague to me, and looks like just a way to grow your miner with the FIL+ multiplier.

I do not support this kind of request in any way and would advise notaries not to sign.

herrehesse commented 1 year ago

I find it very childish of the applicant in question (@chuangyuhudong) to respond this way: it is a copy of a response I made to one of his own DataCap requests, which itself is doubtful in following the Filecoin+ ecosystem rules.

Is this the level we have descended to?

I would love to have an adult-level discussion about DataCap misused for fraudulent practices or completely useless files. The Filecoin+ program is meant to advance the entire Filecoin ecosystem by increasing the value of data storage; it is now mostly used for personal gain and abusive growth with spoofed files or fake data.

It has to stop; being 99% down from our all-time high is enough. @raghavrmadya @cryptowhizzard

cryptowhizzard commented 1 year ago

REQUEST MOVED TO: #1549

simonkim0515 commented 1 year ago

Closing due to new application created.