filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Ghost Byte Inc - Human PanGenomics Project [1/3] #1233

Closed Trevor-K-Smith closed 1 year ago

Trevor-K-Smith commented 1 year ago

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information


DP Info

Client Info

Data Info

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

Ghost Byte Inc is a storage provider seeking to onboard data to meet the high demand of FIL+ for itself and its partners. Ghost Byte has a history of actively participating in NA weekly calls, helping community members on the Slack channel, testing beta software and providing feedback, and offering ongoing support in the Filecoin community. Ghost Byte works with industry partners to support the growth of Web3 adoption.
Ref: https://www.youtube.com/watch?v=6PejYUlN0AM

What is the primary source of funding for this project?

Ghost Byte Inc

What other projects/ecosystem stakeholders is this project associated with?

Ghost Byte Inc

Use-case details

Describe the data being stored onto Filecoin

Each parent in the trio was sequenced with Illumina short reads; each child was sequenced with Illumina short reads, 10X Genomics, Nanopore, PacBio CLR and HiFi, Bionano, and Hi-C.

For Nanopore datasets, each folder contains the fast5 files, fastq files (basecalled with Guppy 2.3.5 flip-flop with the high-accuracy model), and a sequencing summary file.

For PacBio CLR data, each folder contains a subread bam file which can be converted to fasta/q using either bam2fastq or samtools fasta. The HiFi folders contain ccs.bam files which have already been converted from subreads into high-fidelity reads. As before, they can be converted to fasta/q using bam2fastq or samtools fasta.

For Bionano data, each folder contains both the assembled optical map (cmap) and the individual molecules (bnx.gz).

For the remaining short-read data, each folder contains one or more subfolders with fastq.gz files.

The above is a short blurb on the details of the data.

Where was the data in this dataset sourced from?

This data is being replicated from AWS Open Data to Filecoin. The dataset being replicated is the Human PanGenomics Project. It contains 188,823 objects totaling 1.32 PiB. The data will be replicated a total of 10 times, for a total DataCap request of 13.2 PiB.
Ref: https://registry.opendata.aws/hpgp-data/
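The sizing above is simple arithmetic; a quick check using the figures from the answer:

```python
# Quick arithmetic check on the DataCap request (figures from the answer above).
dataset_size_pib = 1.32   # Human PanGenomics Project total size
replications = 10         # planned number of replicas

total_datacap_pib = dataset_size_pib * replications
print(f"{total_datacap_pib:.1f} PiB")  # 13.2 PiB
```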

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

https://www.dropbox.com/sh/ksod9gdgq1m0cp2/AADkprdDb4zdiO44sEvUDe8Ya?dl=0

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Public - https://registry.opendata.aws/hpgp-data/

What is the expected retrieval frequency for this data?

Low; roughly once every 1-3 years.

For how long do you plan to keep this dataset stored on Filecoin?

540 Days, subject to renewal when the time comes.

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

Global partners. No restrictions; we will spread out as much as possible.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

Data will be sent over boostd to participating storage providers. Offline deals can also be arranged for those with special requirements.

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

Storage providers will be found in the active community Slack, among partners met at events, and via online data sources. Allocation will be 1 replica per actor and 2 per organization, spread as evenly across the globe as possible, for a total of 10 replications. SPs will confirm ahead of replication that they intend to keep the CAR files accessible and retrievable. SPs will not be required to keep unsealed sectors for this replication, as this is disaster-recovery data and retrievals will be low.
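The per-actor and per-organization limits described above can be expressed as a simple validity check. This is a sketch under stated assumptions: the function, data structures, and example IDs are hypothetical, not part of any actual allocation tooling.

```python
# Sketch: validate a proposed replica assignment against the limits stated above
# (1 replica per actor, 2 per organization, 10 replicas total). Names are hypothetical.
from collections import Counter

MAX_PER_ACTOR = 1
MAX_PER_ORG = 2
TOTAL_REPLICAS = 10

def valid_assignment(replicas: list[tuple[str, str]]) -> bool:
    """`replicas` is a list of (actor_id, org_id) pairs, one per replica of a CAR file."""
    actors = Counter(actor for actor, _ in replicas)
    orgs = Counter(org for _, org in replicas)
    return (
        len(replicas) == TOTAL_REPLICAS
        and max(actors.values()) <= MAX_PER_ACTOR
        and max(orgs.values()) <= MAX_PER_ORG
    )

# Ten distinct actors across five organizations, no org holding more than two replicas.
plan = [(f"f0{i}", f"org{i // 2}") for i in range(10)]
print(valid_assignment(plan))  # True
```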

How will you be distributing deals across storage providers?

Singularity will be used to distribute the deals and track the progress of each CAR file being replicated: 1 replica per actor, 2 per organization, spread as evenly across the globe as possible, for a total of 10 replications.

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

Yes, we have the resources to get started right away. We do not need help at this time. Thank you!
large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

simonkim0515 commented 1 year ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

100TiB

Client address

f1bqhekdcmuqgnajsdkvmhqefaj7jexnob565rzya

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1uid3opdhqtw4to7wj3jjrcyb4xrbpgm7ang7izi

DataCap allocation requested

50TiB

Id

a6107783-4d5d-4a00-bc75-9f1f46f4443c

IreneYoung commented 1 year ago

@Trevor-K-Smith How many SPs do you intend to make deals with? And can you list the SPs you have contacted so far?

large-datacap-requests[bot] commented 1 year ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!
Trevor-K-Smith commented 1 year ago

> @Trevor-K-Smith How many SPs do you intend to make deals with? And can you list the SPs you have contacted so far?

Hi Irene!

There will be a minimum of 10 SPs for this dataset. There will likely be many more, as we have internal tooling to split the replication between SPs geographically and organizationally at the individual CAR level, similar to SPADE for Slingshot.

SPs will include, but are certainly not limited to:

- Seal Storage
- Hut8
- ISOTechnics
- PiKNiK
- Linkspeed
- Nonentrophy

Additional SPs will be located via Slack and BDE.

Let me know if there are any questions.

large-datacap-requests[bot] commented 1 year ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!
Trevor-K-Smith commented 1 year ago

@simonkim0515 @raghavrmadya

Could I kindly get this application reviewed? Thanks.

herrehesse commented 1 year ago

Dear Filecoin+ Github applicant,

We have noticed that the dataset is already (partly) on chain. While we appreciate your enthusiasm to contribute to the Filecoin network, we want to remind you that this behaviour may not be beneficial to the network. Can you explain to me what happened here?

Thank you for your understanding and cooperation.

(Screenshot attached: 2023-02-22 11:27:40)
Trevor-K-Smith commented 1 year ago

@herrehesse

Looking at the dates on all the other applications, they are all after mine.

I performed research and selected a dataset to be onboarded to the network. At the time this issue was created, the dataset was not on the Filecoin network. The applications posted above all have creation dates well over a month after mine. They did not do the research to find a dataset that was not already in the process of being onboarded.

My applications are clearly labeled. They should have seen them.

Why are the other applications being approved when I had prior approval for this entire dataset?

GhostByteInc commented 1 year ago

@simonkim0515 @raghavrmadya @Kevin-FF-USA @galen-mcandrew

Hello

Kindly pinging again, as these applications continue to sit with no movement. Can I please be informed of what the holdup with this application is?

Thanks,

--Trevor K Smith

large-datacap-requests[bot] commented 1 year ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!
large-datacap-requests[bot] commented 1 year ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!
xinaxu commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacedlwxbkknt6sok5wncuidr2faheevxkxqgfo6f36xdcg4p4ebrq2m

Address

f1uid3opdhqtw4to7wj3jjrcyb4xrbpgm7ang7izi

Datacap Allocated

50.00TiB

Signer Address

f1k3ysofkrrmqcot6fkx4wnezpczlltpirmrpsgui

Id

a6107783-4d5d-4a00-bc75-9f1f46f4443c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedlwxbkknt6sok5wncuidr2faheevxkxqgfo6f36xdcg4p4ebrq2m

xinaxu commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacedrhkpvi3qhfyndkevpnsbfijd4euynmdqo5ffg5ivli3zjkoxers

Address

f1uid3opdhqtw4to7wj3jjrcyb4xrbpgm7ang7izi

Datacap Allocated

50.00TiB

Signer Address

f1k3ysofkrrmqcot6fkx4wnezpczlltpirmrpsgui

Id

a6107783-4d5d-4a00-bc75-9f1f46f4443c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedrhkpvi3qhfyndkevpnsbfijd4euynmdqo5ffg5ivli3zjkoxers

aggregation-and-compliance-bot[bot] commented 9 months ago
Client f01638263 does not follow the datacap usage rules. More info here. This application has been failing the requirements for 7 days. Please take appropriate action to fix the following DataCap usage problems.

| Criteria | Threshold | Reason |
| --- | --- | --- |
| Percent of used DataCap stored with top provider | < 75 | The percent of data from the client that is stored with their top provider is 100%. This should be less than 75%. |
data-programs commented 5 months ago
KYC

This user’s identity has been verified through filplus.storage