filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Ghost Byte Inc - Human PanGenomics Project [1/3] #1232

Closed GhostByteInc closed 1 year ago

GhostByteInc commented 1 year ago

Large Dataset Notary Application

To apply for DataCap to onboard your dataset to Filecoin, please fill out the following.

Core Information

DP Info

Client Info

Data Info

Please respond to the questions below by replacing the text saying "Please answer here". Include as much detail as you can in your answer.

Project details

Share a brief history of your project and organization.

Ghost Byte Inc is a storage provider seeking to onboard data to meet the high demand for Fil+ deals, both for itself and for its partners. Ghost Byte has a history of active participation in the NA weekly calls, helping community members on the Slack channel, testing beta software and providing feedback, and offering ongoing support to the Filecoin community. Ghost Byte works with industry partners to support the growth of web3 adoption.
Ref: https://www.youtube.com/watch?v=6PejYUlN0AM

What is the primary source of funding for this project?

Ghost Byte Inc

What other projects/ecosystem stakeholders is this project associated with?

Ghost Byte Inc

Use-case details

Describe the data being stored onto Filecoin

Each parent in the trio was sequenced with Illumina short reads; each child was sequenced with Illumina short reads, 10X Genomics, Nanopore, PacBio CLR and HiFi, Bionano, and Hi-C.

For Nanopore datasets, each folder contains the fast5 files, the fastq files (basecalled with Guppy 2.3.5 flip-flop with the high-accuracy model), and a sequencing summary file.

For PacBio CLR data, each folder contains a subread bam file which can be converted to fasta/q using either bam2fastq or samtools fasta. The HiFi folders contain ccs.bam files which have already been converted from subreads into high-fidelity reads. As before, they can be converted to fasta/q using bam2fastq or samtools fasta.

For Bionano data, each folder contains both the assembled optical map (cmap) and the individual molecules (bnx.gz).

For the remaining short-read data, each folder contains one or more subfolders with fastq.gz files.

The above is a short blurb on the details of the data.
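As a rough illustration of the folder layout described above, the per-platform file types and their conversion paths could be summarized in a small lookup. The suffix-to-platform mapping and the `classify` helper below are assumptions for illustration, not part of the HPP dataset itself:

```python
# Sketch of the per-platform folder contents described above.
# The suffix rules and classify() helper are illustrative assumptions,
# not part of the HPP dataset itself.

PLATFORM_FILES = {
    "Nanopore": [".fast5", ".fastq", "sequencing_summary.txt"],
    "PacBio CLR": [".subreads.bam"],   # convert with bam2fastq or samtools fasta
    "PacBio HiFi": [".ccs.bam"],       # already high-fidelity reads; same conversion tools
    "Bionano": [".cmap", ".bnx.gz"],
    "Short reads": [".fastq.gz"],
}

def classify(filename: str) -> str:
    """Guess which sequencing platform a file belongs to by its suffix."""
    if filename.endswith(".subreads.bam"):
        return "PacBio CLR"
    if filename.endswith(".ccs.bam"):
        return "PacBio HiFi"
    if filename.endswith((".cmap", ".bnx.gz")):
        return "Bionano"
    if filename.endswith(".fast5"):
        return "Nanopore"
    if filename.endswith(".fastq.gz"):
        return "Short reads"
    return "unknown"

print(classify("sample1.subreads.bam"))  # PacBio CLR
```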

Where was the data in this dataset sourced from?

This data is being replicated from AWS Open Data to Filecoin. The dataset being replicated is the Human PanGenomics Project. It contains 188,823 objects totaling 1.32 PiB. The data will be replicated a total of 10 times, for a total DataCap request of 13.2 PiB.
Ref: https://registry.opendata.aws/hpgp-data/
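A quick back-of-the-envelope check of the figures above. The object count, dataset size, and replica count are taken from the answer; the derived average object size is an illustrative calculation, not a figure from the dataset registry:

```python
# Sanity-check the DataCap arithmetic stated above.
PIB = 2**50                      # bytes in one PiB
dataset_bytes = 1.32 * PIB       # 1.32 PiB total size
objects = 188_823                # total objects in the dataset
replicas = 10                    # planned number of replications

total_request_pib = dataset_bytes * replicas / PIB
avg_object_gib = dataset_bytes / objects / 2**30

print(f"total DataCap request: {total_request_pib:.2f} PiB")  # 13.20 PiB
print(f"average object size: {avg_object_gib:.2f} GiB")       # 7.33 GiB
```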

Can you share a sample of the data? A link to a file, an image, a table, etc., are good ways to do this.

https://www.dropbox.com/sh/ksod9gdgq1m0cp2/AADkprdDb4zdiO44sEvUDe8Ya?dl=0

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Public - https://registry.opendata.aws/hpgp-data/

What is the expected retrieval frequency for this data?

1-3 times per year

For how long do you plan to keep this dataset stored on Filecoin?

540 Days, subject to renewal when the time comes.

DataCap allocation plan

In which geographies (countries, regions) do you plan on making storage deals?

Global partners. No restrictions; we will spread out as much as possible.

How will you be distributing your data to storage providers? Is there an offline data transfer process?

Data will be sent via boostd to participating storage providers. Otherwise, offline deals can be arranged for those with special requirements.

How do you plan on choosing the storage providers with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

Storage providers will be found through the active community Slack, partners met at events, and online data sources. Limits: 1 replica per actor, 2 per organization, spread as evenly across the globe as possible. Total of 10 replications.

How will you be distributing deals across storage providers?

Singularity will be used to serve the deals and track the progress of each CAR file being replicated. Limits: 1 replica per actor, 2 per organization, spread as evenly across the globe as possible. Total of 10 replications.
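The distribution rule above (10 replicas, at most 1 per actor and at most 2 per organization) could be sketched as a simple selection pass. The SP IDs and org names below are invented for illustration, and this is not how Singularity itself assigns deals:

```python
from collections import Counter

def pick_replica_targets(providers, replicas=10, max_per_org=2):
    """Pick up to `replicas` distinct SPs, allowing at most
    `max_per_org` SPs from the same organization.

    `providers` is a list of (miner_id, org) pairs; each entry is
    chosen at most once, so each actor holds at most one replica.
    """
    per_org = Counter()
    chosen = []
    for miner_id, org in providers:
        if len(chosen) == replicas:
            break
        if per_org[org] < max_per_org:
            chosen.append(miner_id)
            per_org[org] += 1
    return chosen

# Hypothetical SP list: IDs and org names are invented for the sketch.
sps = [("f01001", "orgA"), ("f01002", "orgA"), ("f01003", "orgA"),
       ("f01004", "orgB"), ("f01005", "orgB"),
       ("f01006", "orgC"), ("f01007", "orgD")]

print(pick_replica_targets(sps, replicas=5))
# f01003 is skipped because orgA already holds 2 replicas
```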

Do you have the resources/funding to start making deals as soon as you receive DataCap? What support from the community would help you onboard onto Filecoin?

Yes, we have the resources to get started right away. We do not need help at this time. Thank you!
large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.
