filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Ghost Byte Inc - Human PanGenomics Project [2/3] #1958

Closed GhostByteInc closed 1 year ago

GhostByteInc commented 1 year ago

Data Owner Name

Human PanGenomics Project

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://humanpangenome.org/

Social Media

https://twitter.com/HumanPangenome
https://www.facebook.com/genome.gov/videos/human-pangenome/150197303548436/

Total amount of DataCap being requested

5PiB

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1rc75rubsis6cz5wzbspwztedpmlxc2qqa2gn7ry

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

At Ghost Byte Inc, we are a dedicated storage provider looking to onboard data and contribute to the testing of onboarding tools. As an active participant in North America's weekly calls, we take pride in our involvement in the Filecoin community by assisting fellow members on the Slack channel, providing feedback on beta software, and consistently offering support within the community.

Our collaboration with industry partners is focused on fostering growth in web3 adoption, and we remain committed to pushing the boundaries of decentralized storage solutions. For more information on our activities and contributions, you can refer to this YouTube video: https://www.youtube.com/watch?v=6PejYUlN0AM

---

The Human Pangenome Reference Consortium is a collaborative initiative aimed at creating a more sophisticated and complete human reference genome. The current human reference genome, which serves as the most widely used resource in human genetics, is primarily based on merged haplotypes from over 20 individuals, with a single individual comprising most of the sequence. This linear composite structure contains biases and errors and lacks representation of global human genomic variation.

Recognizing the need for an improved reference genome, the Human Pangenome Reference Consortium was formed to develop a high-quality, graph-based, telomere-to-telomere representation of global genomic diversity. The consortium leverages innovations in technology, study design, and global partnerships to construct the highest-possible quality human pangenome reference.

The ultimate goal of the project is to improve data representation and streamline analyses for the routine assembly of complete diploid genomes. By incorporating a more accurate and diverse representation of global genomic variation, the human pangenome reference will enhance gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine. The project is also committed to addressing ethical considerations to ensure that the human pangenome reference is developed responsibly and fairly.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

Each parent in the trio was sequenced with Illumina short reads; each child was sequenced with Illumina short reads, 10X Genomics, Nanopore, PacBio CLR and HiFi, Bionano, and Hi-C.

For Nanopore datasets, each folder contains the fast5 files, the fastq files (basecalled with Guppy 2.3.5 flip-flop using the high-accuracy model), and a sequencing summary file.

For PacBio CLR data, each folder contains a subread BAM file, which can be converted to fasta/q using either bam2fastq or samtools fasta. The HiFi folders contain ccs.bam files that have already been converted from subreads into high-fidelity reads; as before, they can be converted to fasta/q with bam2fastq or samtools fasta.
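To make the conversion step concrete, here is a minimal Python sketch that batch-converts HiFi ccs.bam files by shelling out to samtools fasta (bam2fastq from PacBio's toolkit would work equally well, as noted above). The directory layout and file names are hypothetical placeholders, not the dataset's actual structure:

```python
import subprocess
from pathlib import Path

def bam_to_fasta(bam_path: str, out_fasta: str) -> None:
    """Convert a PacBio subread or ccs BAM to FASTA via `samtools fasta`.

    Assumes samtools is available on PATH; `samtools fasta` writes
    FASTA records to stdout, which we redirect to the output file.
    """
    with open(out_fasta, "w") as out:
        subprocess.run(["samtools", "fasta", bam_path], stdout=out, check=True)

# Hypothetical directory and file names, for illustration only.
for bam in Path("HG002/PacBio_HiFi").glob("*.ccs.bam"):
    bam_to_fasta(str(bam), str(bam.with_suffix(".fasta")))
```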

For Bionano data, each folder contains both the assembled optical map (cmap) and the individual molecules (bnx.gz).

For the remaining short-read data, each folder contains one or more subfolders with fastq.gz files.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

Singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

https://www.dropbox.com/sh/ksod9gdgq1m0cp2/AADkprdDb4zdiO44sEvUDe8Ya?dl=0

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

Data retrieval is expected to be used for testing or disaster recovery purposes. All storage providers must allow retrievals at a reasonable rate, taking into account the current capabilities of the Filecoin network. To maintain the most efficient storage rates for the given use case, unsealed copies will not be required when storing this dataset.

Storage providers will be tested periodically to ensure that their data remains retrievable within a reasonable timeframe, set at a maximum of 12 hours. If any issues are identified, the problematic storage provider will be contacted to resolve them immediately or, if necessary, removed from the project.
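As an illustration of what such a periodic spot-check might look like, here is a minimal Python sketch that times retrievals against the 12-hour window. The application does not name a testing tool, so the use of the lassie retrieval client is an assumption, and the sample CIDs are hypothetical placeholders; a real run would sample CIDs from the project's deal records:

```python
import subprocess
import time

# Hypothetical payload CIDs to spot-check, for illustration only.
SAMPLE_CIDS = ["bafybeie...example1", "bafybeie...example2"]
MAX_SECONDS = 12 * 60 * 60  # the 12-hour retrieval window described above

def retrievable_within_window(cid: str) -> bool:
    """Try to fetch a CID with the lassie client and report whether the
    retrieval finished inside the allowed window."""
    start = time.monotonic()
    try:
        result = subprocess.run(
            ["lassie", "fetch", "-o", f"{cid}.car", cid],
            timeout=MAX_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and (time.monotonic() - start) <= MAX_SECONDS

for cid in SAMPLE_CIDS:
    status = "retrievable" if retrievable_within_window(cid) else "FAILED - contact SP"
    print(f"{cid}: {status}")
```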

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

Others

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

f01886797
f01963614
f01886690
f01998430
f01997834
f01611097

Some partners prefer not to disclose their identifying information publicly. If there are any additional storage providers reading this post who are interested in participating, please feel free to comment. We can replicate portions of the dataset for you if your involvement aligns with the project's ultimate objectives.

How do you plan to make deals to your storage providers

Others/custom tool

If you answered "Others/custom tool" in the previous question, enter the details here

DELTA
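DELTA is driven over an HTTP API, so deal-making from this pipeline would amount to posting deal proposals to a running DELTA instance. The sketch below shows roughly what proposing a verified (DataCap-backed) deal might look like; the endpoint path, port, payload fields, and auth scheme are all assumptions for illustration, not DELTA's confirmed interface, so consult the DELTA documentation for the real API:

```python
import requests  # third-party: pip install requests

# Assumed deployment details; adjust for your DELTA instance.
DELTA_URL = "http://localhost:1414"
API_KEY = "REPLACE_WITH_YOUR_API_KEY"

# Hypothetical payload shape, for illustration only.
payload = {
    "cid": "bafybeie...examplePiece",  # a prepared piece of the dataset
    "provider": "f01886797",           # one of the SPs listed above
    "duration": 1468800,               # ~510 days in 30s epochs, within the 1.5-2 year plan
    "verified": True,                  # a DataCap-backed (verified) deal
}

resp = requests.post(
    f"{DELTA_URL}/api/v1/deal",        # hypothetical endpoint path
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```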

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and get back to you soon.

Sunnyiscoming commented 1 year ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

1PiB

Client address

f1rc75rubsis6cz5wzbspwztedpmlxc2qqa2gn7ry

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1rc75rubsis6cz5wzbspwztedpmlxc2qqa2gn7ry

DataCap allocation requested

256TiB

Id

97cae058-83a6-4837-9d69-db562a09682d

laudiacay commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacedpswsvhzo2qhvhroyxytc2oocn3bhqjfv2gf5mzgroodg7nuzz2w

Address

f1rc75rubsis6cz5wzbspwztedpmlxc2qqa2gn7ry

Datacap Allocated

256.00TiB

Signer Address

f1oc6qvenzp7wsriu7edyebb325gnaovktmujl7jq

Id

97cae058-83a6-4837-9d69-db562a09682d

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedpswsvhzo2qhvhroyxytc2oocn3bhqjfv2gf5mzgroodg7nuzz2w

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 14 days, so for now it is being closed. Please feel free to re-open if this is relevant, or start a new application for DataCap anytime. Thank you!