filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Ghost Byte Inc - Human PanGenomics Project [1/3] #1957

Closed GhostByteInc closed 1 year ago

GhostByteInc commented 1 year ago

Data Owner Name

Human PanGenomics Project

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://humanpangenome.org/

Social Media

https://twitter.com/HumanPangenome
https://www.facebook.com/genome.gov/videos/human-pangenome/150197303548436/

Total amount of DataCap being requested

5PiB

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1stkkjijjh2l33f3qpx5z63fdz3ukfnki5tku5jy

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

At Ghost Byte Inc, we are a dedicated storage provider looking to onboard data and contribute to the testing of onboarding tools. As an active participant in North America's weekly calls, we take pride in our involvement in the Filecoin community by assisting fellow members on the Slack channel, providing feedback on beta software, and consistently offering support within the community.

Our collaboration with industry partners is focused on fostering growth in web3 adoption, and we remain committed to pushing the boundaries of decentralized storage solutions. For more information on our activities and contributions, you can refer to this YouTube video: https://www.youtube.com/watch?v=6PejYUlN0AM

---

The Human Pangenome Reference Consortium is a collaborative initiative aimed at creating a more sophisticated and complete human reference genome. The current human reference genome, which serves as the most widely used resource in human genetics, is primarily based on merged haplotypes from over 20 individuals, with a single individual comprising most of the sequence. This linear composite structure contains biases and errors and lacks representation of global human genomic variation.

Recognizing the need for an improved reference genome, the Human Pangenome Reference Consortium was formed to develop a high-quality, graph-based, telomere-to-telomere representation of global genomic diversity. The consortium leverages innovations in technology, study design, and global partnerships to construct the highest-possible quality human pangenome reference.

The ultimate goal of the project is to improve data representation and streamline analyses for the routine assembly of complete diploid genomes. By incorporating a more accurate and diverse representation of global genomic variation, the human pangenome reference will enhance gene-disease association studies across populations, expand the scope of genomics research to the most repetitive and polymorphic regions of the genome, and serve as the ultimate genetic resource for future biomedical research and precision medicine. The project is also committed to addressing ethical considerations to ensure that the human pangenome reference is developed responsibly and fairly.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

Each parent in the trio was sequenced with Illumina short reads; each child was sequenced with Illumina short reads, 10X Genomics, Nanopore, PacBio CLR and HiFi, Bionano, and Hi-C.

For nanopore datasets, each folder contains the fast5, fastq (basecalled with Guppy 2.3.5 flip flop with the high accuracy model), and a sequencing summary file.

For PacBio CLR data, each folder contains a subread bam file which can be converted to fasta/q using either bam2fastq or samtools fasta. The HiFi folders contain ccs.bam files which have already been converted from subreads into high-fidelity reads. As before, they can be converted to fasta/q using bam2fastq or samtools fasta.

For Bionano data, each folder contains both the assembled optical map (cmap) and the individual molecules (bnx.gz).

For the remaining short-read data, each folder contains one or more subfolders with fastq.gz files.
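
The BAM-to-FASTA/FASTQ conversion mentioned above can be sketched as follows. This is a minimal illustration, assuming `samtools` is installed and on the PATH; the helper names `build_convert_cmd` and `convert` and the file names are hypothetical and not part of the dataset's tooling.

```python
import subprocess
from typing import List


def build_convert_cmd(bam_path: str, fmt: str = "fastq") -> List[str]:
    """Build the samtools command line that converts a BAM file.

    fmt selects the samtools subcommand: "fasta" or "fastq".
    """
    if fmt not in ("fasta", "fastq"):
        raise ValueError(f"unsupported format: {fmt!r}")
    return ["samtools", fmt, bam_path]


def convert(bam_path: str, out_path: str, fmt: str = "fastq") -> None:
    """Run samtools and write its standard output to out_path.

    With no output-file options given, samtools fasta/fastq write the
    converted reads to standard output, which is redirected here.
    """
    with open(out_path, "wb") as out:
        subprocess.run(build_convert_cmd(bam_path, fmt), stdout=out, check=True)


if __name__ == "__main__":
    # Hypothetical file name, shown only to illustrate the command built.
    print(" ".join(build_convert_cmd("example.ccs.bam", "fasta")))
```

Calling `convert("example.ccs.bam", "example.fasta", "fasta")` would run the conversion for a single file; the same call works for the CLR subread BAM files.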

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

https://www.dropbox.com/sh/ksod9gdgq1m0cp2/AADkprdDb4zdiO44sEvUDe8Ya?dl=0

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

Data retrieval is expected to be used for testing or disaster recovery purposes. All storage providers must allow retrievals at a reasonable rate, taking into account the current capabilities of the Filecoin network. To maintain the most efficient storage rates for the given use case, unsealed copies will not be required when storing this dataset.

Storage providers will be periodically tested to ensure that their data remains retrievable within a reasonable timeframe, set at a maximum of 12 hours. If any issues are identified, the problematic storage provider will be contacted to resolve the issue immediately or be removed from the project if necessary.
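
The periodic retrievability check described above could be tracked along these lines. This is a rough sketch only: the function name `flag_slow_providers` and the sample timings are hypothetical, and how a retrieval is actually measured is not specified in this application. Only the 12-hour limit comes from the text.

```python
from datetime import timedelta
from typing import Dict, List, Optional

# Maximum acceptable retrieval time stated in the application.
MAX_RETRIEVAL_TIME = timedelta(hours=12)


def flag_slow_providers(results: Dict[str, Optional[timedelta]]) -> List[str]:
    """Return provider IDs that failed or exceeded the 12-hour limit.

    results maps a provider ID to the measured retrieval time,
    or to None when the retrieval did not complete at all.
    """
    return sorted(
        provider
        for provider, elapsed in results.items()
        if elapsed is None or elapsed > MAX_RETRIEVAL_TIME
    )


if __name__ == "__main__":
    # Made-up measurements for illustration only.
    sample = {
        "provider-a": timedelta(hours=3),
        "provider-b": timedelta(hours=13),
        "provider-c": None,
    }
    print(flag_slow_providers(sample))
```

Providers returned by this check would be the ones to contact, in line with the follow-up process described above.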

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

Others

How do you plan to choose storage providers

Slack, Big data exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

f01886797
f01963614
f01886690
f01998430
f01997834
f01611097

Some partners prefer not to disclose their identifying information publicly. If there are any additional storage providers reading this post who are interested in participating, please feel free to comment. We can replicate portions of the dataset for you if your involvement aligns with the project's ultimate objectives.

How do you plan to make deals to your storage providers

Others/custom tool

If you answered "Others/custom tool" in the previous question, enter the details here

DELTA

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

Sunnyiscoming commented 1 year ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

1PiB

Client address

f1stkkjijjh2l33f3qpx5z63fdz3ukfnki5tku5jy

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1stkkjijjh2l33f3qpx5z63fdz3ukfnki5tku5jy

DataCap allocation requested

256TiB

Id

dfa4f9a2-c9cc-4a4d-bd7d-fcfe3f81173c
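
For reference, the sizes in this request relate as follows. The sketch uses only the standard conversion 1 PiB = 1024 TiB and the figures quoted in this thread; it makes no claim about how the 256TiB figure was chosen, and the function name `allocation_shares` is hypothetical.

```python
from typing import Tuple

TIB_PER_PIB = 1024  # binary units: 1 PiB = 1024 TiB


def allocation_shares(total_pib: float, weekly_pib: float, first_tib: float) -> Tuple[float, float]:
    """Return the first allocation as a fraction of the total and of the weekly rate."""
    total_tib = total_pib * TIB_PER_PIB
    weekly_tib = weekly_pib * TIB_PER_PIB
    return first_tib / total_tib, first_tib / weekly_tib


if __name__ == "__main__":
    # Figures from this application: 5PiB total, 1PiB weekly, 256TiB first allocation.
    of_total, of_weekly = allocation_shares(5, 1, 256)
    print(f"{of_total:.0%} of total, {of_weekly:.0%} of weekly rate")
```

With these inputs, the first allocation works out to 5% of the total request and 25% of the stated weekly rate.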

laudiacay commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecdrlt3ijjnxbzjl72nsw6q42ksy63muqmasvpf7z5r5klnsbg7w2

Address

f1stkkjijjh2l33f3qpx5z63fdz3ukfnki5tku5jy

Datacap Allocated

256.00TiB

Signer Address

f1oc6qvenzp7wsriu7edyebb325gnaovktmujl7jq

Id

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecdrlt3ijjnxbzjl72nsw6q42ksy63muqmasvpf7z5r5klnsbg7w2

dannyob commented 1 year ago

Hey folks, it's been a while since I've notarised some LDNs, and the Fil+ project has matured a lot in the last few months. I'm just going to spell out here transparently what I've checked here, and what I'm planning to do.

I was contacted by the Delta tools folks and asked if I could support this application (which also includes #1958 and #1959). I've established that this is a legitimate collaboration between their team and GhostByte/Edgevana (@GhostByteInc / Trevor K Smith, it'd be great if you could expand upon that relationship here). I'll support this, but I'm keen to see this data being shared between multiple independent SPs, and that it passes the usual CID duplication tests.

I'll start with this application. I can also support the other two, but maybe you can let me know what notaries you have lined up. Previous (maybe ancient at Filecoin ecosystem speeds?) experience indicates to me that you should have maybe four notaries lined up and tag-teaming -- we can't sign the same application twice in a row, but we can alternate. Happy to fill one of those slots, and let you take the lead on which applications I should be signing, and when.

Hopefully that makes sense! Good luck with the upload!

d.

dannyob commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceaav5bixcaui72vumz7olit3wd6n3ab3yiytr3unb7uni5mjaknqa

Address

f1stkkjijjh2l33f3qpx5z63fdz3ukfnki5tku5jy

Datacap Allocated

256.00TiB

Signer Address

f1k6wwevxvp466ybil7y2scqlhtnrz5atjkkyvm4a

Id

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaav5bixcaui72vumz7olit3wd6n3ab3yiytr3unb7uni5mjaknqa

GhostByteInc commented 1 year ago

Hey Danny,

Would be very glad to have your support :)

Still working on getting notaries to review the application for additional support, so we can reach the 4 supporting notaries. I'm aware of it and constantly working on it.

I have faith laudiacay will show consistent support. I've got 2 prospects for this who seem willing, but I'll wait until I see a signature before calling them out... ha

Much appreciated,

--Trevor

laudiacay commented 1 year ago

oh whoops

laudiacay commented 1 year ago

let me get my ledger

laudiacay commented 1 year ago
[Screenshot, 2023-05-16 9:42:34 AM]

where u at buddy

I'll sign just lmk

laudiacay commented 1 year ago

wait i already signed

i am a fool

GhostByteInc commented 1 year ago

hahah yes! I was mentioning you for future support :)

Thanks so much! @laudiacay

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 14 days, so for now it is being closed. Please feel free to contact the Fil+ Gov team to re-open the application if it is still being processed. Thank you!