filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] PUBLIC DATA-COVID-19 Genome Sequence Dataset [2/2] #1881

Closed nora310 closed 10 months ago

nora310 commented 1 year ago

Data Owner Name

National Library of Medicine (NLM)

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://www.ncbi.nlm.nih.gov/sra/docs/sra-aws-download/

Social Media

Twitter: https://twitter.com/NIH_OSP
YouTube: https://www.youtube.com/@nihofficeofsciencepolicy3005/videos

Total amount of DataCap being requested

5PiB

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1ynotqzocbg2mzsnhtie6ozkyzxtkkdj2n3ooily

Data Type of Application

None

Custom multisig

Identifier

No response

Share a brief history of your project and organization

A centralized sequence repository for all records containing sequences associated with the novel coronavirus (SARS-CoV-2) submitted to the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA). Included are both the original sequences submitted by the principal investigator and SRA-processed sequences that require the SRA Toolkit for analysis. Additionally, submitter-provided metadata included in associated BioSample and BioProject records is available alongside NCBI-calculated data, such as k-mer-based taxonomy analysis results, contiguous assemblies (contigs) and associated statistics such as contig length, BLAST results for the assembled contigs, contig annotation, BLAST databases of contigs and their annotated peptides, and VCF files generated for each record relative to the SARS-CoV-2 RefSeq record. Finally, metadata is also made available in Parquet format to facilitate search and filtering using the AWS Athena service.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

Genomic sequence reads of SARS-CoV-2 and related Coronaviridae, organized by NCBI accession. Files in the sra-src folder are in FASTQ, BAM, or CRAM format (original submission); files in the run folder are in .sra format and require the SRA Toolkit; metadata for sra-pub-sars-cov2 is provided in an Athena-queryable format.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

IPFS, lotus, singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

https://registry.opendata.aws/ncbi-covid-19/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America, Europe

How will you be distributing your data to storage providers

IPFS, Shipping hard drives

How do you plan to choose storage providers

Slack, Big data exchange

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Lotus client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

ghost commented 10 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 10 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 51.72% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.
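
For readers unfamiliar with the replication figure in the summary above, here is a minimal sketch of how a "deals replicated across fewer than 4 storage providers" percentage could be computed. It is not the checker bot's actual implementation: the function name `low_replication_share`, the `(piece_cid, provider_id)` tuple layout, the count-based weighting, and the toy deal list are all assumptions made for illustration.

```python
from collections import defaultdict


def low_replication_share(deals, min_providers=4):
    """Return the percentage of deals whose data sits with fewer than
    `min_providers` distinct storage providers.

    `deals` is a list of (piece_cid, provider_id) tuples (hypothetical layout).
    Deals are weighted by count here; a real checker may weight by deal size.
    """
    providers_by_piece = defaultdict(set)
    for piece_cid, provider_id in deals:
        providers_by_piece[piece_cid].add(provider_id)

    if not deals:
        return 0.0

    low = sum(
        1
        for piece_cid, _ in deals
        if len(providers_by_piece[piece_cid]) < min_providers
    )
    return 100.0 * low / len(deals)


# Toy data, not taken from this application's on-chain deals.
toy_deals = [
    ("piece-a", "sp1"), ("piece-a", "sp2"), ("piece-a", "sp3"), ("piece-a", "sp4"),
    ("piece-b", "sp1"), ("piece-b", "sp2"),
    ("piece-c", "sp3"),
]
print(f"{low_replication_share(toy_deals):.2f}% of deals are under-replicated")
```

In the toy data, only `piece-a` reaches four distinct providers, so the three deals for `piece-b` and `piece-c` count as under-replicated out of seven deals in total.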

ghost commented 10 months ago

SPs provided

f02218611 | Dave | Dulles Town Center, Virginia, US | no | Dave
f01844118 | Garry | no email here | New York City, New York, US | no | garry
f01844232 | Lukas | Dallas, Texas, US | no | Lukas
f01844043 | Abel | Abel | Ashburn, US | no | Abel
f01843994 | Hans | individual | Dulles Town Center, Virginia, US | no | hans6

Emailed the SPs to confirm entity and location. All emails bounced; the addresses do not exist. There is also no email for Garry in NYC. Closing this until the SPs are confirmed.