filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] CryptoCraft-Public Genomic Data Repositories and Resources #2149

Closed jyma closed 7 months ago

jyma commented 1 year ago

Data Owner Name

CryptoCraft-Public Dataset

What is your role related to the dataset

Storage provider filling out application on behalf of the data owner

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare


Social Media

Total amount of DataCap being requested


Expected size of single dataset (one copy)


Number of replicas to store


Weekly allocation of DataCap requested


On-chain address for first allocation


Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig


No response

Share a brief history of your project and organization

Since the era of the Space Race, our team has been actively engaged as a storage provider, extending technical services to collaborative partners across the Asian region. We have successfully sealed approximately 5PB of DC and currently hold an additional 15PB of CC that needs to be filled with DataCap. My partners and I have diligently secured the requisite pledge FIL. Kindly rest assured that we stand fully prepared.

Is this project associated with other projects/ecosystem stakeholders?


If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

1. Encyclopedia of DNA Elements (ENCODE)
- The Encyclopedia of DNA Elements (ENCODE) Consortium is an international collaboration of research groups funded by the National Human Genome Research Institute (NHGRI). The goal of ENCODE is to build a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control cells and circumstances in which a gene is active. ENCODE investigators employ a variety of assays and methods to identify functional elements. The discovery and annotation of gene elements is accomplished primarily by sequencing a diverse range of RNA sources, comparative genomics, integrative bioinformatic methods, and human curation. Regulatory elements are typically investigated through DNA hypersensitivity assays, assays of DNA methylation, and immunoprecipitation (IP) of proteins that interact with DNA and RNA, i.e., modified histones, transcription factors, chromatin regulators, and RNA-binding proteins, followed by sequencing.
- Data Source:  aws s3 ls --no-sign-request s3://encode-public/
- Size: 1.2P

2. 4D Nucleome (4DN)
- The goal of the National Institutes of Health (NIH) Common Fund’s 4D Nucleome (4DN) program is to study the three-dimensional organization of the nucleus in space and time (the 4th dimension). The nucleus of a cell contains DNA, the genetic “blueprint” that encodes all of the genes a living organism uses to produce proteins needed to carry out life-sustaining cellular functions. Understanding the conformation of the nuclear DNA and how it is maintained or changes in response to environmental and cellular cues over time will provide insights into basic biology as well as aspects of human health and disease. The 4DN is an international consortium of researchers who generate data that include results from a variety of genomics and imaging assays with a focus on, but not exclusive to, those that demonstrate close contact between chromatin loci that are non-adjacent on the linear DNA sequence of chromosomes. Additional assays probe the nuclear landscape in the context of interactions of chromatin with specific proteins, RNAs and epigenetic changes.
- Size: 180T
- Data Source: s3://4dn-open-data-public/

3. NIH NCBI Sequence Read Archive (SRA)
- The Sequence Read Archive (SRA), produced by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM) at the National Institutes of Health (NIH), stores raw DNA sequencing data and alignment information from high-throughput sequencing platforms. The SRA provides open access to these biological sequence data to support the research community's efforts to enhance reproducibility and make new discoveries by comparing data sets. Buckets in this registry contain public SRA data in the original (user submitted) format from select high value and newly-released studies as well as all public-access SRA formatted ETL+BQS data. Also included is all SRA metadata that can be leveraged for attribute-based data discovery.
- Size: 1.1P
- Data Source: aws s3 ls --no-sign-request s3://sra-pub-src-1/

4. UCSC Genome Browser Sequence and Annotations
- The UCSC Genome Browser is an online graphical viewer for genomes, a genome browser, hosted by the University of California, Santa Cruz (UCSC). The interactive website offers access to genome sequence data from a variety of vertebrate and invertebrate species and major model organisms, integrated with a large collection of aligned annotations. This dataset is a copy of the MySQL tables in MyISAM binary and tab-sep format and all binary files in custom formats, sometimes referred to as 'gbdb'-files. Data from the UCSC Genome Browser is free and open for use by anyone. However, every genome annotation track has been created by an academic research group, or, in a few cases, by commercial companies. Please acknowledge them by citing them. The information can be found by going to, selecting the respective genome assembly and clicking on the data track. At the end of the documentation, we provide a list of references and acknowledgements.
- Size:  81.7 TiB
- Data Source:  aws s3 ls --no-sign-request s3://genome-browser/
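
For a rough sense of scale, the four listed sizes can be summed into a single-copy total. This is a minimal sketch, not part of the original application: it assumes that "P" and "T" mean binary units (PiB and TiB, with 1 P = 1024 T), which the application does not state, and the helper names `to_tib` and `total_tib` are hypothetical.

```python
# Sizes copied from the application text above.
# ASSUMPTION: "P"/"T" are binary units (1 P = 1024 T); the application does not say.
DATASET_SIZES = {
    "ENCODE": "1.2P",
    "4DN": "180T",
    "SRA": "1.1P",
    "UCSC Genome Browser": "81.7 TiB",
}

# Conversion factors to TiB for the unit spellings used above.
UNIT_TO_TIB = {"T": 1.0, "TIB": 1.0, "P": 1024.0, "PIB": 1024.0}


def to_tib(size: str) -> float:
    """Convert a size string such as '1.2P' or '81.7 TiB' to TiB."""
    text = size.replace(" ", "").upper()
    # Walk past the numeric prefix; the remainder is the unit suffix.
    idx = 0
    while idx < len(text) and (text[idx].isdigit() or text[idx] == "."):
        idx += 1
    return float(text[:idx]) * UNIT_TO_TIB[text[idx:]]


def total_tib(sizes: dict) -> float:
    """Sum all dataset sizes, expressed in TiB."""
    return sum(to_tib(value) for value in sizes.values())


if __name__ == "__main__":
    total = total_tib(DATASET_SIZES)
    print(f"One copy: {total:.1f} TiB ({total / 1024:.2f} PiB)")
```

Multiplying the single-copy total by the planned number of replicas gives a figure to compare against the total DataCap requested.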

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (City and Country)


If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?


If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

I am not entirely certain; parts of it may already exist within the Filecoin network. However, what I can confirm is that the data does not overlap with the DC that our node has already sealed.

Please share a sample of the data

1. Encyclopedia of DNA Elements (ENCODE)
- Data Source:  aws s3 ls --no-sign-request s3://encode-public/

2. 4D Nucleome (4DN)
- Data Source: s3://4dn-open-data-public/

3. NIH NCBI Sequence Read Archive (SRA)
- Data Source: aws s3 ls --no-sign-request s3://sra-pub-src-1/

4. UCSC Genome Browser Sequence and Annotations
- Data Source:  aws s3 ls --no-sign-request s3://genome-browser/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data


For how long do you plan to keep this dataset stored on Filecoin


In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

f02228866 - Feige IT, Tokyo, JP
f01761579 - Feige IT, Hangzhou, CN
f01315130 - Ouruan IT, Chengdu, CN
f02240216 - Lianxing storage, Tokyo, JP
f0673990 - Dayan IT, Hangzhou, CN

How do you plan to make deals to your storage providers

Boost client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline


large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

Sunnyiscoming commented 1 year ago

This information will be reviewed by Fil+ Governance team to confirm validity and then the application will be triggered for notary review. Let us know if you have any questions.

jyma commented 1 year ago


Have you prepared enough tokens for the sector pledge?

Yes. My partners and I have diligently secured the requisite pledge tokens. Moreover, most of the Miner IDs I've listed have recently been undergoing large-scale DC sealing, which you can verify in the link below: f02228866; f01315130; f02240216

Best practice for storing large datasets includes, ideally, storing it in 3 or more regions, with 4 or more storage provider operators or owners. You should list the Miner ID, Business Entity, and Location of the SPs you will cooperate with.

Got it. I will strictly adhere to the LDN guidelines. As of now, I can list the following SPs. I will continue to search for more SPs to seal with in the future.

f02228866 - Feige IT, Tokyo, JP
f01761579 - Feige IT, Hangzhou, CN
f01315130 - Ouruan IT, Chengdu, CN
f02240216 - Lianxing storage, Tokyo, JP
f0673990 - Dayan IT, Hangzhou, CN
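
The best practice quoted above (3 or more regions, 4 or more operators) can be checked against this list with a small count. This is only a sketch: the thread does not define what counts as a "region", so the code reports distinct cities and distinct countries separately, and the function name `summarize` is hypothetical.

```python
# (miner ID, operator, city, country) as listed in the application.
SPS = [
    ("f02228866", "Feige IT", "Tokyo", "JP"),
    ("f01761579", "Feige IT", "Hangzhou", "CN"),
    ("f01315130", "Ouruan IT", "Chengdu", "CN"),
    ("f02240216", "Lianxing storage", "Tokyo", "JP"),
    ("f0673990", "Dayan IT", "Hangzhou", "CN"),
]


def summarize(sps):
    """Count distinct operators, cities, and countries in an SP list."""
    operators = {operator for _, operator, _, _ in sps}
    cities = {(city, country) for _, _, city, country in sps}
    countries = {country for _, _, _, country in sps}
    return {
        "operators": len(operators),
        "cities": len(cities),
        "countries": len(countries),
    }


if __name__ == "__main__":
    print(summarize(SPS))
```

Whether the list satisfies the "3 or more regions" guideline depends on which of the two location counts a reviewer treats as the region.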

Per the for Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant and also please add the contact information of the SP entities you are working with to store copies of the data.

I have completed the Fil+ registration form. Thanks for the reminder.

herrehesse commented 1 year ago

@jyma Welcome to FIL+!

I'd like to ask you why you selected the datasets:

  1. Encyclopedia of DNA Elements (ENCODE)
  2. 4D Nucleome (4DN)
  3. NIH NCBI Sequence Read Archive (SRA)
  4. UCSC Genome Browser Sequence and Annotations

Love to understand. Most of these sets are already stored multiple times so I am wondering why these add additional value from your request.

Thank you!

jyma commented 1 year ago

@herrehesse Hi~

Thank you for your inquiry regarding the selection of specific datasets, including ENCODE, 4DN, NCBI SRA, and UCSC Genome Browser. I understand your concern about the availability of these datasets elsewhere. Allow me to provide a concise overview of the rationale behind their inclusion:

  1. ENCODE: Despite its widespread availability, ENCODE's ongoing updates and high-quality data remain valuable for researchers requiring up-to-date insights into genome functionality.
  2. 4DN: 4DN offers unique 3D genome insights that complement linear sequence data, shedding light on spatial gene arrangements within the nucleus and their impact on gene regulation.
  3. NCBI SRA: The curated and diverse collection of raw sequencing data in NCBI SRA facilitates hypothesis generation, validation, and exploration across various genomic disciplines.
  4. UCSC Genome Browser: While similar resources exist, the user-friendly interface and track overlay capabilities of UCSC Genome Browser simplify genome visualization and interpretation.

Furthermore, I would like to confirm that the selected datasets do not overlap with the DC already sealed on our nodes. This ensures that we are adding unique value to our existing data resources.

Feel free to reach out if you need further clarification.

Best regards.

herrehesse commented 1 year ago

@jyma Appreciate your response. While you've shed light on the nature of the sets, you haven't clarified why there's a need to store these sets which seem to already exist multiple times on this network:

[Screenshot: 2023-08-21 09:19:02]

Additionally, as per the discussions and consensus reached here:, merged datacap requests are not permissible.

I can't endorse an applicant aiming to store sets that have been stored repeatedly, especially when done through a merged request.

I'd suggest exploring other public datasets for this network.

Thank you.

jyma commented 1 year ago

@herrehesse I understand your concerns about the redundant storage of the "ENCODE" dataset, as I expressed in my previous response. It's possible that some of the data I've requested might already be on the Filecoin network, but the dataset I'm requesting hasn't been stored on our nodes.

Additionally, in the screenshot you shared, only a minor portion of the data I've requested has been stored on the Filecoin network.

Regarding the filecoin-project/notary-governance#832 discussion you mentioned: is it part of the guidelines for LDN applications? If it is, I will adhere to it. If not, I believe we shouldn't be discussing it here.

Thank you.

ghost commented 1 year ago

Confirming SP Entities submitted:

f02228866 - Feige IT, Tokyo, JP
f01761579 - Feige IT, Hangzhou, CN
f01315130 - Ouruan IT, Chengdu, CN
f02240216 - Lianxing storage, Tokyo, JP
f0673990 - Dayan IT, Hangzhou, CN

jyma commented 1 year ago

@Filplus-govteam Yes, I hereby confirm the SP entities provided.

Thank you.

Sunnyiscoming commented 1 year ago

Datacap Request Trigger

Total DataCap requested


Expected weekly DataCap usage rate


Client address


large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address


Client address


DataCap allocation requested




cryptowhizzard commented 1 year ago


f02228866 is located in US, Virginia
f01761579 is located in CN, Zhejiang, Jiaxing
f01315130 is located in CN, Sichuan, Meishan
f02240216 is located in US, Virginia
f0673990 is located in CN, Zhejiang, Ningbo; not online (not reachable)

jyma commented 1 year ago


f0673990 is currently using the market module, and we will switch to Boost before sealing.

Thank you.

cryptowhizzard commented 1 year ago

What about the locations not matching what you stated?

Miner is not open for connection btw.

[Screenshot: 2023-08-23 19:29:00]

jyma commented 1 year ago


f0673990 underwent an IP address change about half a year ago. It appears that the on-chain IP address is still the old address. On the Filfox browser, we can also see that we have already changed the IP address.


For f02240216 and f02228866, the IP results I queried on the server are in Tokyo. I'm not sure where the issue lies; perhaps it's due to differences in the IP lookup websites we are using.


As for the other nodes, the locations you queried are very close to the ones I provided on the map. The administrative division of the node's location that I provided is accurate. The slight differences might be due to the IP lookup website you're using not being precise enough for Chinese IP addresses in terms of administrative regions.

Taking the public IP of f01761579 as an example, I can obtain the accurate administrative region (Zhejiang, Hangzhou) using a domestic Chinese IP lookup website, while your result shows Zhejiang, Jiaxing.


Such discrepancies between the information provided by IP lookup websites and the actual administrative divisions are likely quite common. If this poses a problem, I suggest we use a unified IP lookup website, so there won't be any discrepancies.

Thank you.

zcfil commented 1 year ago


filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report[^1]

No application info found for this issue on

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

cryptowhizzard commented 1 year ago


Thanks for the clarification. Please note my remark that the miner is not open for connection / retrieval.

jyma commented 1 year ago


Thank you for the reminder. We fixed the issue earlier, but due to my oversight, I didn't post a screenshot.

IMAGE 2023-08-25 02:05:24

IMAGE 2023-08-25 02:05:16

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

-- Commented by Stale Bot.

jyma commented 1 year ago


SuperChaiChai commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network




Datacap Allocated


Signer Address




You can check the status of the message here:


luobin544 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network




Datacap Allocated


Signer Address




You can check the status of the message here:

github-actions[bot] commented 12 months ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

-- Commented by Stale Bot.

github-actions[bot] commented 11 months ago

This application has not seen any responses in the last 14 days, so for now it is being closed. Please feel free to contact the Fil+ Gov team to re-open the application if it is still being processed. Thank you!

-- Commented by Stale Bot.

jyma commented 11 months ago


filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

kevzak commented 11 months ago

SPs taking Deals:

f02823036 (new) | Tsuen Wan, Tsuen Wan, HK; ACT International Telecom Limited | 49.50 TiB | 9.93% | 49.50 TiB | 0.00%
f01315130 | Chengdu, Sichuan, CN; CHINA UNICOM China169 Backbone | 149.59 TiB | 30.02% | 149.59 TiB | 0.00%
f02240216 | Tokyo, Tokyo, JP; PCCW Global, Inc. | 149.59 TiB | 30.02% | 149.59 TiB | 0.00%
f02228866 | Tokyo, Tokyo, JP; PCCW Global, Inc. | 149.56 TiB | 30.02% | 149.50 TiB | 0.04%

SPs listed in application:

f02228866 - Feige IT, Tokyo, JP
f01761579 - Feige IT, Hangzhou, CN
f01315130 - Ouruan IT, Chengdu, CN
f02240216 - Lianxing storage, Tokyo, JP
f0673990 - Dayan IT, Hangzhou, CN
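
The comparison between the two lists above comes down to two set differences. The following is a minimal sketch using the miner IDs from this comment; the function name `compare_sps` is hypothetical and is not part of any Fil+ tooling.

```python
# Miner IDs copied from the two lists in this comment.
LISTED_IN_APPLICATION = {
    "f02228866", "f01761579", "f01315130", "f02240216", "f0673990",
}
TAKING_DEALS = {"f02823036", "f01315130", "f02240216", "f02228866"}


def compare_sps(listed, taking_deals):
    """Return (unlisted SPs taking deals, listed SPs with no deals), both sorted."""
    unlisted = sorted(taking_deals - listed)
    no_deals = sorted(listed - taking_deals)
    return unlisted, no_deals


if __name__ == "__main__":
    unlisted, no_deals = compare_sps(LISTED_IN_APPLICATION, TAKING_DEALS)
    print("Taking deals but not listed:", unlisted)
    print("Listed but not taking deals:", no_deals)
```

With the IDs above, one SP is taking deals without appearing in the application, and two listed SPs have not taken any deals.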


kernelogic commented 11 months ago

@jyma could you explain the new SP f02823036 not on the list?


jyma commented 11 months ago

@kernelogic Our application process has been taking too long, and some of our partners have made changes to their sealing plans. f02823036 is the new node we've added. The information for f02823036 is as follows: f02823036 - GX, Tsuen Wan, HK

spaceT9 commented 11 months ago

The explanation looks fair. Do you have any plans to increase the HTTP retrieval rate?

jyma commented 11 months ago

@spaceT9 We are working on sealing as quickly as possible, and we may start using HTTP retrieval during the next Lotus upgrade. Thank you.

kernelogic commented 11 months ago

Sounds good, willing to support.

kernelogic commented 11 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network




Datacap Allocated


Signer Address




You can check the status of the message here:


zcfil commented 11 months ago

Scrolled through the history and checked out the bot report, looks good, willing to support this round

zcfil commented 11 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network




Datacap Allocated


Signer Address




You can check the status of the message here:

jyma commented 11 months ago

Add new SPs (ID - SP, region):

f02301 - topblocks, Santa Clara, US
f03223 - topblocks, Santa Clara, US
f0240185 - topblocks, Santa Clara, US
f0143858 - topblocks, Santa Clara, US
f02818480 - SJX, Kentucky, US
f02817832 - SJX, Kentucky, US
f02823036 - GX, Tsuen Wan, HK

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Request number 3

Multisig Notary address


Client address


DataCap allocation requested




jyma commented 11 months ago


filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.
