filplus-bookkeeping / DAYOU

Bookkeeping repo for Allocator #1010

[DataCap Application] Commoncrawl #3

Closed nike-mp closed 1 month ago

nike-mp commented 1 month ago

Version

1

DataCap Applicant

FileTech

Project ID

FileTech-02

Data Owner Name

CommonCrawl

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://commoncrawl.org/

Social Media Handle

https://commoncrawl.org/

Social Media Type

Slack

What is your role related to the dataset

Data Preparer

Total amount of DataCap being requested

4PiB

Expected size of single dataset (one copy)

500TiB

Number of replicas to store

8

Weekly allocation of DataCap requested

512TiB

On-chain address for first allocation

f1wy6ik2ns5oypb4yx6uhuf55hxbzihugg674jzvi

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

FileTech focuses on providing excellent data storage solutions. We have a passionate and knowledgeable team with extensive experience and expertise in the field of data storage. Whether it's data storage, data management, data recovery, or data center design and construction, we possess abundant technical capabilities and solutions.

At FileTech, we understand the importance of data in modern businesses. We not only offer high-performance data storage devices and solutions, but also provide comprehensive data management tools to help clients efficiently organize, classify, and protect their data assets.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

United States

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

No response

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

No response

Please share a sample of the data

357-resule 
2020-08-25 17:22:47  398.4 MiB hrrr_v2.20160823/conus/hrrr.t09z.wrfprsf13.grib2
2020-08-25 17:22:52  396.0 MiB hrrr_v2.20160823/conus/hrrr.t09z.wrfprsf14.grib2
2020-08-25 17:22:52  394.6 MiB hrrr_v2.20160823/conus/hrrr.t09z.wrfprsf15.grib2
2020-08-25 17:22:52  390.2 MiB hrrr_v2.20160823/conus/hrrr.t09z.wrfprsf16.grib2
2020-08-25 17:23:08  387.1 MiB hrrr_v2.20160823/conus/hrrr.t09z.wrfprsf17.grib2
2020-08-25 17:23:05  384.8 MiB hrrr_v2.20160823/conus/hrrr.t09z.wrfprsf18.grib2
2021-09-28 03:48:22   31.6 KiB index.html

Total Objects: 43282174
   Total Size: 2.1 PiB

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), IPFS, Shipping hard drives, Lotus built-in data transfer

How did you find your storage providers

Slack, Partners

If you answered "Others" in the previous question, what is the tool or platform you used

No response

Please list the provider IDs and location of the storage providers you will be working with.

f03071472 - HK
f03028318 - HK
f02830476 - Guangdong
f03028326 - Guangdong
f03028321 - HK
f03028325 - HK
f03028310 - HK
f03028315 - HK

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes
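As a quick sanity check on the figures in the form above, the requested total can be compared with the single-copy size multiplied by the replica count. This is only a sketch: binary units (1 PiB = 1024 TiB) are assumed, and the variable names are illustrative rather than part of any Fil+ tooling.

```python
# Figures copied from the application form; binary units are an assumption.
TIB_PER_PIB = 1024

single_copy_tib = 500               # "Expected size of single dataset (one copy)"
replicas = 8                        # "Number of replicas to store"
requested_tib = 4 * TIB_PER_PIB     # "Total amount of DataCap being requested" (4PiB)
weekly_tib = 512                    # "Weekly allocation of DataCap requested"

needed_tib = single_copy_tib * replicas       # DataCap needed to store every replica
headroom_tib = requested_tib - needed_tib     # requested amount beyond that need
weeks_to_use = requested_tib / weekly_tib     # weeks to consume the request at the weekly rate

print(needed_tib, headroom_tib, weeks_to_use)
```

Under these assumptions the request covers eight full copies with a small margin, and the weekly rate would consume the full 4PiB in eight weeks.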

datacap-bot[bot] commented 1 month ago

Application is waiting for allocator review

DaYouGroup commented 1 month ago

Thank you for applying to DAYOU for DataCap (DC). Please bear with us if the allocator system is a little rough in its early days. Before allocating DC, I would like to ask you a few questions.

Have you previously applied for DC with the same content through an earlier LDN? If so, please share the link.
Have you been allocated DC before? If so, how much did you receive?
Have you applied for DC through an allocator pathway other than this one? If so, please share the link.
Can the SPs you plan to work with start storing data as soon as the DC is allocated?

In the initial round, 50TiB is allocated by default. After that, we will look at the deployment status and compliance with FIL+ rules and proceed to the next round. Is that okay? @nike-mp

nike-mp commented 1 month ago

Hello, on behalf of the taskmaster, thank you for your reply. Although this is our first request, we have sufficient experience. Our technical staff worked at a well-known Filecoin company for three years, and we have extensive experience in downloading, processing, and storing data. Our tooling is a secondary development based on https://github.com/karust/gogetcrawl. After downloading a batch of data from Amazon Web Services S3, it automatically packages the data and generates CAR files. As long as the same download parameters are set, a CAR file with the same piece CID is downloaded and generated, so different SPs can download and seal the same file and the data replicas meet the requirements. When a CAR file is generated, the corresponding metadata is generated at the same time, including filename, data_cid, piece_cid, piece_size, and other information, and is written to a CSV file. The CSV file is the DataCap distribution file that will be used later, and its file name is the tag name and starting position of the downloaded dataset. Based on the metadata file, you can tell which portion of the total dataset is stored under each LDN. Data can be downloaded directly on the SP's device, or processed and mailed to the SP on hard disks. We are currently ready: staking, data downloads, personnel, and so on are all in place. We hope to get your help.

50TiB seems relatively small. Can you give us 300-600TiB? Because we are working with 8-10 SPs, it is difficult to split 50TiB among them in the first round. We may need 512TiB for better allocation.
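The workflow described in this comment (same download parameters in, same identifiers out, one CSV of filename, data_cid, piece_cid, and piece_size per batch) could be sketched roughly as follows. This is not the applicant's tool and does not call gogetcrawl: the function names, the example tag, the 32 GiB piece size, and the hash-based identifiers are all assumptions made for illustration. In particular, a real piece CID is derived from the CAR file contents, not from a hash of the parameters.

```python
import csv
import hashlib
import io

FIELDS = ["filename", "data_cid", "piece_cid", "piece_size"]

def placeholder_id(*parts: str) -> str:
    # Stand-in for a CID: a truncated SHA-256 of the download parameters.
    # It only demonstrates determinism; it is NOT a valid Filecoin CID.
    return hashlib.sha256("|".join(parts).encode()).hexdigest()[:16]

def build_manifest(tag: str, start: int, files_per_car: int, n_cars: int):
    """Return (manifest_name, csv_text) for one download batch."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for i in range(n_cars):
        first = start + i * files_per_car   # index of the first file in this CAR
        params = (tag, str(first), str(files_per_car))
        writer.writerow({
            "filename": f"{tag}-{first}.car",
            "data_cid": placeholder_id("data", *params),
            "piece_cid": placeholder_id("piece", *params),
            "piece_size": 32 * 2**30,       # assumed piece size of 32 GiB
        })
    # The manifest is named after the tag and the starting position.
    return f"{tag}_{start}.csv", buf.getvalue()

# Two independent runs with the same parameters (hypothetical tag) agree.
name_a, csv_a = build_manifest("CC-MAIN-2023-50", 0, 100, 3)
name_b, csv_b = build_manifest("CC-MAIN-2023-50", 0, 100, 3)
print(name_a)
print(csv_a == csv_b)
```

Because every field is a pure function of the download parameters, two SPs running the same batch would produce byte-identical manifests, which is the property the comment relies on for replica verification.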

DaYouGroup commented 1 month ago

Received your reply, thank you. I can meet your requirements and hope to keep your project running healthily! @nike-mp

datacap-bot[bot] commented 1 month ago

Datacap Request Trigger

Total DataCap requested

4PiB

Expected weekly DataCap usage rate

512TiB

DataCap Amount - First Tranche

500TiB

Client address

f1wy6ik2ns5oypb4yx6uhuf55hxbzihugg674jzvi

datacap-bot[bot] commented 1 month ago

DataCap Allocation requested

Multisig Notary address

Client address

f1wy6ik2ns5oypb4yx6uhuf55hxbzihugg674jzvi

DataCap allocation requested

500TiB

Id

e5752f3a-e142-4d0f-8c06-3ce4c2e30330

datacap-bot[bot] commented 1 month ago

Application is ready to sign

datacap-bot[bot] commented 1 month ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceb3dokovkbxarmlvjvaoybne35gpgbhvyddsv2mephpafiqwhicao

Address

f1wy6ik2ns5oypb4yx6uhuf55hxbzihugg674jzvi

Datacap Allocated

500TiB

Signer Address

f1cl4gtwjlt5udxwootzllba3kcyisjphujviiimq

Id

e5752f3a-e142-4d0f-8c06-3ce4c2e30330

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceb3dokovkbxarmlvjvaoybne35gpgbhvyddsv2mephpafiqwhicao

datacap-bot[bot] commented 1 month ago

Application is Granted

nike-mp commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 90% of total datacap - f03086293: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

nike-mp commented 1 month ago

Data is not updated

nike-mp commented 1 month ago

Our SPs are distributed across mainland China, Hong Kong, and Singapore. Each SP stores no more than 30%. Please wait for the data to be updated.
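The 30% per-SP target mentioned here, and the 90% concentration warning in the checker report, both come down to each provider's share of the total sealed size. A minimal sketch of that calculation is below; the deal sizes are hypothetical, the function names are illustrative, and this is not the checker bot's actual implementation.

```python
from collections import defaultdict

def provider_shares(deals):
    """Return each provider's fraction of the total sealed size."""
    totals = defaultdict(int)
    for provider_id, size_tib in deals:
        totals[provider_id] += size_tib
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {p: size / grand_total for p, size in totals.items()}

def providers_over(shares, limit):
    """List providers whose share is strictly above the limit."""
    return sorted(p for p, share in shares.items() if share > limit)

# Hypothetical early state: every deal so far sits with a single provider.
early = [("f03086293", 32), ("f03086293", 32)]
# Hypothetical later state: four placeholder providers with equal shares.
balanced = [("sp_a", 25), ("sp_b", 25), ("sp_c", 25), ("sp_d", 25)]

print(providers_over(provider_shares(early), 0.90))
print(providers_over(provider_shares(balanced), 0.30))
```

In the first case the single provider holds the whole total and is flagged; in the second no provider exceeds the 30% limit.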

DaYouGroup commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 90% of total datacap - f03086293: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

datacap-bot[bot] commented 1 month ago

Application is in Refill

gimims commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 90% of total datacap - f03086293: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

gimims commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 90% of total datacap - f03086293: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

nike-mp commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 90% of total datacap - f03086293: 99.57%

⚠️ 100.00% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 0.00%

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.
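The two retrieval warnings in this report describe different statistics: the fraction of providers with a zero success rate, and the average success rate. One plausible way to compute them is sketched below. It assumes an unweighted per-provider mean and a hypothetical input shape; the checker's real method is not stated in this thread.

```python
def retrieval_summary(attempts):
    """attempts maps provider -> (successful_retrievals, total_retrievals).

    Returns (share_of_providers_with_zero_rate, mean_success_rate).
    """
    rates = {
        provider: (ok / total if total else 0.0)
        for provider, (ok, total) in attempts.items()
    }
    if not rates:
        return 0.0, 0.0
    zero_share = sum(1 for r in rates.values() if r == 0.0) / len(rates)
    mean_rate = sum(rates.values()) / len(rates)
    return zero_share, mean_rate

# Hypothetical sample: no provider has served a successful retrieval yet.
all_failing = {"sp_a": (0, 10), "sp_b": (0, 5)}
# Hypothetical sample: one provider succeeds half the time.
mixed = {"sp_a": (0, 10), "sp_b": (5, 10)}

print(retrieval_summary(all_failing))
print(retrieval_summary(mixed))
```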

DaYouGroup commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 90% of total datacap - f03086293: 99.57%

⚠️ 100.00% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 0.00%

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

nike-mp commented 1 month ago

checker:manualTrigger

datacap-bot[bot] commented 1 month ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

⚠️ 100.00% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 0.00%

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

nike-mp commented 1 month ago

The data has not been updated for 10 days. We will close this application and reapply. Thank you. @DaYouGroup

gimims commented 2 weeks ago

checker:manualTrigger

gimims commented 2 weeks ago

checker:manualTrigger

datacap-bot[bot] commented 2 weeks ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

⚠️ 100.00% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 0.00%

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

datacap-bot[bot] commented 2 weeks ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

⚠️ 100.00% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 0.00%

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.