filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Commoncrawl(3/3) #2302

Open nicelove666 opened 5 months ago

nicelove666 commented 5 months ago

Data Owner Name

Commoncrawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://commoncrawl.org/

Social Media

https://commoncrawl.org/

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

2.5PiB

Number of replicas to store

6

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2287
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2204

Describe the data being stored onto Filecoin

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

China

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

We use a script to package the files, originally stored on an nginx file server, into tar files. Each tar file is kept to around 17-30G. Each tar file is then converted into a CAR file. After the conversion is complete, a record of the CAR file and the metadata of its source files is stored in our local system for later queries.
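The packaging step described above could be sketched roughly as follows. This is only an illustration, not the preparer's actual script: the names `plan_tar_batches` and `write_tar`, the 30 GiB cap (the answer says "17-30G" without specifying GB or GiB), and the greedy grouping strategy are all assumptions. The tar-to-CAR conversion is left out because the answer does not name the tool used for it.

```python
import tarfile

GIB = 1024 ** 3
MAX_BATCH_BYTES = 30 * GIB  # assumed upper bound, taken from the "17-30G" range


def plan_tar_batches(files, max_bytes=MAX_BATCH_BYTES):
    """Greedily group (path, size_in_bytes) pairs into batches of at most max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for path, size in files:
        # Close the current batch if adding this file would exceed the cap.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append((path, size))
        current_size += size
    if current:
        batches.append(current)
    return batches


def write_tar(batch, out_path):
    """Write one batch to an uncompressed tar file and return a metadata record."""
    with tarfile.open(out_path, "w") as tar:
        for path, _size in batch:
            tar.add(str(path))
    # The record maps the archive back to its source files for later lookup.
    return {"tar": str(out_path), "sources": [str(p) for p, _ in batch]}
```

Planning the batches separately from writing them makes it possible to check the resulting archive sizes before any data is copied.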

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

This website has a lot of data. As far as I know, no one has systematically stored all of it on the Filecoin network.

Please share a sample of the data

https://commoncrawl.org/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

Sunnyiscoming commented 5 months ago

Please provide ID, City, Country, Organization of each SP here.

nicelove666 commented 5 months ago
| Provider | Location | SP Entity or Personal |
| --- | --- | --- |
| f02199203 | Inner Mongolia | Richard |
| f02223170 | HK | tianyou |
| f02831201 | GuangDong | Juwu Mine |
| f02824157 | BeiJing | zhongchuangyun |

These are the SPs we cooperate with. Around January 15th, we will add 5-7 SPs from Japan, Vietnam and Hong Kong. Once they are online, we will list them. Thank you.

Sunnyiscoming commented 4 months ago

Hello. Per https://github.com/filecoin-project/notary-governance/issues/922, which applies to Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant, and please also add the contact information of the SP entities you are working with to store copies of the data.

This information will be reviewed by the Fil+ Governance team to confirm its validity, and then the application will be allowed to move forward for additional notary review.

Sunnyiscoming commented 4 months ago

SP List provided: [{"providerID":"f02199203","City":"InnerMongolia","Country":"China","SPOrg":"Richard"}, {"providerID":"f02223170","City":"HK","Country":"China","SPOrg":"tianyou"}, {"providerID":"f02831201","City":"GuangDong","Country":"China","SPOrg":"JuwuMine"}, {"providerID":"f02824157","City":"BeiJing","Country":"China","SPOrg":"zhongchuangyun"}]
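To check a list like this programmatically, the entries need to be valid JSON (a colon between each key and its value, no trailing comma). The sketch below is an illustration only: the function name `summarize_locations` and the choice of required fields are assumptions, not part of any Fil+ tooling. It parses the four entries from this comment and counts providers per country.

```python
import json
from collections import Counter

# The four providers from the comment above, written as valid JSON.
SP_LIST_JSON = """
[
  {"providerID": "f02199203", "City": "InnerMongolia", "Country": "China", "SPOrg": "Richard"},
  {"providerID": "f02223170", "City": "HK", "Country": "China", "SPOrg": "tianyou"},
  {"providerID": "f02831201", "City": "GuangDong", "Country": "China", "SPOrg": "JuwuMine"},
  {"providerID": "f02824157", "City": "BeiJing", "Country": "China", "SPOrg": "zhongchuangyun"}
]
"""

REQUIRED_FIELDS = {"providerID", "City", "Country", "SPOrg"}  # assumed schema


def summarize_locations(raw):
    """Parse the SP list, check each entry has every field, and count SPs per country."""
    providers = json.loads(raw)
    for entry in providers:
        missing = REQUIRED_FIELDS - entry.keys()
        if missing:
            raise ValueError(f"{entry.get('providerID', '?')} is missing {sorted(missing)}")
    return Counter(entry["Country"] for entry in providers)


print(summarize_locations(SP_LIST_JSON))
```

With the list as posted, all four providers fall under a single country, which is the kind of concentration a reviewer would want to see before the additional SPs are added.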

nicelove666 commented 4 months ago
WX20240109-112150@2x

We have submitted it, thank you.

nicelove666 commented 4 months ago

https://www.ipqualityscore.com/user/search is a public, well-known and unbiased IP geolocation checker. I paid to check the SPs we cooperate with, and the results show that their locations are genuine:

f02199203 116.136.130.130
f02824157 116.172.66.38
f02824140 116.172.66.38
f02841613 210.209.77.161
f02831202 14.29.124.50
f0122215 119.167.140.136

Detection method: find the IP address corresponding to the SP in Boost, enter it, and view the detection results. If an SP uses a VPN, the detection score may be greater than 70 points. A score of 0 points means no fraud was detected, which shows that the SP's address is genuine.

nicelove666 commented 4 months ago
WX20240111-145600@2x WX20240111-145455@2x
nicelove666 commented 4 months ago
WX20240111-144512@2x WX20240111-145015@2x WX20240111-145537@2x

WX20240111-182502@2x

nicelove666 commented 4 months ago

Could you help us move this forward? Thank you. @Sunnyiscoming

nicelove666 commented 4 months ago

The application has been open for two weeks but still hasn't been approved. In the meantime our cooperating SPs have changed, so here is the updated list:

f02199203 Richard Nei Mongol(Inner Mongolia) 116.136.130.130

WX20240116-160041@2x

f02824157 zhongchuangyun GuangDong 116.172.66.38
f02824140 zhongchuangyun GuangDong 116.172.66.38

2@2x

f02831202 Juwu Mine GuangDong 14.29.124.50

WX20240116-160408@2x

f0122215 SuSuanYun ShanDong 119.167.140.136

WX20240116-160454@2x
nicelove666 commented 4 months ago

This clearly shows the location of each SP. The evidence shows that the SPs we cooperate with are honest, and we hope to get your approval. @Sunnyiscoming @Filplus-govteam @galen-mcandrew @Kevin-FF-USA @clriesco

nicelove666 commented 4 months ago

Please tell me, what else do I need to do? @Sunnyiscoming

large-datacap-requests[bot] commented 4 months ago

Deleting comment

@Sunnyiscoming hasn't the permissions to post this comment.

Please, contact the assignee of this issue.

Sunnyiscoming commented 4 months ago

Datacap Request Trigger

Total DataCap requested

15PiB

Expected weekly DataCap usage rate

1PiB

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

large-datacap-requests[bot] commented 4 months ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

ipollo00 commented 4 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1n5wlrrhoxpkgwij25xrtt7w7g2k3fhbthmdn6ri

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacectswk4dafjr6af3r5yuizh7hbc4oimxjdup67cn35ti32cgibpj2

SuperChaiChai commented 4 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceazuzdny6ieipf5mg7gczzggk3tfspo7eqpkdvzp4ncxld64rufzw

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f12mckci3omexgzoeosjvstcfxfe4vqw7owdia3da

Id

af9739bf-5ed7-4e71-a41a-9703387d3d7c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceazuzdny6ieipf5mg7gczzggk3tfspo7eqpkdvzp4ncxld64rufzw

large-datacap-requests[bot] commented 4 months ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

d029cae8-cd31-43e7-a662-5bf32b3fdcda

nicelove666 commented 4 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 4 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

1ane-1 commented 4 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaced7b7bbm2bkzhsbjfcjeismiqilyfnkuzynhhckx233ya7dsaduz6

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1mdk7s2vntzm6hu35yuo6vjubtrpfnb2awhgvrri

Id

d029cae8-cd31-43e7-a662-5bf32b3fdcda

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaced7b7bbm2bkzhsbjfcjeismiqilyfnkuzynhhckx233ya7dsaduz6

mikezli commented 4 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceadklavzjgpkxq7doxuxqmffy6xrmm36osm2jx6tmw5b7zk4tj6re

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1dnb3uz7sylxk6emti3ififcvu3nlufnnsjui6ea

Id

d029cae8-cd31-43e7-a662-5bf32b3fdcda

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceadklavzjgpkxq7doxuxqmffy6xrmm36osm2jx6tmw5b7zk4tj6re

large-datacap-requests[bot] commented 4 months ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

512TiB

Id

edd783b6-a019-4a56-9dac-f19f3e94026a

AlanGreaterheat commented 4 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 4 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

AlanGreaterheat commented 4 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecvo5lwxhrielwv6j6aed2orwhjvovc7cqqjpiigwwl4sjbuppcag

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1pnmzlxj7cfeo2v6oj5nco46hkg2l46wj7o4xxui

Id

edd783b6-a019-4a56-9dac-f19f3e94026a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecvo5lwxhrielwv6j6aed2orwhjvovc7cqqjpiigwwl4sjbuppcag

Normalnoise commented 4 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacecyzpjjrkm7pzvjphldae7xx6zuwrfrkzmagvrucqt7gnq5ntlxww

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

512.00TiB

Signer Address

f1c5non5yf35avgcpsqvxu4yj54yyvxorwyjochqq

Id

edd783b6-a019-4a56-9dac-f19f3e94026a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecyzpjjrkm7pzvjphldae7xx6zuwrfrkzmagvrucqt7gnq5ntlxww

large-datacap-requests[bot] commented 4 months ago

DataCap Allocation requested

Request number 3

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

1PiB

Id

9a7cfd96-e241-4712-98c6-54a9708ba57f

Aaron01230 commented 4 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebzyn3jc3xoghq5syfbuzdhccyfsge4jltrjeff6kar4bgy7lpp4k

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

1.00PiB

Signer Address

f1xrnysd4gimg64d4l6qi7ulzwwq22c6vfg6lpw3i

Id

9a7cfd96-e241-4712-98c6-54a9708ba57f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzyn3jc3xoghq5syfbuzdhccyfsge4jltrjeff6kar4bgy7lpp4k

kernelogic commented 4 months ago

Please take care to distribute the data to SPs outside of GCR as well; otherwise LGTM.

kernelogic commented 4 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebrh4vm5ogeusikpjhlz7abe2ofxswj224fn7buu6cwezpbemprb2

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

1.00PiB

Signer Address

f1yjhnsoga2ccnepb7t3p3ov5fzom3syhsuinxexa

Id

9a7cfd96-e241-4712-98c6-54a9708ba57f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebrh4vm5ogeusikpjhlz7abe2ofxswj224fn7buu6cwezpbemprb2

nicelove666 commented 4 months ago

512TiB appeared three times and 1PiB once. Will the next round be 2PiB? @clriesco

clriesco commented 4 months ago

Yes, next round will be 2PiB. There were 2 "requests number 2" due to node not being updated fast enough.

nicelove666 commented 4 months ago

> Yes, next round will be 2PiB. There were 2 "requests number 2" due to node not being updated fast enough.

Got it, thanks. Have a good day.
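As a quick tally of the rounds approved up to this point in the thread (three of 512TiB and one of 1PiB, against the 15PiB requested), the arithmetic can be sketched as below. The variable and function names are illustrative only, and the sketch does not model how the bot chooses the size of the next round.

```python
TIB_PER_PIB = 1024

# Allocations approved so far in this thread, in TiB.
approved_tib = [512, 512, 512, 1 * TIB_PER_PIB]
total_requested_tib = 15 * TIB_PER_PIB


def summarize(approved, total):
    """Return (granted, remaining) in TiB."""
    granted = sum(approved)
    return granted, total - granted


granted, remaining = summarize(approved_tib, total_requested_tib)
print(f"granted:   {granted / TIB_PER_PIB:.2f} PiB")    # granted:   2.50 PiB
print(f"remaining: {remaining / TIB_PER_PIB:.2f} PiB")  # remaining: 12.50 PiB
```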

nicelove666 commented 3 months ago

Why hasn’t the bot triggered the next round of signature requests yet? @clriesco

nicelove666 commented 3 months ago
WX20240204-152911@2x
large-datacap-requests[bot] commented 3 months ago

DataCap Allocation requested

Request number 5

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

2PiB

Id

bb0d8129-82dc-4f40-9ee6-ebce36fbbb61

filplus-checker-app[bot] commented 3 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 41.30% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.
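The report does not show how the replication warning is computed. As a rough illustration only, a figure like "41.30% of deals are for data replicated across less than 4 storage providers" could be derived along the following lines. The function name `low_replication_share`, the `(piece_cid, provider_id, size_bytes)` record layout, and weighting by deal size are all assumptions, not the checker's actual implementation.

```python
from collections import defaultdict


def low_replication_share(deals, min_providers=4):
    """Percentage of deal bytes whose piece is held by fewer than min_providers SPs.

    deals: list of (piece_cid, provider_id, size_bytes) tuples (hypothetical layout).
    """
    # First pass: collect the distinct providers storing each piece.
    providers_by_piece = defaultdict(set)
    for piece_cid, provider_id, _size in deals:
        providers_by_piece[piece_cid].add(provider_id)

    # Second pass: sum deal sizes, flagging pieces with too few providers.
    total = low = 0
    for piece_cid, _provider_id, size in deals:
        total += size
        if len(providers_by_piece[piece_cid]) < min_providers:
            low += size
    return 100.0 * low / total if total else 0.0


example_deals = [
    ("pieceA", "sp1", 1), ("pieceA", "sp2", 1), ("pieceA", "sp3", 1), ("pieceA", "sp4", 1),
    ("pieceB", "sp1", 2), ("pieceB", "sp2", 2),
]
print(low_replication_share(example_deals))  # 50.0
```

In the made-up example, pieceA is stored with four providers and pieceB with only two, so half of the deal bytes fall under the threshold.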

joshua-ne commented 3 months ago

Looks good to me. Will support this round but expect to see some improvement in the number of SPs.

joshua-ne commented 3 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebo43if73yqon262i62yd3a5wmg6wlahf3uhtm4rmdthm2ttdchbs

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

2.00PiB

Signer Address

f1xzff5xup63o5sygr2swp4zvcajg54lotliimdty

Id

bb0d8129-82dc-4f40-9ee6-ebce36fbbb61

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebo43if73yqon262i62yd3a5wmg6wlahf3uhtm4rmdthm2ttdchbs

ipfscn commented 3 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedbuce6pf6wm2a6fuuf4gvhdszd3fpvftwlgz3eqssqvpkw3ln7zi

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

2.00PiB

Signer Address

f1j4n74chme7whbz3yls4a7ixqewb6dijypqg2a3a

Id

bb0d8129-82dc-4f40-9ee6-ebce36fbbb61

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedbuce6pf6wm2a6fuuf4gvhdszd3fpvftwlgz3eqssqvpkw3ln7zi

large-datacap-requests[bot] commented 3 months ago

DataCap Allocation requested

Request number 5

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

2PiB

Id

5aa1deda-5338-406b-bb46-8668cf6b8517

zcfil commented 3 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 3 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 40.89% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

zcfil commented 3 months ago

I browsed through the history and checked the bot reports; they are trending in a healthy direction. Willing to support this round.

zcfil commented 3 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceajkmk5oycvlaj3sdmbudieerbn6c25mixlzsdguvxmwlrlhm3hro

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

2.00PiB

Signer Address

f1cjzbiy5xd4ehera4wmbz63pd5ku4oo7g52cldga

Id

5aa1deda-5338-406b-bb46-8668cf6b8517

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceajkmk5oycvlaj3sdmbudieerbn6c25mixlzsdguvxmwlrlhm3hro

nj-steve commented 3 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedu4qjvl67nneux7dc3sbkwrui2kfs5i6tltbvd7oqoi7libga2ge

Address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

Datacap Allocated

2.00PiB

Signer Address

f1xx6555qijma7igpnjspyvdunc4vfxkawnpqy5ii

Id

5aa1deda-5338-406b-bb46-8668cf6b8517

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedu4qjvl67nneux7dc3sbkwrui2kfs5i6tltbvd7oqoi7libga2ge

nj-steve commented 3 months ago

Please pay attention to 'Deal Data Replication'.

large-datacap-requests[bot] commented 3 months ago

DataCap Allocation requested

Request number 7

Multisig Notary address

f02049625

Client address

f16euns7z5ve6xzgk2yv32eezt4uu62z4c6jiw5hy

DataCap allocation requested

2PiB

Id

32513202-21b0-41bf-9f27-51a8beb55055

Tom-OriginStorage commented 3 months ago

checker:manualTrigger