filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Commoncrawl(1/3) #2204

Closed nicelove666 closed 11 months ago

nicelove666 commented 1 year ago

Data Owner Name

Commoncrawl

What is your role related to the dataset

Dataset Preparer

Data Owner Country/Region

United States

Data Owner Industry

IT & Technology Services

Website

https://commoncrawl.org/

Social Media

https://commoncrawl.org/

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

1.5PiB

Number of replicas to store

10

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

The Common Crawl corpus contains petabytes of data, regularly collected since 2008. The corpus contains raw web page data, metadata extracts, and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

China

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

No response

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

No response

Please share a sample of the data

https://commoncrawl.org/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Monthly

For how long do you plan to keep this dataset stored on Filecoin

1 to 1.5 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Slack, Filmine

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes
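As a quick consistency check on the figures in this application, the requested total matches the single-copy size multiplied by the number of replicas. The sketch below is illustrative only: the function name is hypothetical, and the weeks estimate assumes the full weekly allocation is granted and used every week, which nothing in this thread confirms.

```python
import math


def total_datacap_pib(single_copy_pib: float, replicas: int) -> float:
    """Total DataCap needed to store `replicas` copies of one dataset."""
    return single_copy_pib * replicas


# Figures taken from the application form above.
requested_total = total_datacap_pib(1.5, 10)  # 1.5 PiB per copy, 10 replicas
weekly_allocation_pib = 1.0                   # weekly allocation requested

# Assumption: the weekly allocation is fully granted and consumed each week.
weeks_to_exhaust = math.ceil(requested_total / weekly_allocation_pib)

print(f"total: {requested_total} PiB, weeks at full weekly rate: {weeks_to_exhaust}")
```

The total agrees with the 15PiB stated in the form. The actual allocations recorded later in the thread are issued in 2PiB rounds, so the weekly figure is only an upper-level planning number.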

Aaron01230 commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

Aaron01230 commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebcchhufqt6n4rgqwyk7fe4jj5iddgk62rxn5m5v6ynvujuvtbgie

Address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Datacap Allocated

2.00PiB

Signer Address

f1xrnysd4gimg64d4l6qi7ulzwwq22c6vfg6lpw3i

Id

9f791a72-ea3e-4976-9b03-1635f11d291b

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebcchhufqt6n4rgqwyk7fe4jj5iddgk62rxn5m5v6ynvujuvtbgie

woshidama323 commented 1 year ago

LGTM, very healthy report, will support.

woshidama323 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceb6ufkvtcod3iz3rakmtl2cad7wr7halbcptec3qkzwunsgiwcgqa

Address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Datacap Allocated

2.00PiB

Signer Address

f12tk3adljauwnd3hjbigpfxb7b7gdlj63p6afwtq

Id

9f791a72-ea3e-4976-9b03-1635f11d291b

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceb6ufkvtcod3iz3rakmtl2cad7wr7halbcptec3qkzwunsgiwcgqa

herrehesse commented 1 year ago

Screenshot 2023-11-10 at 13 28 26

Self-dealing abuse continues: no distribution, and most SPs are behind VPNs. Thanks, abusive notaries!

nicelove666 commented 1 year ago

Please provide the detailed method used to detect VPNs, such as the detection website, instead of screenshots. Who developed the website, and is it trustworthy?

Sunnyiscoming commented 1 year ago

Some of the SPs in the table did not participate in deals. Some SPs outside the table did participate. @nicelove666 Can you explain that?

nicelove666 commented 12 months ago

Dear person in charge, we explained this last week and listed the SPs we cooperate with in a table.

nicelove666 commented 12 months ago

@kevzak Dear, we have listed the details of the SPs we work with in a table. Please reopen this for us. Thank you for your kindness.

WX20231109-155711@2x

herrehesse commented 12 months ago

Screenshot 2023-11-13 at 13 23 17

The complete list is inside one continent; that is not distribution. Where are the EU/USA/AUS copies?

nicelove666 commented 12 months ago

We have two SPs located in Singapore

nicelove666 commented 12 months ago

Can you send me the URL of the website you took the screenshot of?

large-datacap-requests[bot] commented 12 months ago

DataCap Allocation requested

Request number 7

Multisig Notary address

f02049625

Client address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

DataCap allocation requested

2PiB

Id

de0ab282-9f64-4ec0-bc0f-4af7c3988761

a1991car commented 12 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 12 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

laurarenpanda commented 12 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 12 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

joshua-ne commented 12 months ago

Everything looks good to me, will support this round.

joshua-ne commented 12 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebpsq7niktcjyq4r5tsig2h7wuqx6insc2tegwgyzm6v247khnsmw

Address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Datacap Allocated

2.00PiB

Signer Address

f1xzff5xup63o5sygr2swp4zvcajg54lotliimdty

Id

de0ab282-9f64-4ec0-bc0f-4af7c3988761

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebpsq7niktcjyq4r5tsig2h7wuqx6insc2tegwgyzm6v247khnsmw

laurarenpanda commented 12 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebw63s6ha7bovewkgz2vgrm57hmd4mnijnnzlwlsqtnh6knshvxlm

Address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Datacap Allocated

2.00PiB

Signer Address

f1bp3tzp536edm7dodldceekzbsx7zcy7hdfg6uzq

Id

de0ab282-9f64-4ec0-bc0f-4af7c3988761

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebw63s6ha7bovewkgz2vgrm57hmd4mnijnnzlwlsqtnh6knshvxlm

SuperChaiChai commented 12 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedmpcsbziltrihgkdx2szu3bagm5otxew6jwfftghko3lp57npb6q

Address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Datacap Allocated

2.00PiB

Signer Address

f12mckci3omexgzoeosjvstcfxfe4vqw7owdia3da

Id

de0ab282-9f64-4ec0-bc0f-4af7c3988761

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedmpcsbziltrihgkdx2szu3bagm5otxew6jwfftghko3lp57npb6q

large-datacap-requests[bot] commented 12 months ago

Aborting. Exit Code is Non 0

herrehesse commented 12 months ago

@kevzak WHY IS THIS ALLOWED TO HAPPEN? CONTINUED FRAUD. 2 PIB TO ABUSERS.

SPs taking deals:
f02760664 | Beijing, Beijing, CN; China Mobile Communications Group Co., Ltd. | 137.50 TiB | 17.59% | 137.50 TiB | 0.00%
f02199203 | Beijing, Beijing, CN; CHINA UNICOM China169 Backbone | 51.69 TiB | 6.61% | 51.69 TiB | 0.00%
f02221113 | Shenzhen, Guangdong, CN; CHINANET-BACKBONE | 80.25 TiB | 10.26% | 80.25 TiB | 0.00%
f02823562 (new) | Hong Kong, Central and Western, HK; HK Broadband Network Ltd. | 213.34 TiB | 27.29% | 213.34 TiB | 0.00%
f02812504 | Hong Kong, Central and Western, HK; HK Broadband Network Ltd. | 22.38 TiB | 2.86% | 22.38 TiB | 0.00%
f02816081 | Singapore, Singapore, SG; Zenlayer Inc | 177.84 TiB | 22.75% | 177.84 TiB | 0.00%
f02816095 | Singapore, Singapore, SG; Zenlayer Inc | 98.84 TiB | 12.64% | 98.84 TiB | 0.00%

SP provided upfront:
f02812504 | | Coffee Cloud | HK | no | Frank
f02810749 | | nebula | HK | no | John
f02810751 | | nebula | HK | no | John
f02221110 | | LianLi | China | no | Lee
f02221111 | | LianLi | China | no | Lee

One SP matched. Closing until an updated list and information are provided.
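The "one SP matched" figure can be reproduced as a set intersection between the providers that took deals and the providers disclosed upfront. This is a minimal sketch, not the tooling actually used by anyone in this thread; the variable names are hypothetical, and all provider IDs are written in the `f0...` form used elsewhere in the thread.

```python
# Provider IDs copied from the two lists in this comment.
sps_taking_deals = {
    "f02760664", "f02199203", "f02221113", "f02823562",
    "f02812504", "f02816081", "f02816095",
}
sps_provided_upfront = {
    "f02812504", "f02810749", "f02810751", "f02221110", "f02221111",
}

# Providers that appear in both lists.
matched = sps_taking_deals & sps_provided_upfront

# Providers that took deals but were not in the upfront list.
not_disclosed_upfront = sps_taking_deals - sps_provided_upfront

print("matched:", sorted(matched))
print("took deals but not listed upfront:", sorted(not_disclosed_upfront))
```

The same comparison can be rerun against any later SP list the client provides, simply by swapping the second set.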

herrehesse commented 12 months ago

I hope you are proud of cheating! @laurarenpanda & @SuperChaiChai

nicelove666 commented 11 months ago

Looks like a bug in the bot.

nicelove666 commented 11 months ago

@kevzak WHY IS THIS ALLOWED TO HAPPEN? CONTINUED FRAUD. 2 PIB TO ABUSERS.

SPs taking deals:
f02760664 | Beijing, Beijing, CN; China Mobile Communications Group Co., Ltd. | 137.50 TiB | 17.59% | 137.50 TiB | 0.00%
f02199203 | Beijing, Beijing, CN; CHINA UNICOM China169 Backbone | 51.69 TiB | 6.61% | 51.69 TiB | 0.00%
f02221113 | Shenzhen, Guangdong, CN; CHINANET-BACKBONE | 80.25 TiB | 10.26% | 80.25 TiB | 0.00%
f02823562 (new) | Hong Kong, Central and Western, HK; HK Broadband Network Ltd. | 213.34 TiB | 27.29% | 213.34 TiB | 0.00%
f02812504 | Hong Kong, Central and Western, HK; HK Broadband Network Ltd. | 22.38 TiB | 2.86% | 22.38 TiB | 0.00%
f02816081 | Singapore, Singapore, SG; Zenlayer Inc | 177.84 TiB | 22.75% | 177.84 TiB | 0.00%
f02816095 | Singapore, Singapore, SG; Zenlayer Inc | 98.84 TiB | 12.64% | 98.84 TiB | 0.00%

SP provided upfront:
f02812504 | | Coffee Cloud | HK | no | Frank
f02810749 | | nebula | HK | no | John
f02810751 | | nebula | HK | no | John
f02221110 | | LianLi | China | no | Lee
f02221111 | | LianLi | China | no | Lee

One SP matched. Closing until an updated list and information are provided.

We have already explained this and listed the cooperating SPs.

nicelove666 commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

nicelove666 commented 11 months ago

We have brought at least 60P of DC to Filecoin, and we are Filecoin enthusiasts with a social reputation. Faced with continued harm and slander, I hope that upright officials can look at things soberly. In fact, we have been hurt many times, we have suffered many slanders, and we are slowly being let down. I think we should be treated fairly. @kevzak @jbenet @dkkapur @Kevin-FF-USA

herrehesse commented 11 months ago

@nicelove666 Most of that 60P is unretrievable, unindexed, self-dealing DataCap for the self-enrichment of (your) SPs. NOT for the benefit of this community. Stop acting like you are the victim here.

nicelove666 commented 11 months ago

Let me make it clear once again that I have never attacked or hurt you. I think the courtesy I gave you is enough. But you slandered me so many times that I had to fight back.

First, like you, I am a contributor to FIL+.

Second, like you, I use Public Dataset as the application subject.

Third, the difference between me and you is:

a. I have basically never had CID sharing; CID sharing does not exceed 1% of the 60P. As for you, 4.9P of your 5P in applications involves CID sharing. It is all terrible duplicate data; even your 2008 still has CID sharing.

b. Many of your large LDNs have retrieval rates below 1%, especially LDNs numbered below 1500. My LDNs are basically all above 1%.

c. My LDNs and data can be downloaded and viewed. I believe that official personnel have also downloaded and verified my data many times. For 2204, we also provide a download link for the data: https://send.datasetcreators.com/.

Once again, I hope you will stop your invalid attacks, otherwise, I can only treat you the same way you treat me.

nicelove666 commented 11 months ago

It seems that there is something wrong with the bot and it has not been triggered yet.

herrehesse commented 11 months ago

@kevzak WHY IS THIS ALLOWED TO HAPPEN? CONTINUED FRAUD. 2 PIB TO ABUSERS.

SPs taking deals:
f02760664 | Beijing, Beijing, CN; China Mobile Communications Group Co., Ltd. | 137.50 TiB | 17.59% | 137.50 TiB | 0.00%
f02199203 | Beijing, Beijing, CN; CHINA UNICOM China169 Backbone | 51.69 TiB | 6.61% | 51.69 TiB | 0.00%
f02221113 | Shenzhen, Guangdong, CN; CHINANET-BACKBONE | 80.25 TiB | 10.26% | 80.25 TiB | 0.00%
f02823562 (new) | Hong Kong, Central and Western, HK; HK Broadband Network Ltd. | 213.34 TiB | 27.29% | 213.34 TiB | 0.00%
f02812504 | Hong Kong, Central and Western, HK; HK Broadband Network Ltd. | 22.38 TiB | 2.86% | 22.38 TiB | 0.00%
f02816081 | Singapore, Singapore, SG; Zenlayer Inc | 177.84 TiB | 22.75% | 177.84 TiB | 0.00%
f02816095 | Singapore, Singapore, SG; Zenlayer Inc | 98.84 TiB | 12.64% | 98.84 TiB | 0.00%

SP provided upfront:
f02812504 | | Coffee Cloud | HK | no | Frank
f02810749 | | nebula | HK | no | John
f02810751 | | nebula | HK | no | John
f02221110 | | LianLi | China | no | Lee
f02221111 | | LianLi | China | no | Lee

One SP matched. Closing until an updated list and information are provided.

@Filplus-govteam Can you please close this?
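To make the distribution question concrete, the sealed volumes in the "SPs taking deals" list above can be summed per country/region code. This is only a sketch under stated assumptions: the region code for each provider is read from the location string in that list, the variable names are hypothetical, and the figures are a snapshot from this comment rather than current on-chain data.

```python
from collections import defaultdict

# (provider ID, region code, TiB sealed) copied from the list above.
deals = [
    ("f02760664", "CN", 137.50),
    ("f02199203", "CN", 51.69),
    ("f02221113", "CN", 80.25),
    ("f02823562", "HK", 213.34),
    ("f02812504", "HK", 22.38),
    ("f02816081", "SG", 177.84),
    ("f02816095", "SG", 98.84),
]

# Sum sealed volume per region code.
tib_by_region = defaultdict(float)
for _provider, region, tib in deals:
    tib_by_region[region] += tib

total_tib = sum(tib_by_region.values())

# Percentage share of the total sealed volume per region, rounded to 2 places.
share_by_region = {
    region: round(100 * tib / total_tib, 2)
    for region, tib in tib_by_region.items()
}

for region, share in sorted(share_by_region.items()):
    print(f"{region}: {tib_by_region[region]:.2f} TiB ({share}%)")
```

Grouping by a coarser key, such as continent, would need an extra mapping from region code to continent, which is not part of the data in this thread.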

nicelove666 commented 11 months ago

@herrehesse I really don't know what to say.

Do you never look at historical records?

We have explained and disclosed it.

OK, let me disclose it again now.

WX20231130-220228@2x

nicelove666 commented 11 months ago

How much does your 2008 match?

Sunnyiscoming commented 11 months ago

SP List provided:
[
{"providerID": "f02831201", "City": "Guangzhou", "Country": "CN", "SPOrg": "Juwu Mine"},
{"providerID": "f02831202", "City": "Guangzhou", "Country": "CN", "SPOrg": "Juwu Mine"},
{"providerID": "f02829200", "City": "Jiangxi", "Country": "CN", "SPOrg": "Applecloud"},
{"providerID": "f02832675", "City": "Shenzhen", "Country": "CN", "SPOrg": "Zhongchuangyun"},
{"providerID": "f02824157", "City": "Shanxi", "Country": "CN", "SPOrg": "Qiankunstorage"},
{"providerID": "f02830476", "City": "Guangdong", "Country": "CN", "SPOrg": "datastone"},
{"providerID": "f02824533", "City": "HK", "Country": "CN", "SPOrg": "Coffee Cloud"},
{"providerID": "f02833198", "City": "HK", "Country": "CN", "SPOrg": "Coffee Cloud"},
{"providerID": "f02823562", "City": "HK", "Country": "CN", "SPOrg": "Coffee Cloud"},
{"providerID": "f02812504", "City": "HK", "Country": "CN", "SPOrg": "Coffee Cloud"},
{"providerID": "f02837094", "City": "HK", "Country": "CN", "SPOrg": "Coffee Cloud"},
{"providerID": "f02841613", "City": "HK", "Country": "CN", "SPOrg": "Coffee Cloud"},
{"providerID": "f02829749", "City": "Guangdong", "Country": "CN", "SPOrg": "xing"},
{"providerID": "f02829748", "City": "Guangdong", "Country": "CN", "SPOrg": "xing"},
{"providerID": "f02824140", "City": "Ningxia", "Country": "CN", "SPOrg": "cylinder"},
{"providerID": "f02760664", "City": "XYZ", "Country": "Inner Mongolia", "SPOrg": "Richard"},
{"providerID": "f02199203", "City": "XYZ", "Country": "Inner Mongolia", "SPOrg": "Richard"},
{"providerID": "f02816081", "City": "XYZ", "Country": "Singapore", "SPOrg": "KRAL"},
{"providerID": "f02816095", "City": "XYZ", "Country": "Singapore", "SPOrg": "KRAL"}
]

nicelove666 commented 11 months ago

The bot has been buggy for a long time, and the next round of 2P quota has not been triggered. Can you help me? Thank you very much. @Sunnyiscoming @clriesco

nicelove666 commented 11 months ago

Do you need me to submit a bug report? Thank you for your hard work and wish you a nice day. @clriesco

clriesco commented 11 months ago

The bot was not buggy. Your application was tagged with an error because it was approved twice in a row, and the bot skipped it. Someone should have reviewed that and manually removed the error tag. I have already done it.

nicelove666 commented 11 months ago

Yes dear, you are right, you are very professional.

In the last round, three notaries signed: one notary proposed and two notaries approved at the same time.

This may be the root cause of the bot not triggering.

So, please tell me, when can the bot trigger the next round?

clriesco commented 11 months ago

It will be automatically triggered in about 1 hour

nicelove666 commented 11 months ago

Thank you, dear!

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Request number 9

Multisig Notary address

f02049625

Client address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

DataCap allocation requested

2PiB

Id

729ecb24-c655-4808-b2e3-df41eefae1b9

nicelove666 commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

ipollo00 commented 11 months ago

LGTM, willing to support

ipollo00 commented 11 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceav5lryvl55xmynmme2ffhgnwxeihw4bzpakl32c53sb3cj5tanq2

Address

f1q5qokywmvh4xdn2g7snu3ysah4bbp6iyqx3kcry

Datacap Allocated

2.00PiB

Signer Address

f1n5wlrrhoxpkgwij25xrtt7w7g2k3fhbthmdn6ri

Id

729ecb24-c655-4808-b2e3-df41eefae1b9

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceav5lryvl55xmynmme2ffhgnwxeihw4bzpakl32c53sb3cj5tanq2

zcfil commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.