filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Commoncrawl(2/3) #2287

Closed nicelove666 closed 8 months ago

nicelove666 commented 11 months ago

Data Owner Name

Commoncrawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Life Science / Healthcare

Website

https://commoncrawl.org/

Social Media

https://commoncrawl.org/

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

1.5PiB

Number of replicas to store

10

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2204
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2045
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1947
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1946
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1845
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1846
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1847
https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1848

Describe the data being stored onto Filecoin

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.
Primary training corpus in every LLM. 82% of raw tokens used to train GPT-3. Free and open corpus since 2007. Cited in over 8000 research papers. 3–5 billion new pages added each month.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

United States

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

We use a script to package the files originally stored on the nginx file server into tar files. Each tar file is kept to around 17-30G. Finally, each tar file is converted into a car file. After the conversion is complete, a record of the car file and the metadata of the source files is stored in our local system for later queries.
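The packaging step described above could be sketched roughly as follows. This is only an illustration, not the applicant's actual script: the function name `plan_batches`, the reading of "G" as GiB, and the choice to enforce only the upper bound of the 17-30G range are all assumptions.

```python
from typing import Iterable, List, Tuple

GIB = 1024 ** 3  # assumption: "G" in the answer above is read as GiB


def plan_batches(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = 30 * GIB,
) -> List[List[str]]:
    """Group (path, size_in_bytes) pairs, in order, into tar-sized batches.

    A batch is closed as soon as adding the next file would push it past
    max_bytes. A single file larger than max_bytes gets a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current = []
            current_size = 0
        current.append(path)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each returned batch would then be written to one tar file and converted to a car file; keeping the input order fixed is what makes the batching repeatable.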

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

This website has a lot of data; as far as I know, no one has systematically stored all of it on the Filecoin network.

Please share a sample of the data

https://commoncrawl.org/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Slack, Filmine, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

f02824140 
f02824157
f02831201
f02831202
f02816081
f02816095
f02841613
f02199203  
f02760664
f02223170

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 11 months ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

nicelove666 commented 11 months ago

The bot on 2204 seems to have a bug: it did not trigger the next round of signature requests. We submitted the cooperating SPs in detail on 2204. The following are the SPs that this LDN will work with; we have listed the SPs, locations, and entities in detail. We look forward to your early approval and your review of our online transactions. Thank you for your trust and hard work. We believe that together we will make Filecoin better and better.

| Provider | Location | SP Entity or Personal |
| --- | --- | --- |
| f02841613 | HK | Coffeecould |
| f02831201 | GuangDong | Juwu Mine |
| f02831202 | GuangDong | Juwu Mine |
| f02816081 | Singapore | KRAL |
| f02816095 | Singapore | KRAL |
| f02824157 | BeiJing | zhongchuangyun |
| f02824140 | BeiJing | zhongchuangyun |
| f02223170 | tianyou | avn |
| f02199203 | Inner Mongolia | Richard |
| f02760664 | Inner Mongolia | Richard |

cryptowhizzard commented 11 months ago

Give me proper index files of your deals, and give me two EU and USA miners with no VPN. Show me retrieval on those unsealed copies. If your data is correct, I'd love to support.

But NOT like this, the current form.

nicelove666 commented 11 months ago

First, we met in a video conference. Maybe you forgot that I am an American. I have been in China recently, and the SPs I cooperate with are mainly in Asia, but this does not mean that we do not have foreign SPs. We also have a team in Singapore, and I am contacting FF to meet. During labweek, I also went to Türkiye.

Second, if you care about the SPs we work with, I'd be happy to tell you, but they won't start now: f01422327 (Japan), f02229545 (Los Angeles), f02252024 (United States).

Third, next week or around December 15th, we will have a European SP starting. It is a new SP, and I will not know the SP until it has started. When the SP is established, we will disclose it in advance.

Fourth, now that the AC robot is online, leave everything to the intelligent AC robot, and we will meet the requirements of the AC robot.

[Screenshot attached: WX20231201-233004@2x]

nicelove666 commented 11 months ago

I hope you can push it forward @kevzak @Filplus-govteam @Sunnyiscoming @Kevin-FF-USA @galen-mcandrew

cryptowhizzard commented 11 months ago

@nicelove666 Business names and contact information please. I will check for VPN use.

nicelove666 commented 11 months ago

All information is public and we have submitted it. In addition, please use a professional and recognized website to check, such as this website https://seon.io/. There are many similar testing websites. I hope you can make your testing website public so that you can have credibility. I also hope that your testing tool will produce the same results as this type of recognized testing website.

Sunnyiscoming commented 11 months ago

Hello, per the https://github.com/filecoin-project/notary-governance/issues/922 for Open, Public Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant and also please add the contact information of the SP entities you are working with to store copies of the data.

This information will be reviewed by Fil+ Governance team to confirm validity and then the application will be allowed to move forward for additional notary review.

herrehesse commented 11 months ago

All information is public and we have submitted it.

Can you show me in here with contact information? Love to perform due diligence.

nicelove666 commented 11 months ago

We submitted it, hope to see your progress @Sunnyiscoming

nicelove666 commented 11 months ago

[Screenshot attached: WX20231205-185036@2x]

nicelove666 commented 11 months ago

Is there any update here? @Sunnyiscoming

ghost commented 11 months ago

@nicelove666 - where are the 10 SPs onboarding these 10 copies? We see 3 SPs listed on your registration form.

f02841613 | coffeecloud | HK | no | ted
f02831201 | Juwu Mine | GuangDong | no | Jon
f02831202 | Juwu Mine | GuangDong | no | Jon
f02824157 | zhongchuangyun | BeiJing | no | lisa
f02824140 | zhongchuangyun | BeiJing | no | lisa

nicelove666 commented 11 months ago

Hello, @Filplus-govteam, we filled in the SPs according to the requirements of the registration form. However, only 5 SPs can be entered in the registration form.

To show all of our cooperating SPs, we have listed them in detail in the application form. We hope you can push us forward.

The bot on 2204 seems to have a bug: it did not trigger the next round of signature requests. We submitted the cooperating SPs in detail on 2204. The following are the SPs that this LDN will work with; we have listed the SPs, locations, and entities in detail. We look forward to your early approval and your review of our online transactions. Thank you for your trust and hard work. We believe that together we will make Filecoin better and better.

| Provider | Location | SP Entity or Personal |
| --- | --- | --- |
| f02841613 | HK | Coffeecould |
| f02831201 | GuangDong | Juwu Mine |
| f02831202 | GuangDong | Juwu Mine |
| f02816081 | Singapore | KRAL |
| f02816095 | Singapore | KRAL |
| f02824157 | BeiJing | zhongchuangyun |
| f02824140 | BeiJing | zhongchuangyun |
| f02223170 | tianyou | avn |
| f02199203 | Inner Mongolia | Richard |
| f02760664 | Inner Mongolia | Richard |

ghost commented 11 months ago

This shows 6 SPs. You said 10 copies, who is storing all the copies?

Also can you show proof of 1.5PiB dataset from commoncrawl? Which dataset?

nicelove666 commented 11 months ago

@Filplus-govteam Why are these 6 SPs, not 10 SPs? Is the counting unit of SPs "company" or "node"? There are 10 nodes here from 6 companies.

nicelove666 commented 11 months ago

I hope we can set a clear rule about how the number of SPs is counted, whether by company or by node. Then we can open an issue, and everyone will abide by this rule.

nicelove666 commented 11 months ago

https://commoncrawl.org/ This data set has at least 4P of data. With 10 backups, we can apply for a total of 40P. If I calculated it wrong, please tell me. Thank you.

ghost commented 11 months ago

Got it, so you are storing 10 copies across 6 companies.

Per guidelines, no more than one copy per miner ID and no more than 30% per company. Thanks
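For anyone who wants to check a provider list against the two limits quoted above, a rough sketch follows. It assumes that "30% per company" means a company's share of the total number of copies; the function name `check_distribution` is hypothetical, and the mapping simply reuses the SP list posted earlier in this thread.

```python
from collections import Counter
from typing import Dict, List

# Miner ID -> entity, copied from the SP list posted earlier in this thread.
ENTITY_OF: Dict[str, str] = {
    "f02841613": "Coffeecould",
    "f02831201": "Juwu Mine",
    "f02831202": "Juwu Mine",
    "f02816081": "KRAL",
    "f02816095": "KRAL",
    "f02824157": "zhongchuangyun",
    "f02824140": "zhongchuangyun",
    "f02223170": "tianyou avn",
    "f02199203": "Richard",
    "f02760664": "Richard",
}


def check_distribution(
    copies: List[str],
    entity_of: Dict[str, str],
    max_share: float = 0.30,  # assumption: share of the total number of copies
) -> List[str]:
    """Return human-readable violations for a list with one miner ID per copy."""
    problems: List[str] = []
    # Rule 1: no more than one copy per miner ID.
    for miner, count in sorted(Counter(copies).items()):
        if count > 1:
            problems.append(f"{miner} holds {count} copies")
    # Rule 2: no company holds more than max_share of all copies.
    total = len(copies)
    per_entity = Counter(entity_of[miner] for miner in copies)
    for entity, count in sorted(per_entity.items()):
        if count / total > max_share:
            problems.append(f"{entity} holds {count} of {total} copies")
    return problems
```

With one copy on each of the ten listed miner IDs, no company holds more than 2 of 10 copies, so `check_distribution(list(ENTITY_OF), ENTITY_OF)` reports no violations under this reading of the rule.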

ghost commented 11 months ago

https://commoncrawl.org/ This data set has at least 4P of data. With 10 backups, we can apply for a total of 40P. If I calculated it wrong, please tell me. Thank you.

Yes, exactly, it's a very big dataset. So which portion of the 4PiB are you storing? For your data sample you posted their website.

What specific dataset are you storing that is 1.5PiB?

nicelove666 commented 11 months ago

Thank you for taking the time to communicate with me. I appreciate the opportunity to have further discussions with you, such as notary meetings or offline conferences. We have brought in over 100P of DC for Filecoin, and I will provide detailed information in the V5 application.

Now, if possible, I kindly request your assistance in advancing this LDN. Thank you for your hard work, and I wish you a wonderful weekend in advance.

nicelove666 commented 11 months ago

2204

nicelove666 commented 11 months ago

We have stored the downloaded data in 2204, you can view it at any time

nicelove666 commented 11 months ago

In order to save you time, we can upload the data here at http://send.datasetcreators.com at any time. In fact, we have uploaded the data here multiple times for everyone to see, but this website is only valid for 7 days.

But it doesn't matter, I can still upload it for you if you need it.

ghost commented 11 months ago

You have stored what in 2204? There are no details there, there are no dataset details here.

What did you store with the 14PiB in 2204? What is being stored in 2287 that is different? How do we know? How can we see?

ghost commented 11 months ago

Just asking you to add more detail as to which applications include which portions of CommonCrawl. Otherwise there is no record of anything you have stored/will store to look back on

nicelove666 commented 11 months ago

Well, I understand. We will reply to your question in detail. I hope this is a pleasant communication.

nicelove666 commented 11 months ago

We did secondary development based on https://github.com/karust/gogetcrawl. After downloading a batch of data, the tool automatically splits it, packages it into tar files, and converts them into car files. As long as the same download parameters are set, anyone can download the same data and generate car files with the same piece CID, which ensures that multiple nodes can back up identical files. When starting the program, we set parameters that point to different parts of the data set to download and generate the car files. Currently, the full list of commoncrawl data sets is as follows:
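As a general illustration of why identical parameters can yield identical output, here is a minimal sketch of a byte-for-byte reproducible tar archive. It is not the modified gogetcrawl tool (which is not shown in this thread); the function name `deterministic_tar` is hypothetical, and the SHA-256 digest is only a stand-in for a piece CID, which is computed differently.

```python
import hashlib
import io
import tarfile
from typing import Dict


def deterministic_tar(files: Dict[str, bytes]) -> bytes:
    """Build a tar archive whose bytes depend only on file names and contents."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):  # fixed member order
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            info.mtime = 0  # no wall-clock timestamps
            info.mode = 0o644
            info.uid = info.gid = 0  # no local user or group IDs
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def digest(files: Dict[str, bytes]) -> str:
    """SHA-256 of the archive; a stand-in for comparing piece CIDs."""
    return hashlib.sha256(deterministic_tar(files)).hexdigest()
```

Two machines that feed the same files into `digest` get the same value regardless of insertion order or local clock, which is the property a multi-node backup relies on.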

CC-MAIN-2023-MAR-MAY-OCT
CC-MAIN-2022-23-SEP-NOV-JAN (2204)
CC-MAIN-2022-MAY-JUN-AUG (2204)
CC-MAIN-2021-22-OCT-NOV-JAN (2204)
CC-MAIN-2021-JUN-JUL-SEP (2204)
CC-MAIN-2021-FEB-APR-MAY (2204)
CC-MAIN-2020-21-OCT-NOV-JAN
CC-MAIN-2020-JUL-AUG-SEP
CC-MAIN-2020-FEB-MAR-MAY
CC-MAIN-2019-20-NOV-DEC-JAN
CC-MAIN-2019-AUG-SEP-OCT
CC-MAIN-2019-MAY-JUN-JUL
CC-MAIN-2019-FEB-MAR-APR
CC-MAIN-2018-19-NOV-DEC-JAN
CC-MAIN-2018-AUG-SEP-OCT
CC-MAIN-2018-MAY-JUN-JUL
CC-MAIN-2018-FEB-MAR-APR
CC-MAIN-2018-JAN
CC-MAIN-2017-18-NOV-DEC-JAN
CC-MAIN-2017-AUG-SEP-OCT
CC-MAIN-2017-MAY-JUN-JUL
CC-MAIN-2017-FEB-MAR-APR-HOSTGRAPH

Taking LDN2204 as an example, the downloaded and packaged data sets include CC-MAIN-2022-23-SEP-NOV-JAN, CC-MAIN-2022-MAY-JUN-AUG, CC-MAIN-2021-22-OCT-NOV-JAN, CC-MAIN-2021-JUN-JUL-SEP, and CC-MAIN-2021-FEB-APR-MAY, totaling about 1.6P. After a car file is generated, the corresponding metadata file is generated as well. Its file name consists of the name and starting position of the data set. Based on the metadata files, you can tell which part of the total data set is saved by each LDN. For example: 1-2204@CC-MAIN-2023-06.csv
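To look up which LDN covers which crawl, a metadata file name like the example above could be parsed along these lines. The field layout is a guess from that single example (`<start>-<LDN number>@<crawl name>.csv`); the applicant has not documented it, and the names `MetadataName` and `parse_metadata_name` are hypothetical.

```python
import re
from typing import NamedTuple


class MetadataName(NamedTuple):
    start: int  # assumed: starting position within the data set
    ldn: int    # assumed: LDN issue number, e.g. 2204
    crawl: str  # Common Crawl crawl name, e.g. CC-MAIN-2023-06


# Assumed layout, inferred only from "1-2204@CC-MAIN-2023-06.csv".
_PATTERN = re.compile(r"^(\d+)-(\d+)@(CC-MAIN-\d{4}-\d{2})\.csv$")


def parse_metadata_name(filename: str) -> MetadataName:
    """Split a metadata file name into its assumed components."""
    match = _PATTERN.match(filename)
    if match is None:
        raise ValueError(f"unrecognized metadata file name: {filename!r}")
    return MetadataName(int(match.group(1)), int(match.group(2)), match.group(3))
```

Grouping the parsed names by `ldn` would give the per-application record of stored crawls that was asked for earlier in the thread.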

[Screenshots attached: WechatIMG16420, WechatIMG16427, WechatIMG16429]

ghost commented 11 months ago

This is the kind of information that is valuable to see about a dataset, thank you for sharing

nicelove666 commented 11 months ago

Thank you for your approval. We hope to see updates.

nicelove666 commented 11 months ago

@Sunnyiscoming

Sunnyiscoming commented 11 months ago

Datacap Request Trigger

Total DataCap requested

15PiB

Expected weekly DataCap usage rate

1PiB

Client address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

DataCap allocation requested

512TiB

Id

2ed85c7e-7373-4149-bfc2-8a302c37215b

stcloudlisa commented 11 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacedwz56khl7b4zohrsed6jib3cvuea32gkzjp6oyevus42273rfc4i

Address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Datacap Allocated

512.00TiB

Signer Address

f1jvvltduw35u6inn5tr4nfualyd42bh3vjtylgci

Id

2ed85c7e-7373-4149-bfc2-8a302c37215b

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedwz56khl7b4zohrsed6jib3cvuea32gkzjp6oyevus42273rfc4i

1ane-1 commented 11 months ago

Support for the first round.

1ane-1 commented 11 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedbjha6esfcwt5wv2yihvyu7gk5fajvq4xmyg5ysgu67ffnuzzsw4

Address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Datacap Allocated

512.00TiB

Signer Address

f1mdk7s2vntzm6hu35yuo6vjubtrpfnb2awhgvrri

Id

2ed85c7e-7373-4149-bfc2-8a302c37215b

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedbjha6esfcwt5wv2yihvyu7gk5fajvq4xmyg5ysgu67ffnuzzsw4

nicelove666 commented 11 months ago

The quota has been used up, but the robot did not trigger the next round of signatures. Please help us. @clriesco

herrehesse commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

herrehesse commented 11 months ago

[Screenshots attached: Screenshot 2023-12-15 at 08 50 06, Screenshot 2023-12-15 at 08 50 48]

@Filplus-govteam As we can clearly see by the evidence provided above:

STOP allowing abuse once and for all. Close this LDN, note the abusive notaries (again), and listen to me when I say that @nicelove666 is not adhering to the rules, as with all of their previous LDNs.

@Kevin-FF-USA @galen-mcandrew @simonkim0515

nicelove666 commented 11 months ago

Please publish your website

nicelove666 commented 11 months ago

Randomly insulting the person who brought 100P DC to Filecoin is not a wise choice.

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

DataCap allocation requested

512TiB

Id

b1b5b89c-5c70-41d5-8ce2-cebf4d6f5017

nicelove666 commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

sxxfuture-official commented 11 months ago

LGTM, will support this round.

sxxfuture-official commented 11 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceds2wqnavozrypzsfods5cdvovd5wzw3havw7xoe22l2qmkbunzje

Address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Datacap Allocated

512.00TiB

Signer Address

f1foiomqlmoshpuxm6aie4xysffqezkjnokgwcecq

Id

b1b5b89c-5c70-41d5-8ce2-cebf4d6f5017

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceds2wqnavozrypzsfods5cdvovd5wzw3havw7xoe22l2qmkbunzje

Normalnoise commented 11 months ago

Report shows the case is healthy, willing to support this round

Normalnoise commented 11 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacea3zmtgqglw5l5r7qieej2pi642se7g2cqx2cy4hgf4sknwyxibh6

Address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

Datacap Allocated

512.00TiB

Signer Address

f1c5non5yf35avgcpsqvxu4yj54yyvxorwyjochqq

Id

b1b5b89c-5c70-41d5-8ce2-cebf4d6f5017

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea3zmtgqglw5l5r7qieej2pi642se7g2cqx2cy4hgf4sknwyxibh6

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f1ht5xh5qtccibzvozb5li43cdhaivheuhy2fje3i

DataCap allocation requested

512TiB

Id

0cb05f47-d20e-4251-8eba-a5b4c162d1e2