filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

MongoStorage - CommonCrawl Archive #2040

Open amughal opened 1 year ago

amughal commented 1 year ago

Data Owner Name

Common Crawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Not-for-Profit

Website

https://commoncrawl.org/

Social Media

None.

Total amount of DataCap being requested

10PiB

Expected size of single dataset (one copy)

1PiB

Number of replicas to store

10

Weekly allocation of DataCap requested

300TiB

On-chain address for first allocation

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

MongoStorage is an emerging Filecoin Service Provider based in Southern California, USA, and is still working through its plan. MongoStorage is FIL Green Gold certified and is currently working toward full ESPA certification. The founders have extensive experience in networks and systems, have gone through multiple ESPA sessions and presentations, and were featured in the Zero to One Service Provider Twitter session by Protocol Labs.
We are working as a Data Preparer in the Slingshot Moonlanding Program to make the most useful data available on the Filecoin network.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

Working with BigDataExchange
SlingShot Moonlanding V3

Describe the data being stored onto Filecoin

The Common Crawl project is a corpus of web crawl data composed of over 50 billion web pages.
The following 10 datasets have been crawled and are being prepared.

s3://commoncrawl/crawl-data/CC-MAIN-2022-40 – September/October 2022
s3://commoncrawl/crawl-data/CC-MAIN-2023-14 – March/April 2023
s3://commoncrawl/crawl-data/CC-MAIN-2023-06 – January/February 2023
s3://commoncrawl/crawl-data/CC-MAIN-2020-40 – September 2020
s3://commoncrawl/crawl-data/CC-MAIN-2020-45 – October 2020
s3://commoncrawl/crawl-data/CC-MAIN-2021-39 – September 2021
s3://commoncrawl/crawl-data/CC-MAIN-2021-49 – November/December 2021
s3://commoncrawl/crawl-data/CC-MAIN-2022-05 – January 2022
s3://commoncrawl/crawl-data/CC-MAIN-2022-21 – May 2022
s3://commoncrawl/crawl-data/CC-MAIN-2022-27 – June/July 2022
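The crawl locations above share one pattern, so they can be parsed and ordered mechanically. The following is a minimal sketch, not part of the application itself. It assumes the two numbers in each crawl ID are a year and a week number (inferred from the ID pattern and the dates listed), and it assumes the per-crawl listing files live at `crawl-data/<crawl ID>/<kind>.paths.gz`. The helper names `parse_crawl` and `listing_key` are hypothetical.

```python
import re
from typing import NamedTuple


class Crawl(NamedTuple):
    crawl_id: str
    year: int
    week: int  # assumed to be a week number, based on the ID pattern


CRAWL_URI = re.compile(r"^s3://commoncrawl/crawl-data/(CC-MAIN-(\d{4})-(\d{2}))$")


def parse_crawl(uri: str) -> Crawl:
    """Split a crawl URI from the list above into its ID, year and week."""
    match = CRAWL_URI.match(uri)
    if match is None:
        raise ValueError(f"not a Common Crawl crawl URI: {uri}")
    return Crawl(match.group(1), int(match.group(2)), int(match.group(3)))


def listing_key(crawl: Crawl, kind: str = "warc") -> str:
    """Build the object key of a listing file (assumed bucket layout)."""
    return f"crawl-data/{crawl.crawl_id}/{kind}.paths.gz"


URIS = [
    "s3://commoncrawl/crawl-data/CC-MAIN-2022-40",
    "s3://commoncrawl/crawl-data/CC-MAIN-2023-14",
    "s3://commoncrawl/crawl-data/CC-MAIN-2023-06",
    "s3://commoncrawl/crawl-data/CC-MAIN-2020-40",
    "s3://commoncrawl/crawl-data/CC-MAIN-2020-45",
    "s3://commoncrawl/crawl-data/CC-MAIN-2021-39",
    "s3://commoncrawl/crawl-data/CC-MAIN-2021-49",
    "s3://commoncrawl/crawl-data/CC-MAIN-2022-05",
    "s3://commoncrawl/crawl-data/CC-MAIN-2022-21",
    "s3://commoncrawl/crawl-data/CC-MAIN-2022-27",
]

# Order the crawls chronologically by (year, week).
crawls = sorted((parse_crawl(u) for u in URIS), key=lambda c: (c.year, c.week))

if __name__ == "__main__":
    for crawl in crawls:
        print(crawl.crawl_id, listing_key(crawl))
```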

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

The following is a sample from one of the datasets. It lists the different directory structures; the files are compressed (`.gz`), with the individual files named in the list files.

| | File List | #Files | Total Size Compressed (TiB) |
| --- | --- | --- | --- |
| Segments | CC-MAIN-2021-49/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2021-49/warc.paths.gz | 64000 | 68.66 |
| WAT files | CC-MAIN-2021-49/wat.paths.gz | 64000 | 16.66 |
| WET files | CC-MAIN-2021-49/wet.paths.gz | 64000 | 7.18 |
| Robots.txt files | CC-MAIN-2021-49/robotstxt.paths.gz | 64000 | 0.15 |
| Non-200 responses files | CC-MAIN-2021-49/non200responses.paths.gz | 64000 | 2.29 |
| URL index files | CC-MAIN-2021-49/cc-index.paths.gz | 302 | 0.2 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-49/. Also the columnar index has been updated to contain this crawl.
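As a rough sanity check on the "1PiB per copy" figure, the compressed sizes in the CC-MAIN-2021-49 sample can be summed and scaled to ten crawls. This is only a sketch: it assumes the other nine crawls are about the same size as this one, which the application does not state.

```python
# (file count, compressed size in TiB) per file type, copied from the sample listing.
# The segments listing has no size, so it is left out.
SAMPLE_2021_49 = {
    "warc": (64000, 68.66),
    "wat": (64000, 16.66),
    "wet": (64000, 7.18),
    "robotstxt": (64000, 0.15),
    "non200responses": (64000, 2.29),
    "cc-index": (302, 0.2),
}

TIB_PER_PIB = 1024

# Total compressed size of this one crawl.
total_tib = sum(size for _, size in SAMPLE_2021_49.values())

# Assumption: all 10 crawls are roughly the size of CC-MAIN-2021-49.
est_ten_crawls_pib = total_tib * 10 / TIB_PER_PIB

if __name__ == "__main__":
    print(f"one crawl: {total_tib:.2f} TiB")
    print(f"ten crawls (estimate): {est_ten_crawls_pib:.2f} PiB")
```

Under that assumption the ten crawls come to a little under 1 PiB, which is in line with the single-copy size given in the form.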

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

More than 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

HTTP or FTP server

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

Providers through BigDataExchange
Providers through Aligned
Providers through Slack

How do you plan to make deals to your storage providers

Boost client, Lotus client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes
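The sizing figures in the form fit together arithmetically: one 1PiB copy times 10 replicas gives the 10PiB total. The sketch below also estimates how long the total would take to use up if DataCap were consumed at exactly the requested weekly rate. That constant rate is an assumption for illustration only; the tranches later in this thread are not issued weekly.

```python
import math

TIB_PER_PIB = 1024

single_copy_pib = 1   # "Expected size of single dataset (one copy)"
replicas = 10         # "Number of replicas to store"
weekly_tib = 300      # "Weekly allocation of DataCap requested"

# Total DataCap needed to store every replica.
total_pib = single_copy_pib * replicas

# Weeks needed at a constant weekly rate (illustrative assumption).
weeks_at_weekly_rate = math.ceil(total_pib * TIB_PER_PIB / weekly_tib)

if __name__ == "__main__":
    print(f"total: {total_pib} PiB, about {weeks_at_weekly_rate} weeks at {weekly_tib} TiB/week")
```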

amughal commented 1 year ago

Hi @herrehesse, please let me know if you have further questions.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 70% of total datacap - f02181705: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.
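The warnings in this report can be read as simple ratios over the client's deals. The sketch below shows one way such checks could be computed; it is not the checker bot's actual implementation. The deal tuple layout, the function names, the placeholder piece IDs, and the choice to weight by deal size are all assumptions. The 70% limit and the "fewer than 3 providers" rule are taken from the warning text above (later reports in this thread use 30% and 4).

```python
from collections import defaultdict

# A deal is modelled as (provider_id, piece_id, size_in_tib); this shape is hypothetical.


def provider_shares(deals):
    """Return each provider's share of the total sealed size (0.0 to 1.0)."""
    per_provider = defaultdict(float)
    for provider, _piece, size in deals:
        per_provider[provider] += size
    total = sum(per_provider.values())
    if total == 0:
        return {}
    return {provider: size / total for provider, size in per_provider.items()}


def over_limit(shares, limit=0.70):
    """Providers whose share exceeds the limit (70% in the report above)."""
    return {provider: share for provider, share in shares.items() if share > limit}


def under_replicated_fraction(deals, min_providers=3):
    """Size-weighted fraction of deals whose piece sits with fewer than min_providers providers."""
    deals = list(deals)
    providers_by_piece = defaultdict(set)
    for provider, piece, _size in deals:
        providers_by_piece[piece].add(provider)
    total = sum(size for _provider, _piece, size in deals)
    if total == 0:
        return 0.0
    low = sum(size for _provider, piece, size in deals
              if len(providers_by_piece[piece]) < min_providers)
    return low / total


# Hypothetical deals that mirror the situation reported above: one provider holds everything.
deals = [
    ("f02181705", "piece-a", 32.0),
    ("f02181705", "piece-b", 32.0),
    ("f02181705", "piece-c", 32.0),
]
shares = provider_shares(deals)
flagged = over_limit(shares)
low_replication = under_replicated_fraction(deals)

if __name__ == "__main__":
    print(flagged, f"{low_replication:.2%}")
```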

amughal commented 1 year ago

@herrehesse @cryptowhizzard Please see the retrieval improvements over time. Let me know if I can answer any other questions. We really need this tranche to move on. Thank you, and I appreciate your time.

cryptowhizzard commented 1 year ago

@amughal

Retrieval works now, which is great. I am willing to sign, but since you did not follow the FIL+ rules for distribution, I would like to know the second and third organizations to which you are going to send DataCap. Can you help me out here?

amughal commented 1 year ago

Hi @cryptowhizzard, the second miner is in Las Vegas; its miner ID is "f02181704". The third miner, on the East Coast, will most probably be "f02229460", while the fourth, in Houston, TX, will be "f02250603".

The second has its FIL allocation and is waiting for the tranche. Both the third and fourth are in the process of getting FIL from Darma.

Let me know what other information you need.

And thanks for getting back.

cryptowhizzard commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacedsrsu6ql2h7b3tdl6nt2l4klvdhbnl46gmn3myhuwpvrv34ipf7u

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

300.00TiB

Signer Address

f1krmypm4uoxxf3g7okrwtrahlmpcph3y7rbqqgfa

Id

2ae97ea4-d5af-4a93-9c1c-1a0c76742ce1

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedsrsu6ql2h7b3tdl6nt2l4klvdhbnl46gmn3myhuwpvrv34ipf7u

cryptowhizzard commented 1 year ago

Hi @amughal

I managed to download some of your data. I have put it online here:

http://www.datasetcreators.com/downloadedcarfiles/bafybeibppjdeqvhrjganrxfxrop4ftafdluksd2mqyobdwwnpynpdgxrdy/warc/

Data looks legit. Thanks!

amughal commented 1 year ago

@cryptowhizzard Thank you for taking the time to do the data analysis.

Patapon0702 commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 70% of total datacap - f02181705: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.

Patapon0702 commented 1 year ago

Will support this round and keep an eye on the improvements for data distribution.

Patapon0702 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacecazzxqurkyxg6yxontc2x27727wfgog7culdtk6meear2tmueqyi

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

300.00TiB

Signer Address

f1ho2liobpznr7llma6xcl7jtififsfvhdnudn4yy

Id

2ae97ea4-d5af-4a93-9c1c-1a0c76742ce1

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecazzxqurkyxg6yxontc2x27727wfgog7culdtk6meear2tmueqyi

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 70% of total datacap - f02181705: 86.23%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

amughal commented 1 year ago

This is an active LDN; a tranche was recently approved and sealing is in progress. I'm not sure why the Stale bot kicked in. Any ideas? Thanks.

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 3

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

DataCap allocation requested

600TiB

Id

776a5c91-5a05-47bd-8ae1-56ba2271bd8d

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Rule to calculate the allocation request amount

200% of weekly dc amount requested

DataCap allocation requested

600TiB

Total DataCap granted for client so far

272848.4YiB

Datacap to be granted to reach the total amount requested by the client (10PiB)

272848.4YiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 11551 | 4 | 300TiB | 41.67 | 73.70TiB |

liyunzhi-666 commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed too much duplicate data - f02250603: 20.57%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

liyunzhi-666 commented 1 year ago

I can support this round.

liyunzhi-666 commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacechzjdtvggwzqvrxm6f4gmda3ih5rnnwsscfzwue4e4ndpwgx377i

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

600.00TiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

Id

776a5c91-5a05-47bd-8ae1-56ba2271bd8d

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacechzjdtvggwzqvrxm6f4gmda3ih5rnnwsscfzwue4e4ndpwgx377i

amughal commented 1 year ago

Thank you @liyunzhi-666, I appreciate your quick response.

zcfil commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed too much duplicate data - f02250603: 20.57%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

zcfil commented 1 year ago

The data is very healthy. Please pay attention to the distribution of CID data. This round of audit has passed.

zcfil commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceavw4dq6l4byv4mxqkdnqyn22vbkp46anhlm4yq3hqoaow3jogj72

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

600.00TiB

Signer Address

f1cjzbiy5xd4ehera4wmbz63pd5ku4oo7g52cldga

Id

776a5c91-5a05-47bd-8ae1-56ba2271bd8d

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceavw4dq6l4byv4mxqkdnqyn22vbkp46anhlm4yq3hqoaow3jogj72

amughal commented 1 year ago

Thank you @zcfil

MatrixStorage commented 1 year ago

I am willing to support

amughal commented 1 year ago

Thank you @MatrixStorage. I will ping you for the next tranche. Appreciated.

MatrixStorage commented 1 year ago

Thank you @MatrixStorage. I will ping you for the next tranche. Appreciated.

OK, I'll stay tuned

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 4

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

DataCap allocation requested

1.17PiB

Id

d74927b4-9a63-4a07-b495-6c88dc7b244e
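The tranche sizes requested by the bot follow from the 300TiB weekly figure in the application: the stats comments in this thread give the rule as a percentage of the weekly amount, 200% for the 600TiB request and 400% for this 1.17PiB request. A minimal sketch of that arithmetic follows. The function names and the two-decimal display rule are assumptions, not the bot's code.

```python
TIB_PER_PIB = 1024
WEEKLY_TIB = 300  # "Weekly allocation of DataCap requested" in the application


def tranche_tib(percent_of_weekly: int) -> int:
    """Tranche size in TiB for a given percentage of the weekly amount."""
    return WEEKLY_TIB * percent_of_weekly // 100


def display(tib: int) -> str:
    """Format a size the way it appears in the thread (assumed rule: PiB from 1024 TiB up)."""
    if tib >= TIB_PER_PIB:
        return f"{tib / TIB_PER_PIB:.2f}PiB"
    return f"{tib}TiB"


if __name__ == "__main__":
    for percent in (200, 400):
        print(f"{percent}% of weekly -> {display(tranche_tib(percent))}")
```

So 400% of 300TiB is 1200TiB, which is 1200 / 1024 ≈ 1.17PiB, matching the amount requested here.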

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Rule to calculate the allocation request amount

400% of weekly dc amount requested

DataCap allocation requested

1.17PiB

Total DataCap granted for client so far

545696821063757201408.0YiB

Datacap to be granted to reach the total amount requested by the client (10PiB)

545696821063757201408.0YiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 27931 | 4 | 600TiB | 61.39 | 156.21TiB |

amughal commented 1 year ago

@MatrixStorage I'm looking forward to the next round of sealing, with one additional miner in this round. Please let me know if you have any questions. Thank you.

cryptowhizzard commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebsarmsh3nvmaio6disf2rhcwhb6sjdordhpmnzzy3aujb7kossve

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1krmypm4uoxxf3g7okrwtrahlmpcph3y7rbqqgfa

Id

d74927b4-9a63-4a07-b495-6c88dc7b244e

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebsarmsh3nvmaio6disf2rhcwhb6sjdordhpmnzzy3aujb7kossve

amughal commented 1 year ago

@cryptowhizzard Thank you for proposing the next tranche.

jamerduhgamer commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f02250603: 60.77%, f02181705: 31.25%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

jamerduhgamer commented 1 year ago

Hi @amughal, I see that you have onboarded another SP but now that new SP has the majority of the datacap. Please continue distributing the datacap across the other SPs. Will approve the next tranche for now.

amughal commented 1 year ago

Definitely. Another SP will be onboarded in this tranche. Thank you.

jamerduhgamer commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceaqcvh5owqewb4ogugmiyt44lbbxt6xo5zcsrgcwelkhueochwxhg

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1kqdiokoeubyse4qpihf7yrpl7czx4qgupx3eyzi

Id

d74927b4-9a63-4a07-b495-6c88dc7b244e

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaqcvh5owqewb4ogugmiyt44lbbxt6xo5zcsrgcwelkhueochwxhg

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 4

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

DataCap allocation requested

1.17PiB

Id

db392b62-6984-4e2e-a2e0-b7b3818ec84f

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Rule to calculate the allocation request amount

400% of weekly dc amount requested

DataCap allocation requested

1.17PiB

Total DataCap granted for client so far

545696821063757201408.0YiB

Datacap to be granted to reach the total amount requested by the client (10PiB)

545696821063757201408.0YiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 30406 | 4 | 600TiB | 60.46 | 102.25TiB |

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

-- Commented by Stale Bot.

amughal commented 1 year ago

This is an active LDN. Please teach the bot that, due to market uncertainties, SPs at times stop sealing. So please add that AI to the bot. Thank you.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f02250603: 60.67%, f02181705: 31.20%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

zcfil commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f02250603: 60.67%, f02181705: 31.20%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.