filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

MongoStorage - CommonCrawl Archive #2040

Open amughal opened 1 year ago

amughal commented 1 year ago

Data Owner Name

Common Crawl

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Not-for-Profit

Website

https://commoncrawl.org/

Social Media

None.

Total amount of DataCap being requested

10PiB

Expected size of single dataset (one copy)

1PiB

Number of replicas to store

10

Weekly allocation of DataCap requested

300TiB

On-chain address for first allocation

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

MongoStorage is an emerging Filecoin Service Provider based in Southern California, USA, and is working through its plan. MongoStorage is FIL Green GOLD certified and is currently working toward becoming a fully ESPA-certified provider. The founders have extensive experience in networks and systems, have gone through multiple sessions and a presentation at ESPA, and were featured in the Zero to One Service Provider Twitter session by Protocol Labs.
We are working as a Data Preparer in the Slingshot Moonlanding Program, making the most useful data available on the Filecoin network.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

Working with BigDataExchange
SlingShot Moonlanding V3

Describe the data being stored onto Filecoin

The Common Crawl project is a corpus of web crawl data composed of over 50 billion web pages.
The following 10 datasets have been crawled and are being prepared.

s3://commoncrawl/crawl-data/CC-MAIN-2022-40 – September/October 2022
s3://commoncrawl/crawl-data/CC-MAIN-2023-14 – March/April 2023
s3://commoncrawl/crawl-data/CC-MAIN-2023-06 – January/February 2023
s3://commoncrawl/crawl-data/CC-MAIN-2020-40 – September 2020
s3://commoncrawl/crawl-data/CC-MAIN-2020-45 – October 2020
s3://commoncrawl/crawl-data/CC-MAIN-2021-39 – September 2021
s3://commoncrawl/crawl-data/CC-MAIN-2021-49 – November/December 2021
s3://commoncrawl/crawl-data/CC-MAIN-2022-05 – January 2022
s3://commoncrawl/crawl-data/CC-MAIN-2022-21 – May 2022
s3://commoncrawl/crawl-data/CC-MAIN-2022-27 – June/July 2022

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

The following is a sample from one of the datasets. It lists the different directory structures. The files are gzip-compressed (.gz), and the individual files are enumerated in the path list files.

| | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | CC-MAIN-2021-49/segment.paths.gz | 100 | |
| WARC files | CC-MAIN-2021-49/warc.paths.gz | 64000 | 68.66 |
| WAT files | CC-MAIN-2021-49/wat.paths.gz | 64000 | 16.66 |
| WET files | CC-MAIN-2021-49/wet.paths.gz | 64000 | 7.18 |
| Robots.txt files | CC-MAIN-2021-49/robotstxt.paths.gz | 64000 | 0.15 |
| Non-200 responses files | CC-MAIN-2021-49/non200responses.paths.gz | 64000 | 2.29 |
| URL index files | CC-MAIN-2021-49/cc-index.paths.gz | 302 | 0.2 |

The Common Crawl URL Index for this crawl is available at: https://index.commoncrawl.org/CC-MAIN-2021-49/. Also the columnar index has been updated to contain this crawl.
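
As a rough cross-check of the sizes in the sample listing above, the per-type compressed sizes can be summed and scaled to the ten crawls named in this application. This is only a sketch: the function names are illustrative, and the assumption that every crawl is about as large as CC-MAIN-2021-49 is mine, not something stated by Common Crawl.

```python
# Compressed sizes (TiB) copied from the CC-MAIN-2021-49 sample listing.
SIZES_TIB = {
    "WARC": 68.66,
    "WAT": 16.66,
    "WET": 7.18,
    "robots.txt": 0.15,
    "non-200 responses": 2.29,
    "URL index": 0.2,
}

TIB_PER_PIB = 1024


def crawl_total_tib(sizes):
    """Total compressed size of one crawl, in TiB."""
    return sum(sizes.values())


def estimate_dataset_pib(per_crawl_tib, n_crawls):
    """Rough size of n_crawls crawls, in PiB.

    ASSUMPTION: every crawl is about the same size as the sample crawl.
    """
    return per_crawl_tib * n_crawls / TIB_PER_PIB


total = crawl_total_tib(SIZES_TIB)
print(f"one crawl:  {total:.2f} TiB")
print(f"ten crawls: {estimate_dataset_pib(total, 10):.2f} PiB")
```

Under that assumption the sample crawl comes to about 95.14 TiB and ten crawls to about 0.93 PiB, which is in line with the 1PiB single-copy size given earlier in this form.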

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

More than 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

HTTP or FTP server

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

Providers through BigDataExchange
Providers through Aligned
Providers through Slack

How do you plan to make deals to your storage providers

Boost client, Lotus client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes
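
For reviewers, the figures in the form above can be checked with a few lines of arithmetic. This is a minimal sketch with illustrative helper names; it assumes the weekly allocation is consumed in full every week, which the form does not promise.

```python
import math

TIB_PER_PIB = 1024


def total_datacap_pib(single_copy_pib, replicas):
    """DataCap needed to store every replica of the dataset, in PiB."""
    return single_copy_pib * replicas


def weeks_to_consume(total_pib, weekly_tib):
    """Whole weeks needed to use total_pib at weekly_tib per week.

    ASSUMPTION: the full weekly allocation is used each week.
    """
    return math.ceil(total_pib * TIB_PER_PIB / weekly_tib)


total = total_datacap_pib(single_copy_pib=1, replicas=10)
weeks = weeks_to_consume(total, weekly_tib=300)
print(f"total DataCap: {total} PiB, about {weeks} weeks at 300 TiB/week")
```

One copy of 1PiB times 10 replicas matches the 10PiB requested, and at 300TiB per week that total would take roughly 35 weeks to allocate.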

zcfil commented 1 year ago

The data storage is relatively centralized, and the new f032824 node requires data balancing. Please rebalance the stored data across providers later.

zcfil commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceaqfwgau4nh4r7dlca77aw7odgz5c7flnyxkebkg4sgovt3szdhue

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1cjzbiy5xd4ehera4wmbz63pd5ku4oo7g52cldga

Id

db392b62-6984-4e2e-a2e0-b7b3818ec84f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaqfwgau4nh4r7dlca77aw7odgz5c7flnyxkebkg4sgovt3szdhue

cryptowhizzard commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacea2vc7oe2ofdas5e6cuyo4sakclr75rfmgr6eluotc6z75ti57chs

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1krmypm4uoxxf3g7okrwtrahlmpcph3y7rbqqgfa

Id

db392b62-6984-4e2e-a2e0-b7b3818ec84f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea2vc7oe2ofdas5e6cuyo4sakclr75rfmgr6eluotc6z75ti57chs
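
The two notary comments above share one request Id and carry different signer addresses: one proposal, then one approval. A minimal sketch of that check follows. The `NotaryEvent` type and `is_fully_signed` helper are hypothetical, and the rule that one proposal plus one approval from a different signer completes a tranche is an assumption read off this thread, not the documented multisig threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotaryEvent:
    request_id: str
    kind: str    # "proposed" or "approved"
    signer: str  # notary's on-chain address


def is_fully_signed(events, request_id):
    """True if the request has a proposal and an approval by another signer.

    ASSUMPTION: two distinct signatures are enough, as observed in this thread.
    """
    proposers = {e.signer for e in events
                 if e.request_id == request_id and e.kind == "proposed"}
    approvers = {e.signer for e in events
                 if e.request_id == request_id and e.kind == "approved"}
    return bool(proposers) and any(a not in proposers for a in approvers)


# Values copied from the two notary comments above.
REQUEST_ID = "db392b62-6984-4e2e-a2e0-b7b3818ec84f"
EVENTS = [
    NotaryEvent(REQUEST_ID, "proposed", "f1cjzbiy5xd4ehera4wmbz63pd5ku4oo7g52cldga"),
    NotaryEvent(REQUEST_ID, "approved", "f1krmypm4uoxxf3g7okrwtrahlmpcph3y7rbqqgfa"),
]

print(is_fully_signed(EVENTS, REQUEST_ID))      # both signatures present
print(is_fully_signed(EVENTS[:1], REQUEST_ID))  # proposal only
```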

amughal commented 1 year ago

@cryptowhizzard Thank you

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

-- Commented by Stale Bot.

amughal commented 1 year ago

This dataset has been sealing; even today a lot of deals were sent out. Was this message sent out in error? Thanks

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f02229460: 30.68%, f02250603: 42.05%

Deal Data Replication

⚠️ 99.90% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.
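
The two warnings in the report above rest on simple ratios: each provider's share of the sealed DataCap (flagged above 30%) and the share of deals whose data sits with fewer than 4 providers. The sketch below shows one way such ratios could be computed. It is not the checker's actual implementation: the deal tuples, piece names, and provider names are made up, and counting replication per deal rather than per byte is an assumption.

```python
from collections import defaultdict

# Hypothetical deals: (piece id, provider id, size in GiB).
DEALS = [
    ("piece-1", "sp-a", 32), ("piece-1", "sp-b", 32),
    ("piece-1", "sp-c", 32), ("piece-1", "sp-d", 32),
    ("piece-2", "sp-a", 32), ("piece-2", "sp-b", 32),
    ("piece-3", "sp-a", 32),
    ("piece-4", "sp-a", 32),
]


def provider_shares(deals):
    """Return {provider: fraction of the total sealed size}."""
    totals = defaultdict(int)
    for _piece, provider, size in deals:
        totals[provider] += size
    grand_total = sum(totals.values())
    return {p: s / grand_total for p, s in totals.items()}


def over_threshold(shares, limit=0.30):
    """Providers whose share exceeds the limit (30% in the report)."""
    return sorted(p for p, share in shares.items() if share > limit)


def low_replication_fraction(deals, min_providers=4):
    """Fraction of deals whose piece is held by fewer than min_providers.

    ASSUMPTION: the ratio is counted per deal, not weighted by size.
    """
    providers_by_piece = defaultdict(set)
    for piece, provider, _size in deals:
        providers_by_piece[piece].add(provider)
    low = sum(1 for piece, _p, _s in deals
              if len(providers_by_piece[piece]) < min_providers)
    return low / len(deals)


print(over_threshold(provider_shares(DEALS)))
print(f"{low_replication_fraction(DEALS):.2%}")
```

In this made-up data only `sp-a` crosses the 30% line, and half of the deals are for pieces stored with fewer than 4 providers.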

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f02229460: 35.46%, f02250603: 34.80%

⚠️ 1 storage providers sealed too much duplicate data - f01697248: 36.63%

Deal Data Replication

⚠️ 99.92% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f02229460: 33.42%, f02250603: 30.11%

Deal Data Replication

⚠️ 99.61% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 5

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

DataCap allocation requested

1.17PiB

Id

df7c998b-068e-48c8-ba7b-efedc0c2b9ee

amughal commented 1 year ago

Hello @Kevin-FF-USA @raghavrmadya @cryptowhizzard @liyunzhi-666 @kernelogic @jamerduhgamer @zcfil . Can someone please start the approval of next tranche? Appreciated.

Data is spread across three major regions (Canada, USA, S. Korea). There is no CID sharing.

If there are any other questions please let me know.

xinaxu commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacedf4uyltff3lgyrs7nydrvsp35b5wj7dexuvjjeslmftedvvj2iho

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1k3ysofkrrmqcot6fkx4wnezpczlltpirmrpsgui

Id

df7c998b-068e-48c8-ba7b-efedc0c2b9ee

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedf4uyltff3lgyrs7nydrvsp35b5wj7dexuvjjeslmftedvvj2iho

zcfil commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 30% of total datacap - f02229460: 33.42%

Deal Data Replication

⚠️ 75.61% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

jamerduhgamer commented 1 year ago

Hi @amughal, please continue to lower the SP distribution (which is almost under 30%) and the deal data replication (which is lower than the last check). Willing to support the good trend.

jamerduhgamer commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacea5iiscwpfuqaltp2p4qixvtxwgppm34bwr3zyuicomwryvv3c2xy

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1kqdiokoeubyse4qpihf7yrpl7czx4qgupx3eyzi

Id

df7c998b-068e-48c8-ba7b-efedc0c2b9ee

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea5iiscwpfuqaltp2p4qixvtxwgppm34bwr3zyuicomwryvv3c2xy

amughal commented 1 year ago

@jamerduhgamer @xinaxu Thank you for the approval, appreciated.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 30% of total datacap - f02229460: 39.67%

Deal Data Replication

⚠️ 77.90% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 30% of total datacap - f02229460: 40.08%

Deal Data Replication

⚠️ 78.05% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

amughal commented 1 year ago

Adding another Service Provider: f02832654. Sealing has started, and the initial 100TB is complete.

herrehesse commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 30% of total datacap - f02229460: 41.23%

Deal Data Replication

⚠️ 77.37% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

herrehesse commented 1 year ago

One entity is storing more than the others; can you explain?

amughal commented 1 year ago

Others are catching up, as seen in the distribution load graph. The next SP to reach equal distribution will be f01697248.

[image: distribution load graph]

herrehesse commented 12 months ago

Will track, thank you.

amughal commented 12 months ago

> Will track, thank you.

You are welcome. I am running the check again, that should confirm my commitment per my last statement.

amughal commented 12 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 12 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 30% of total datacap - f02229460: 41.23%

Deal Data Replication

⚠️ 77.37% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

amughal commented 12 months ago

It seems the FIL backend takes time to generate fresh results. As of now, even though the SP has consumed about another 100TB, the load distribution still shows the same number. I will run this report again in a week.

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Request number 6

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

DataCap allocation requested

1.17PiB

Id

3e25ca0e-a7ed-45ba-9886-59462e2ed507

joshua-ne commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f01697248: 33.77%, f02229460: 35.48%

Deal Data Replication

⚠️ 81.31% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

joshua-ne commented 11 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacec6ixelkgz6gqtyaxt3dzy25aqtgd4guuqurkyhtd2aqhiyezndmu

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1xzff5xup63o5sygr2swp4zvcajg54lotliimdty

Id

3e25ca0e-a7ed-45ba-9886-59462e2ed507

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacec6ixelkgz6gqtyaxt3dzy25aqtgd4guuqurkyhtd2aqhiyezndmu

amughal commented 11 months ago

Thank you @junyaoren

amughal commented 11 months ago

Waiting for the next notary for the final sign off, thanks in advance.

amughal commented 11 months ago

Hello @Sunnyiscoming @MatrixStorage, Could you please help in the final signature? Thank you

psh0691 commented 11 months ago

checker:manualTrigger

filplus-checker-app[bot] commented 11 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 30% of total datacap - f01697248: 37.27%, f02229460: 33.61%

Deal Data Replication

⚠️ 82.30% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard.

psh0691 commented 11 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceah377rq7zzlylmcinpvtrki3oruhcnulyrtip7bv7m2ptgkvp5ru

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

3e25ca0e-a7ed-45ba-9886-59462e2ed507

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceah377rq7zzlylmcinpvtrki3oruhcnulyrtip7bv7m2ptgkvp5ru

large-datacap-requests[bot] commented 11 months ago

DataCap Allocation requested

Request number 6

Multisig Notary address

f02049625

Client address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

DataCap allocation requested

1.17PiB

Id

683d4911-da84-4402-95a7-dabc6e9eb6ca

amughal commented 11 months ago

Thank you @psh0691 for the second approval. However, after your approval, the above message went back to the requested stage. Do I need a third approval, or is this a system backend issue?

psh0691 commented 11 months ago

@amughal It seems that the allocable DataCap has run out. I reported it to the notary channel, and when the allocable DataCap is filled, I will probably have to sign it again.

amughal commented 11 months ago

@psh0691 Understood. Thank you.

psh0691 commented 11 months ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceabhc2tipccwzjy6jperh2r6nseqvy4kyfczm5r6bekjxetnyhdvc

Address

f1b4u4eclr63rjz2wqbnlso75vs5p5qp4rdmj45ai

Datacap Allocated

1.17PiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

683d4911-da84-4402-95a7-dabc6e9eb6ca

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceabhc2tipccwzjy6jperh2r6nseqvy4kyfczm5r6bekjxetnyhdvc

psh0691 commented 11 months ago

The allocable DC was filled, so re-signed.