filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] MongoStorage - CommonCrawl-2020-45 #1724

Closed amughal closed 4 months ago

amughal commented 1 year ago

Data Owner Name

Common Crawl

Data Owner Country/Region

United States

Data Owner Industry

Other

Website

https://commoncrawl.org/2020/11/october-2020-crawl-archive-now-available/

Social Media

None.

Total amount of DataCap being requested

1000 TiB

Weekly allocation of DataCap requested

50TiB

On-chain address for first allocation

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Custom multisig

Identifier

No response

Share a brief history of your project and organization

MongoStorage is an emerging Filecoin storage provider based in Southern California, USA. We are working through a plan to become an ESPA-certified provider soon. The founders have extensive experience in networks and systems and have attended multiple ESPA training sessions organized by PikNik in Las Vegas.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

MongoStorage is a participant in Slingshot V3, as both an SP and a data preparer.

Describe the data being stored onto Filecoin

The Common Crawl project is a corpus of web crawl data composed of over 50 billion web pages. This is the crawl archive for October 2020. The data was crawled between October 19th and November 1st and contains 2.71 billion web pages or 280 TiB of uncompressed content.

Where was the data currently stored in this dataset sourced from

Other

If you answered "Other" in the previous question, enter the details here

The data is available through the Common Crawl website. It has already been prepared as CAR files according to the Slingshot V3 requirements. As stated by Slingshot V3, participants are allowed to place this data on the BDE exchange for bidding. Once this request is approved, I will talk with the BDE team about placing this dataset for bidding.

How do you plan to prepare the dataset

singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

Primary data available through:
https://commoncrawl.org/2020/11/october-2020-crawl-archive-now-available/

The list of archived files is available in a compressed file, e.g.:
https://data.commoncrawl.org/crawl-data/CC-MAIN-2020-45/warc.paths.gz

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Daily

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent)

How will you be distributing your data to storage providers

HTTP or FTP server

How do you plan to choose storage providers

Slack, Big data exchange

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

xmcai2016 commented 1 year ago

To unblock Mongo as a Data Preparer in the absence of Spade, I asked Mongo to leverage BDE + LDN for the datasets he prepared months ago. The Slingshot community considers these Common Crawl datasets useful to store for the preservation of humanity's information.

Sunnyiscoming commented 1 year ago

The minimum Datacap requested is 500TB.

amughal commented 1 year ago

@Sunnyiscoming My total request is 100TB and my minimum weekly request is 50TB. Are you saying that for BDE data publishing, the minimum requirement is 500TiB?

xmcai2016 commented 1 year ago
image

@amughal you should ask for 1000 TiB. Each dataset copy is ~100 TiB and Slingshot encourages 10 copies for disaster resiliency. If you want to apply on behalf of all your Common Crawl datasets that you prepared, this number would be even higher.
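
The sizing in this comment can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the figures quoted in this thread (about 100 TiB per dataset copy, 10 copies, 50 TiB of weekly usage); the function names are illustrative and are not part of any Fil+ tooling.

```python
import math


def total_datacap_tib(copy_size_tib: float, copies: int) -> float:
    """Total DataCap needed to store `copies` replicas of one dataset."""
    return copy_size_tib * copies


def weeks_to_consume(total_tib: float, weekly_tib: float) -> int:
    """Whole weeks needed to use the total at a steady weekly rate."""
    return math.ceil(total_tib / weekly_tib)


# Figures taken from the thread: ~100 TiB per copy, 10 copies, 50 TiB per week.
total = total_datacap_tib(100, 10)
print(total)
print(weeks_to_consume(total, 50))
```

At the requested weekly rate, the full amount would take roughly 20 weeks to consume, assuming deals go out at a constant pace.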

amughal commented 1 year ago

Thank you for the guidance. That makes sense, and now I understand what @Sunnyiscoming was suggesting. Let me update this request to reflect the current dataset, which is ready. I will post a request for the next round as more datasets become fully ready. Thank you.

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

Sunnyiscoming commented 1 year ago

Datacap Request Trigger

Total DataCap requested

1000TiB

Expected weekly DataCap usage rate

50TiB

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

DataCap allocation requested

25TiB

Id

04f1a47f-2177-44dd-a365-7e4c378aae43

Joss-Hua commented 1 year ago

This LDN covers a dataset related to the well-known MoonLanding project (Slingshot V3). The links and messages above provide preliminary evidence that the data matches the application. I support it and wish you all the best.

Joss-Hua commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecivsf47ktrf5h4kww2ilfctpxoa5clkryuiqy32nfupqd47vvn5e

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

25.00TiB

Signer Address

f1tfg54zzscugttejv336vivknmsnzzmyudp3t7wi

Id

04f1a47f-2177-44dd-a365-7e4c378aae43

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecivsf47ktrf5h4kww2ilfctpxoa5clkryuiqy32nfupqd47vvn5e

liyunzhi-666 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceay26rg4lklh2jpwdmkeua5vzbs7zren6h6oft7ju7nmflyetk224

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

25.00TiB

Signer Address

f1pszcrsciyixyuxxukkvtazcokexbn54amf7gvoq

Id

04f1a47f-2177-44dd-a365-7e4c378aae43

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceay26rg4lklh2jpwdmkeua5vzbs7zren6h6oft7ju7nmflyetk224

liyunzhi-666 commented 1 year ago

I have heard of the Moon Landing project before and would like to support @amughal and @xmcai2016

amughal commented 1 year ago

Thank you all, appreciated.

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

DataCap allocation requested

50TiB

Id

c88fb01f-6f1c-4717-a914-6c2c598edab5

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Rule to calculate the allocation request amount

100% of weekly dc amount requested

DataCap allocation requested

50TiB

Total DataCap granted for client so far

25TiB

Datacap to be granted to reach the total amount requested by the client (1000 TiB)

975TiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 506 | 3 | 25TiB | 56.68 | 5.41TiB |

amughal commented 1 year ago

@Sunnyiscoming @Kevin-FF-USA @galen-mcandrew @raghavrmadya @simonkim0515 Hello all, there seems to be a DataCap issue with this approval. I started sending large deals from this LDN to the SPs over the last two days, but as of this morning they are failing. The status is asking for a signature again. Is this a weekly allocation issue or a tranche issue? I am trying to understand. Because I am using a SaaS provider, I need to send deals ASAP. Any help is appreciated. Thanks

amughal commented 1 year ago

My initial request was to allocate 50TB per week. Can I get that increased to 100TB, please?

liyunzhi-666 commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.

amughal commented 1 year ago

@fabriziogianni7 @liyunzhi-666 @ Hello Notaries. I need the next tranche for this LDN. 1) I accidentally mixed a small set of CAR files with another LDN, but I will make sure this won't happen again. 2) In the next round of data sealing, the next two SPs are fully geo-diverse (Asia and US East Coast).

Please let me know if you have any questions.

Thanks

liyunzhi-666 commented 1 year ago

That's OK. But I supported your application in the last round, and by definition I shouldn't support you in two consecutive rounds, so you should look for another notary. @amughal

amughal commented 1 year ago

> That's OK. But I supported your application in the last round, and by definition I shouldn't support you in two consecutive rounds, so you should look for another notary. @amughal

Okay, thanks @liyunzhi-666, appreciated. I will reach out to others.

amughal commented 1 year ago

Checking with other Notaries. Hello, @simonkim0515 @xinaxu @kevzak Could someone please approve the next tranche?

Thanks

jamerduhgamer commented 1 year ago

@amughal is a previous ESPA participant and reputable in the ecosystem. Dataset is a public dataset so that checks out as well.

Approving the next DataCap tranche; however, I would like to see more replication across more SPs going forward.

jamerduhgamer commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebzzgpb5yts6nspipawuwwlwt4hxkyd2sbn2uotwswgodi5elwxae

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

50.00TiB

Signer Address

f1kqdiokoeubyse4qpihf7yrpl7czx4qgupx3eyzi

Id

c88fb01f-6f1c-4717-a914-6c2c598edab5

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzzgpb5yts6nspipawuwwlwt4hxkyd2sbn2uotwswgodi5elwxae

amughal commented 1 year ago

Thank you @jamerduhgamer. The next tranche will definitely be hosted by another SP and will also showcase geo redundancy.

ipollo00 commented 1 year ago

@amughal Hi there

  1. It would be clearer if you could give us details about the SP IDs you will work with for the next round.
  2. You mentioned that you will have 10 copies. How will you improve data replication?

amughal commented 1 year ago

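> @amughal Hi there
>
> 1. It would be clearer if you could give us details about the SP IDs you will work with for the next round.
>
> 2. You mentioned that you will have 10 copies. How will you improve data replication?

Hi @ipollo00, the next SPs are:

  • South Korea, miner id is f01697248. Waiting for this tranche to start sealing.
  • US East Coast, miner id is f01717477. He is also waiting for the next tranche.

Thanks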

cryptowhizzard commented 1 year ago

> > @amughal Hi there
> >
> > 1. It would be clearer if you could give us details about the SP IDs you will work with for the next round.
> >
> > 2. You mentioned that you will have 10 copies. How will you improve data replication?
>
> Hi @ipollo00, the next SPs are:
>
>   • South Korea, miner id is f01697248. Waiting for this tranche to start sealing.
>   • US East Coast, miner id is f01717477. He is also waiting for the next tranche.
>
> Thanks

Can you confirm that these SPs are really present at these locations, and that they are not merely using a VPN setup to disguise their location and exploit Fil+ to get extra DataCap while actually located in Asia?

amughal commented 1 year ago

I will definitely ask them for more clarification.

amughal commented 1 year ago

Hi @ipollo00 , I have received detailed replies from both SPs, pictures attached as well.

South Korea (f01697248): I have the traceroute sourcing the Boost IP address and it does show Korea. I have also received pictures of their data center.

image

US East Coast (f01717477): Location: Atlanta USA, Sungard DC. ISP is Unitas Global

image

ipollo00 commented 1 year ago

DD: Both miners are reachable. However, f01697248 has received deals from 5 clients, and the reports for two of them show data that is not retrievable. That may have been a long time ago. The reports for two of the clients show a low retrieval rate, but this is acceptable from my end. I will keep following up on the retrieval rate in this application. f01717477 has received deals from one client (in my opinion), and the result shown in the report was acceptable. I am willing to sign for this round. Based on the guidelines, if those two SPs do not appear in the next tranche, I'm afraid that notaries may not support this application in the future.

ipollo00 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacecoz32dgq2ejzmhift3oiuizk54vctfjm5uwvi454wfj2kuce2dve

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

50.00TiB

Signer Address

f1n5wlrrhoxpkgwij25xrtt7w7g2k3fhbthmdn6ri

Id

c88fb01f-6f1c-4717-a914-6c2c598edab5

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecoz32dgq2ejzmhift3oiuizk54vctfjm5uwvi454wfj2kuce2dve

amughal commented 1 year ago

Thanks @ipollo00 .

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 3

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

DataCap allocation requested

100TiB

Id

d03a8e9f-bcb8-423d-bbcb-d58057309f8c

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Rule to calculate the allocation request amount

200% of weekly dc amount requested

DataCap allocation requested

100TiB

Total DataCap granted for client so far

4547.5YiB

Datacap to be granted to reach the total amount requested by the client (1000 TiB)

4547.5YiB
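
The two 4547.5YiB totals in this report appear inconsistent with the allocations recorded earlier in the thread (25 TiB, then 50 TiB). A minimal sketch of the expected running totals follows. It assumes each tranche is granted in full before the next is requested. The 100% and 200% rules are the ones quoted by the bot in this thread; the 50% figure for the first request is only inferred from the 25 TiB first allocation. The function name is illustrative and is not the bot's actual implementation.

```python
def tranche_schedule(total_tib, weekly_tib, rule_percents):
    """Return one (requested, granted_so_far, remaining_to_total) row per request.

    Assumes every requested tranche is granted in full before the next request.
    """
    rows = []
    granted = 0.0
    for pct in rule_percents:
        requested = weekly_tib * pct / 100
        rows.append((requested, granted, total_tib - granted))
        granted += requested
    return rows


# 1000 TiB total and 50 TiB weekly come from the application form.
# 50% for request 1 is an inference; 100% and 200% are quoted in the thread.
for number, row in enumerate(tranche_schedule(1000, 50, [50, 100, 200]), start=1):
    requested, granted_so_far, remaining = row
    print(number, requested, granted_so_far, remaining)
```

Under these assumptions, request number 3 would show 75 TiB granted so far and 925 TiB still to be granted, which matches the pattern of the request number 2 report (25 TiB and 975 TiB).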

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 2086 | 6 | 50TiB | 53.49 | 11.44TiB |

jamerduhgamer commented 1 year ago

The client has responded transparently to the request to include more SPs and more geo locations. They have also done their best to address the VPN concern. As long as the retrievability concern is addressed, I am willing to continue supporting.

However, I will not approve this next tranche, as I was a part of the previous DataCap tranche allocation.

amughal commented 1 year ago

Hello @fabriziogianni7

Could you please help in signing off this tranche?

Thank you

herrehesse commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 50% of total datacap - f01697248: 59.65%

Deal Data Replication

⚠️ 99.37% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

herrehesse commented 1 year ago

@amughal what is your plan to solve:

⚠️ 99.37% of deals are for data replicated across less than 4 storage providers.

amughal commented 1 year ago

Hello @herrehesse. Per the above report, the data is already hosted with 6 unique providers. As more tranches become available, I hope to further increase the number of SPs and the geo diversity.

Bennyyangpu commented 1 year ago

The report isn't perfect, but I'm willing to support it this round. Hopefully, we'll see improvement in the next round.

Bennyyangpu commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecysk7ypf4r2x7ogzdigte6mxep4352m7uv5oh64opyipoo4bmh62

Address

f1c6huyblzf4s42mwxp5g7hlse4vmxeqjxv4idldy

Datacap Allocated

100.00TiB

Signer Address

f174fg3bqbln3zjnkxtyf6s54txqkr7yqkj6cig7y

Id

d03a8e9f-bcb8-423d-bbcb-d58057309f8c

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecysk7ypf4r2x7ogzdigte6mxep4352m7uv5oh64opyipoo4bmh62

amughal commented 1 year ago

Thank you @Aifabot-Cloud

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 50% of total datacap - f01697248: 59.65%

Deal Data Replication

⚠️ 99.37% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.