Blockchain-World-News / FIL-DC-Allocator

FIL+
Mongo2Stor - CommonCrawl #3

Open amughal opened 6 months ago

amughal commented 6 months ago

Version

1

DataCap Applicant

Mongo2Stor

Project ID

CommonCrawl

Data Owner Name

Common Crawl

Data Owner Country/Region

United States

Data Owner Industry

Not-for-Profit

Website

https://data.commoncrawl.org

Social Media Handle

https://twitter.com/commoncrawl

Social Media Type

Twitter

What is your role related to the dataset

Data Preparer

Total amount of DataCap being requested

5

Unit for total amount of DataCap being requested

PiB

Expected size of single dataset (one copy)

1

Unit for expected size of single dataset

PiB

Number of replicas to store

5

Weekly allocation of DataCap requested

400

Unit for weekly allocation of DataCap requested

TiB
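
The request figures above are tied together by simple arithmetic: one copy of the dataset times the number of replicas gives the total, and dividing the total by the weekly rate gives a rough duration. A minimal sketch, assuming binary units (1 PiB = 1024 TiB); the function names are illustrative and not part of any Fil+ tooling.

```python
TIB_PER_PIB = 1024  # assumption: binary units, 1 PiB = 1024 TiB


def total_datacap_tib(single_copy_pib: float, replicas: int) -> float:
    """Total DataCap in TiB needed to store `replicas` copies of one dataset."""
    return single_copy_pib * replicas * TIB_PER_PIB


def weeks_to_consume(total_tib: float, weekly_tib: float) -> float:
    """Weeks needed to use the total at a constant weekly rate."""
    return total_tib / weekly_tib


# Figures from this application: 1 PiB per copy, 5 replicas, 400 TiB per week.
total = total_datacap_tib(single_copy_pib=1, replicas=5)  # 5120 TiB, i.e. 5 PiB
weeks = weeks_to_consume(total, weekly_tib=400)           # 12.8 weeks
```

At the requested weekly rate, the full 5 PiB would take roughly 13 weeks to consume, assuming sealing proceeds at a constant pace.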

On-chain address for first allocation

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Mongo2Stor (MongoStorage) works as a Storage Service Provider and offers DataPrep and consulting services in the Filecoin ecosystem. Based in Southern California, USA, Mongo2Stor is FIL Green GOLD certified and is currently working toward becoming a fully ESPA-certified provider. The founders have extensive experience in networks and systems, have gone through multiple sessions and presentations at ESPA, and were featured in the Zero to One Service Provider Twitter session by Protocol Labs.

This LDN request is a follow-up to #2040, which has been a great success. Data has been stored with prominent Service Providers such as Seal Storage, Simple IPFS Inc. (#1 Ranking QAP), Aligned SaaS provider, PikNik (Medula) and many others.

CommonCrawl has published new monthly archives since the launch of LDN #2040, and a year's worth of data since then needs to be archived and made available on the Filecoin network.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

https://data.commoncrawl.org/crawl-data/index.html
CC-MAIN-2024-10     February/March 2024     3.16    123.50
CC-MAIN-2023-40     September/October 2023  3.40    134.25
CC-MAIN-2023-23     May/June 2023   3.10    119.28
CC-MAIN-2023-06     January/February 2023   3.35    121.19
CC-MAIN-2022-49     November/December 2022  3.35    127.89
CC-MAIN-2022-40     September/October 2022  3.15    115.63
CC-MAIN-2022-33     August 2022     2.55    96.52
CC-MAIN-2022-27     June/July 2022  3.10    116.34
CC-MAIN-2021-43     October 2021    3.30    119.83
CC-MAIN-2021-25     June 2021   2.45    83.79
CC-MAIN-2021-21     May 2021    2.60    93.66

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer. What is your location (Country/Region)

United States

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

Singularity is an excellent tool for CAR generation. I have used it extensively for the other LDN application.

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

No. We have checked, and no data from these datasets has been sealed before.

Please share a sample of the data

CommonCrawl creates and compresses data indexes and original files in multiple files. Links in these files can be retrieved individually.

| Data Type | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | segment.paths.gz | 100 | |
| WARC | warc.paths.gz | 90000 | 99.25 |
| WAT | wat.paths.gz | 90000 | 22.99 |
| WET | wet.paths.gz | 90000 | 9.30 |
| Robots.txt | robotstxt.paths.gz | 90000 | 0.18 |
| Non-200 responses | non200responses.paths.gz | 90000 | 3.43 |
| URL index | cc-index.paths.gz | 302 | 0.25 |
| Columnar URL index | cc-index-table.paths.gz | 900 | 0.28 |
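
To get a feel for the size of one crawl, the per-type compressed sizes in the table can be added up. The sketch below only re-uses the numbers listed above; the Segments row has no size in the table, so it is left out, and the variable names are illustrative.

```python
# Compressed sizes in TiB, copied from the table above.
# The Segments row lists no size, so it is not included.
sizes_tib = {
    "WARC": 99.25,
    "WAT": 22.99,
    "WET": 9.30,
    "Robots.txt": 0.18,
    "Non-200 responses": 3.43,
    "URL index": 0.25,
    "Columnar URL index": 0.28,
}

# Total compressed size of the sample crawl.
total_tib = sum(sizes_tib.values())

# Fraction of the total taken up by the WARC files alone.
warc_share = sizes_tib["WARC"] / total_tib

print(f"total: {total_tib:.2f} TiB, WARC share: {warc_share:.1%}")
```

The total comes to about 135.68 TiB for this sample crawl, with the WARC files making up most of it.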

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Asia other than Greater China, Africa, North America, South America, Europe, Australia (continent), Antarctica

How will you be distributing your data to storage providers

HTTP or FTP server

How did you find your storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you used

No response

Please list the provider IDs and location of the storage providers you will be working with.

f02853198, South America
f01904546, South Korea
f01697248, South Korea
f02846602, USA
f01945089, USA

We are also working with another SP in South America and one in Europe.

How do you plan to make deals to your storage providers

Boost client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

datacap-bot[bot] commented 6 months ago

Application is waiting for allocator review

psh0691 commented 6 months ago

Thank you for applying. The allocation tooling will be updated this week, so please bear with us if verification and allocation are delayed a little.

  1. Have you applied for the same DC in a previous LDN?
  2. Were you allocated all of the DC you previously applied for?
  3. Please share the link to the previous LDN in which you applied for the same content.
  4. If the wallet address is the same as the one used in the previous request, an error may occur, so please provide a different wallet address.

datacap-bot[bot] commented 6 months ago

Datacap Request Trigger

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

400TiB

DataCap Amount - First Tranche

50TiB

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

datacap-bot[bot] commented 6 months ago

DataCap Allocation requested

Multisig Notary address

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

DataCap allocation requested

50TiB

Id

6a4dbabe-13d2-43bc-9a81-81b74c2dbb67

datacap-bot[bot] commented 6 months ago

Application is ready to sign

amughal commented 6 months ago

Hello @psh0691 Thank you for your quick response, appreciated. Please see inline replies to the questions:

  1. Have you applied for the same DC in a previous LDN? A. No, the previous LDN 2040 had a different DC.

  2. Were you allocated all of the DC you previously applied for? A. The previous LDN was allocated 90% of its DC. I believe that, as the Allocators are now approving new applications and the corresponding DC, the remaining DC on the previous LDN will not be processed. I could be wrong on this; please let me know if you have more details.

  3. Please share the link to the previous LDN in which you applied for the same content. A. This is a completely different content archive from CommonCrawl. The GitHub link for the previous LDN is: https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2040

  4. If the wallet address is the same as the one used in the previous request, an error may occur, so please provide a different wallet address. A. It is different.

Please let me know if there are any further questions. Looking forward and thank you again.

amughal commented 6 months ago

Thank you for the approval, @psh0691. Would it be fine to start by sealing the first 50TB on the first miner, and then seal to the next two miners in the second tranche? That way, over two tranches, 50TB will be sealed on each of the three miners. Please let me know.

psh0691 commented 6 months ago

@amughal I thought it would be allocated with multiple signatures, but it was allocated with the first signature.

This is my first experience as an allocator, so please bear with me if anything is lacking.

The checker bot runs regardless of the allocator's intent, so the next round will be triggered and allocated normally only if the data is distributed according to the FIL+ rules.

psh0691 commented 6 months ago

In the next round, we will see if we can adjust the DC Amount and help you.

amughal commented 6 months ago

Thank you @psh0691

amughal commented 5 months ago

@psh0691 Just an update: due to the STFIL issue, a few SPs have had further sealing put on hold. One SP has successfully sealed about 55% of the 50TB allocation.

amughal commented 5 months ago

@psh0691 One SP, "f02846602" on the US west coast, can seal more data. Could you please approve a larger next tranche? Thank you.

psh0691 commented 5 months ago

@amughal You have currently used 55.5% of the DC. Once you have used 75% or more, we will look at it when the refill is auto-triggered.
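
The 75% rule mentioned here can be written as a small check. This is only a sketch of the rule as stated in this comment, not the datacap-bot's actual implementation; it assumes the 55.5% refers to the first 50 TiB tranche, and the function names are hypothetical.

```python
def usage_fraction(used_tib: float, allocated_tib: float) -> float:
    """Fraction of the allocated DataCap that has been used."""
    if allocated_tib <= 0:
        raise ValueError("allocated_tib must be positive")
    return used_tib / allocated_tib


def refill_due(used_tib: float, allocated_tib: float, threshold: float = 0.75) -> bool:
    """True once usage reaches the threshold ("75% or more")."""
    return usage_fraction(used_tib, allocated_tib) >= threshold


# 55.5% of a 50 TiB tranche is 27.75 TiB: below the threshold.
print(refill_due(27.75, 50))  # False
# 37.5 TiB of 50 TiB is exactly 75%: the refill would be due.
print(refill_due(37.5, 50))   # True
```

Under these assumptions, another 9.75 TiB would need to be sealed from the 50 TiB tranche before the next tranche is triggered.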

amughal commented 5 months ago

Okay thanks for your support. I will complete 75% sealing on this miner.

amughal commented 5 months ago

> @amughal You have currently used 55.5% of the DC. Once you have used 75% or more, we will look at it when the refill is auto-triggered.

@psh0691 Please take a look.

datacap-bot[bot] commented 5 months ago

Application is in Refill

datacap-bot[bot] commented 5 months ago

DataCap Allocation requested

Multisig Notary address

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

DataCap allocation requested

400TiB

Id

ec1bbcc1-8369-49a8-ae7d-8be7588ff8ed

datacap-bot[bot] commented 5 months ago

Application is ready to sign

datacap-bot[bot] commented 5 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebzsqg7sm5xt4onqrgnyiwbsnwjpvwt5oocvhqd3ritrfxfnzadug

Address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Datacap Allocated

400TiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

ec1bbcc1-8369-49a8-ae7d-8be7588ff8ed

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebzsqg7sm5xt4onqrgnyiwbsnwjpvwt5oocvhqd3ritrfxfnzadug

datacap-bot[bot] commented 5 months ago

Application is Granted

amughal commented 5 months ago

Thank you @psh0691

datacap-bot[bot] commented 5 months ago

Application is in Refill

datacap-bot[bot] commented 5 months ago

DataCap Allocation requested

Multisig Notary address

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

DataCap allocation requested

800TiB

Id

c22085be-5c52-4bcc-8a5c-0a65618555a5

datacap-bot[bot] commented 5 months ago

Application is ready to sign

techgood123 commented 5 months ago

checker:manualTrigger

datacap-bot[bot] commented 5 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 40% of total datacap - f02846602: 100.00%

⚠️ All storage providers are located in the same region.

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

amughal commented 5 months ago

Due to the STFIL situation, I am waiting for the other two miners, which have been locked so far. I am hoping that in a week's time they will be able to restart sealing.

datacap-bot[bot] commented 5 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedv3mo7s5d4fw4u4sn62nimhivaf2c45vwayglqepfj54otodlul6

Address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Datacap Allocated

800TiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

c22085be-5c52-4bcc-8a5c-0a65618555a5

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedv3mo7s5d4fw4u4sn62nimhivaf2c45vwayglqepfj54otodlul6

datacap-bot[bot] commented 5 months ago

Application is Granted

amughal commented 5 months ago

Thank you @psh0691 . Appreciated.

psh0691 commented 5 months ago

checker:manualTrigger

datacap-bot[bot] commented 5 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 1 storage providers sealed more than 40% of total datacap - f01904546: 67.60%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

filecoin-watchdog commented 4 months ago

checker:manualTrigger

datacap-bot[bot] commented 4 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 40% of total datacap - f01697248: 44.02%, f01904546: 43.62%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

psh0691 commented 4 months ago

We are not currently receiving an allocation refill from the Fil+ team. Could you please wait a little longer?

amughal commented 4 months ago

Hello @psh0691 Thank you for the information. I do have datacap available in the current tranche. So no problem for now.

Thank you

psh0691 commented 4 months ago

@amughal Could you please refer to the Fil+ team's evaluation? https://github.com/filecoin-project/Allocator-Governance/issues/17#issue-2302765436

amughal commented 3 months ago

checker:manualTrigger

datacap-bot[bot] commented 3 months ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

⚠️ 2 storage providers sealed more than 40% of total datacap - f01904546: 43.69%, f01697248: 43.22%

⚠️ 66.67% of Storage Providers have retrieval success rate equal to zero.

⚠️ The average retrieval success rate is 0.02%

Deal Data Replication

⚠️ 100.00% of deals are for data replicated across less than 4 storage providers.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report.

amughal commented 3 months ago

Hello @psh0691 . Please check the comments. Let us know if you have any questions. https://github.com/filecoin-project/Allocator-Governance/issues/17#issue-2302765436

amughal commented 3 months ago

Hello @psh0691 Galen-mcandrew has closed https://github.com/filecoin-project/Allocator-Governance/issues/17#issue-2302765436 as Completed. Please check as well.

Please let me know what else is needed. I am requesting the next tranche.

Thank you

hyunmoon commented 2 months ago

After the IPNI indexing issue was fixed, the retrieval success rate over the past 24 hours has started to reach the expected level. The low score is because all active deals are sampled, not just the deals from the Blockchain World News allocator. I have kept every unsealed copy since the beginning of this application, so the retrieval success rate for the allocator should be close to 100%.

Spark: https://spacemeridian.grafana.net/public-dashboards/32c03ae0d89748e3b08e0f08121caa14?orgId=1&from=now-24h&to=now

psh0691 commented 2 months ago

@hyunmoon Thank you, and I'm glad that the retrieval success rate has increased.

From here on, please follow the FIL+ distribution rules (2+ continents, 4 SPs, less than 40% for 1 SP). In the next round, the databot will check all of the DC allocated since the beginning.

Please make sure to meet these rules.
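
The three rules listed in this comment can be expressed as a simple check over a list of deals. The sketch below is an illustration of those rules only, not the checker bot's real logic; the function name, the deal tuple layout and the provider IDs in the example are all hypothetical.

```python
from collections import defaultdict


def check_distribution(deals, min_continents=2, min_sps=4, max_share=0.40):
    """Return a list of rule violations for (provider_id, continent, size_tib) deals."""
    per_sp = defaultdict(float)
    continents = set()
    for provider, continent, size_tib in deals:
        per_sp[provider] += size_tib
        continents.add(continent)

    total = sum(per_sp.values())
    problems = []
    if len(continents) < min_continents:
        problems.append(f"only {len(continents)} continent(s), need {min_continents}+")
    if len(per_sp) < min_sps:
        problems.append(f"only {len(per_sp)} SP(s), need {min_sps}")
    for provider, size_tib in sorted(per_sp.items()):
        # Each SP must hold less than max_share of the total.
        if total > 0 and size_tib / total >= max_share:
            problems.append(f"{provider} holds {size_tib / total:.2%} of the data")
    return problems


# Hypothetical example: four SPs on two continents, 25% each.
balanced = [
    ("sp-a", "North America", 10.0),
    ("sp-b", "North America", 10.0),
    ("sp-c", "Asia", 10.0),
    ("sp-d", "Asia", 10.0),
]
print(check_distribution(balanced))  # []
```

A distribution with everything on a single SP would fail all three checks, which matches the kind of warnings shown in the checker reports in this thread.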

psh0691 commented 2 months ago

So far, the use of allocated DCs has not reached 75%. If you use more than 75%, you will automatically receive a trigger request.

Screenshot, 2024-07-09 12-59-55

amughal commented 2 months ago

@psh0691 We have started sealing the deals again. I am hoping databot will kick in early next week or before.

datacap-bot[bot] commented 2 months ago

Client used 75% of the allocated DataCap. Consider allocating next tranche.

amughal commented 2 months ago

@psh0691 requesting next allocation. Thanks

datacap-bot[bot] commented 2 months ago

Application is in Refill

datacap-bot[bot] commented 2 months ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceaidbvjn2m36vnhuya2cetndj7uvhbj2lmwolhzh2pney42xdkyxi

Address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Datacap Allocated

1PiB

Signer Address

f1qdko4jg25vo35qmyvcrw4ak4fmuu3f5rif2kc7i

Id

d2437325-e036-40cc-ac1e-48123c0bdba1

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaidbvjn2m36vnhuya2cetndj7uvhbj2lmwolhzh2pney42xdkyxi

datacap-bot[bot] commented 2 months ago

Application is Granted
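
As a rough tally, the tranches mentioned in this thread can be summed and compared against the 5 PiB requested. This sketch assumes binary units (1 PiB = 1024 TiB) and assumes the first 50 TiB tranche was granted, which the thread implies but does not show with an approval message.

```python
TIB_PER_PIB = 1024  # assumption: binary units

# Tranche sizes in TiB as they appear in the thread: 50 TiB, 400 TiB, 800 TiB, 1 PiB.
tranches_tib = [50, 400, 800, 1 * TIB_PER_PIB]
requested_tib = 5 * TIB_PER_PIB

granted_tib = sum(tranches_tib)                 # 2274 TiB
remaining_tib = requested_tib - granted_tib     # 2846 TiB
granted_fraction = granted_tib / requested_tib  # about 0.44

print(f"granted {granted_tib} TiB of {requested_tib} TiB ({granted_fraction:.1%})")
```

Under these assumptions, a little under half of the requested DataCap has been granted so far.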