filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] Mongo2Stor #2319

Open amughal opened 9 months ago

amughal commented 9 months ago

Data Owner Name

Mongo2Stor

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Not-for-Profit

Website

https://data.commoncrawl.org

Social Media

https://twitter.com/commoncrawl (commoncrawl)

Total amount of DataCap being requested

5 PiB

Expected size of single dataset (one copy)

1 PiB

Number of replicas to store

5

Weekly allocation of DataCap requested

500 TiB
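As a quick sanity check of these figures (my own arithmetic, not part of the application form): five replicas of a 1 PiB dataset total 5 PiB of DataCap, which at 500 TiB/week takes roughly ten weeks to onboard.

```python
# Sanity-check the requested DataCap figures (illustrative only).
PIB = 1024  # TiB per PiB

dataset_tib = 1 * PIB  # one copy of the dataset
replicas = 5           # number of replicas requested
weekly_tib = 500       # requested weekly allocation

total_tib = dataset_tib * replicas
weeks = total_tib / weekly_tib

print(total_tib)  # 5120 TiB == 5 PiB total
print(weeks)      # 10.24 weeks at the requested weekly rate
```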

On-chain address for first allocation

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

n/a

Share a brief history of your project and organization

Mongo2Stor (MongoStorage) provides storage, data preparation (DataPrep), and consulting services in the Filecoin ecosystem. Based in Southern California, USA, Mongo2Stor is FIL Green GOLD certified and is currently working toward full ESPA certification. The founders have extensive experience in networks and systems, have presented in multiple ESPA sessions, and were featured in Protocol Labs' "Zero to One Service Provider" Twitter session. This LDN request is a follow-up to #2040, which has been a great success: data was stored with prominent storage providers such as Seal Storage, Simple IPFS Inc. (#2 ranking), Aligned (a SaaS provider), PikNik (Medula), and many others. CommonCrawl has published new monthly archives since the launch of LDN #2040, so roughly a year's worth of data now needs to be archived and made available on the Filecoin network.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

n/a

Describe the data being stored onto Filecoin

https://data.commoncrawl.org/crawl-data/index.html CC-MAIN-2023-50 CC-MAIN-2023-40 CC-MAIN-2023-23 CC-MAIN-2023-14 CC-MAIN-2022-49 CC-MAIN-2022-40 CC-MAIN-2022-33 CC-MAIN-2022-27

Where was the data currently stored in this dataset sourced from

Other

If you answered "Other" in the previous question, enter the details here

Commoncrawl provided hosted services

If you are a data preparer, what is your location (City and Country)

Chino, USA

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

Singularity is an excellent tool for CAR generation; I have used it extensively for my other LDN application.
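For context on sizing the CAR output, here is a sketch of the standard Filecoin piece-size arithmetic (my own illustration, not Singularity's actual implementation): every 127 bytes of payload expand to 128 bytes under Fr32 padding, and the result is rounded up to a power-of-two piece size.

```python
import math

def padded_piece_size(payload_bytes: int) -> int:
    """Smallest power-of-two Filecoin piece that holds the payload
    after Fr32 padding (127 payload bytes -> 128 padded bytes)."""
    fr32 = math.ceil(payload_bytes * 128 / 127)
    return 1 << (fr32 - 1).bit_length()

# A CAR file slightly under 32 GiB * 127/128 still fits a 32 GiB piece:
payload = 31 * 1024**3  # 31 GiB of prepared data
print(padded_piece_size(payload) // 1024**3)  # 32 (GiB)
```

This is why preparation tools target CAR files a bit below the sector's payload capacity rather than the full sector size.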

If you are not preparing the data, who will prepare the data? (Provide name and business)

n/a

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

n/a

Please share a sample of the data

| Data Type | File List | #Files | Total Size Compressed (TiB) |
|---|---|---|---|
| Segments | segment.paths.gz | 100 | |
| WARC | warc.paths.gz | 90000 | 99.25 |
| WAT | wat.paths.gz | 90000 | 22.99 |
| WET | wet.paths.gz | 90000 | 9.30 |
| Robots.txt | robotstxt.paths.gz | 90000 | 0.18 |
| Non-200 responses | non200responses.paths.gz | 90000 | 3.43 |
| URL index | cc-index.paths.gz | 302 | 0.25 |
| Columnar URL index | cc-index-table.paths.gz | 900 | 0.28 |

Confirm that this is a public dataset that can be retrieved by anyone on the Network

Yes

If you chose not to confirm, what was the reason

n/a

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

More than 3 years

In which geographies do you plan on making storage deals

North America, South America, Europe, Australia (continent), Africa, Asia other than Greater China

How will you be distributing your data to storage providers

HTTP or FTP server

How do you plan to choose SP

Big Data Exchange

If you answered "Others" in the previous question, what is the tool or platform you plan to use

n/a

If you already have a list of storage providers to work with, fill out their names and provider IDs below

- Bitsultans, f02853198
- Simple IPFS Inc., f01904546, f01697248

How do you plan to make deals to your storage providers

Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

n/a

Can you confirm that you will follow the Fil+ guideline

Yes

Application created via filplus.storage

large-datacap-requests[bot] commented 9 months ago

Thanks for your request!

Heads up, you’re requesting more than the typical weekly onboarding rate of DataCap!

large-datacap-requests[bot] commented 9 months ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

data-programs commented 9 months ago

KYC

This user’s identity has been verified through filplus.storage

Sunnyiscoming commented 9 months ago

⚠️ f01697248 has sealed 40.34% of total datacap.

⚠️ f02846602 has unknown IP location.

  1. Please explain the anomalies flagged in https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2040.

> If you already have a list of storage providers to work with, fill out their names and provider IDs below
> Bitsultans, f02853198; Simple IPFS Inc., f01904546, f01697248

  2. Best practice for storing large datasets is, ideally, to store the data in 3 or more regions, with 4 or more storage provider operators or owners. Please list the Miner ID, business entity, and location of the SPs you will cooperate with.
github-actions[bot] commented 8 months ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

-- Commented by Stale Bot.

amughal commented 8 months ago

Hello @Sunnyiscoming, please see below the list of SPs, distributed across three continents.

| SP Miner IDs | Contact name | SP Business Email | SP Organization Name | Region | Using VPN? | Slack handle |
|---|---|---|---|---|---|---|
| f02846602 | Azher | contact@mongostorage.tech | Mongo2Stor | USA | No | mongo |
| f01697248 | Henry Moon | contact@simpleipfs.com | Simple IPFS Inc. | South Korea | No | hyunmoon |
| f01904546 | Henry Moon | contact@simpleipfs.com | Simple IPFS Inc. | South Korea | No | hyunmoon |
| f02853198 | Diego Siwer | | Bitsultans | Argentina | No | Diego Siwer |

Regarding your questions:

> ⚠️ f01697248 has sealed 40.34% of total datacap.

This is correct; you should see more balanced results in the next 2 weeks. I have been actively sealing on miner f01904546 to rebalance.

> ⚠️ f02846602 has unknown IP location.

This issue has been fixed; the correct IP can be verified here: https://filfox.info/en/peer/12D3KooWPdhRZBjt6PoM9cjLgpxjUi4uaQXPKPE62zBNBe8CSydX

Please let me know if you have any further questions. Thank you.

Sunnyiscoming commented 8 months ago

The community rules state that each SP cannot store more than 30%. Why did you store 40.34% of the previous application's DataCap with the same SP? Please describe this application's DataCap allocation plan in detail.

amughal commented 7 months ago

Hello @Sunnyiscoming, sorry that I have not replied on this thread for some time; it was intentional. By way of background, I just ran the checker bot: I have been actively sealing to improve data distribution across two SPs. Previously the share was 40.34%; as of now it is 35.33%. The bot lags slightly, so in the next few days you should see it decrease to around 33%. I hope you appreciate that distributing 10 PiB across SPs globally is a challenging task, and I am not far from achieving this.

Similarly, for the current application, my aim is to distribute across the US west coast, South Korea, and South America. The SPs have been identified; three of them have the collateral ready, while one is working on it. As with LDN #2040, I will make sure that, per Fil+ rules, deals get distributed across 5 SPs.

Let me know if there are additional questions.
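To make the allocation plan concrete, here is a small check (my own illustration with hypothetical percentages; the 30% cap is the community rule cited earlier in this thread) that a per-SP split respects the limit:

```python
# Illustrative check of a per-SP DataCap split against the
# community's 30%-per-SP cap (percentages are hypothetical).
MAX_SHARE = 0.30

def split_ok(shares):
    """True if no single SP exceeds the cap and the shares sum to 1."""
    return max(shares) <= MAX_SHARE and abs(sum(shares) - 1.0) < 1e-9

even_five_way = [0.20] * 5      # 5 SPs at 20% each
print(split_ok(even_five_way))  # True

skewed = [0.40, 0.35, 0.25]     # one SP over the cap
print(split_ok(skewed))         # False
```

An even five-way split leaves ten percentage points of headroom per SP, which is why the plan targets five providers.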

large-datacap-requests[bot] commented 7 months ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f03016877

Client address

f1qwyhtmlfogwajktfabqvhqfxapiqozuxpwirmpa

DataCap allocation requested

500 TiB

Id

2f77fd67-58cf-4fe1-95cc-94f1058aec4d