filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] MongoStorage - CERN Opendata #1563

Closed amughal closed 11 months ago

amughal commented 1 year ago

Data Owner Name

CERN

Data Owner Country/Region

Switzerland

Data Owner Industry

Education & Training

Website

http://opendata.cern.ch/

Social Media

https://twitter.com/cernopendata

Total amount of DataCap being requested

1PiB

Weekly allocation of DataCap requested

100TiB

On-chain address for first allocation

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

Custom multisig

Identifier

No response

Share a brief history of your project and organization

MongoStorage is an emerging Filecoin storage provider. We are based in Southern California, USA, and are working through a plan to become an ESPA-certified provider soon. The founders have extensive experience in networks and systems and have attended multiple ESPA training sessions organized by PiKNiK in Las Vegas.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

The CERN Open Data portal is the access point to a growing range of data produced through the research performed at CERN. It disseminates the preserved output from various research activities and includes accompanying software and documentation needed to understand and analyze the data.
The portal adheres to established global standards in data preservation and Open Science: the products are shared under open licenses; they are issued with a Digital Object Identifier (DOI) to make them citable objects.

Where was the data currently stored in this dataset sourced from

Other

If you answered "Other" in the previous question, enter the details here

CERN data centers in Geneva, Switzerland.

How do you plan to prepare the dataset

singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

http://opendata.cern.ch/record/4900
http://opendata.cern.ch/record/24442

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Weekly

For how long do you plan to keep this dataset stored on Filecoin

More than 3 years

In which geographies do you plan on making storage deals

North America, South America, Europe

How will you be distributing your data to storage providers

HTTP or FTP server, IPFS

How do you plan to choose storage providers

Slack, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

No response

How do you plan to make deals to your storage providers

Boost client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

Sunnyiscoming commented 1 year ago

Can you provide a detailed description of your organization, MongoStorage, such as the website, established time, etc.?

Sunnyiscoming commented 1 year ago

Relevant application https://github.com/filecoin-project/filecoin-plus-client-onboarding/issues/2662 @jamerduhgamer Hi, notary. If you have more information, please disclose it here.

amughal commented 1 year ago

We are still working on a website, as this is a new organization. We have secured 50K FIL (about 700TiB raw) as collateral from Darma. PiKNiK earlier approved 100TiB, which we are in the process of onboarding to the SP along with other Fil+ deals as an ESPA participant. As we are buying larger storage units, this LDN request is a continuation of the earlier approval by James.

amughal commented 1 year ago

Here is the miner id: https://filfox.info/en/address/f01959735

jamerduhgamer commented 1 year ago

Hi @Sunnyiscoming, thanks for the tag! I previously approved @amughal for 90 TiB of this dataset as a proof of concept, but the client then revealed that there will be more than 100 TiB of data to approve, so I recommended they submit an LDN to cover the full amount of the dataset.

@amughal is an ESPA participant so I can verify that they are a trustworthy client and SP.

herrehesse commented 1 year ago

@amughal can you please update your application title to the format: Organization - Project Name

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

amughal commented 1 year ago

@herrehesse updated, please take a look.

herrehesse commented 1 year ago

Thank you!

simonkim0515 commented 1 year ago

Datacap Request Trigger

Total DataCap requested

1PiB

Expected weekly DataCap usage rate

100TiB

Client address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f01858410

Client address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

DataCap allocation requested

50TiB

Id

d2318a71-83c0-407c-8824-85c71723830a

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report[^1]

There is no previous allocation for this issue.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

cryptowhizzard commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacec7t2mxdiovij5zra6kiulcxgsmmd6pcep3jr2rou3johh3btwreo

Address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

Datacap Allocated

50.00TiB

Signer Address

f1krmypm4uoxxf3g7okrwtrahlmpcph3y7rbqqgfa

Id

d2318a71-83c0-407c-8824-85c71723830a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacec7t2mxdiovij5zra6kiulcxgsmmd6pcep3jr2rou3johh3btwreo

amughal commented 1 year ago

Hi @cryptowhizzard. I am trying to understand the rationale behind the 50TiB allocation; I initially requested 1PiB, so why the reduction? Is there any further information I could provide?

cryptowhizzard commented 1 year ago

Hi @amughal

https://github.com/filecoin-project/filecoin-plus-large-datasets

When clients use up > 75% of the prior DataCap allocation, a request for additional DataCap in the form of the next tranche is automatically kicked off ('subsequent allocation bot'). Notaries have access to on-chain data required to verify that the client is operating in good faith, in accordance with the principles of the program, and in line with their allocation strategy outlined in the original application. 2 notaries need to approve the next tranche of DataCap to be allocated to the client. The same notary cannot sign off on immediately subsequent allocations of DataCap, i.e., you need at minimum 4 notaries to support your application on an ongoing basis to receive multiple tranches of DataCap.

s0nik42 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaced2fxqrwoloxuwocthoisxdobe4an3rhe2md4q65zm5wd7f3tq2wm

Address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

Datacap Allocated

50.00TiB

Signer Address

f1wxhnytjmklj2czezaqcfl7eb4nkgmaxysnegwii

Id

d2318a71-83c0-407c-8824-85c71723830a

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaced2fxqrwoloxuwocthoisxdobe4an3rhe2md4q65zm5wd7f3tq2wm

amughal commented 1 year ago

Thank you for explaining the process and for the approval, much appreciated. I am using Aligned for sealing-as-a-service; daily growth over the last month, and especially the last week, has been about 5~7TiB (raw). The 75% threshold on the 50TiB allocation will be reached within 7 days, so I would have to go through many tranches to reach the requested 1PiB. I'm willing to do that, but it would create a lot of continuous noise. Given the sealing power, would it be worthwhile to double the allocation from 50 to 100TiB?
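The tranche arithmetic above can be sketched in a few lines. Note the doubling schedule and the cap at twice the weekly rate are assumptions inferred from the allocations visible in this thread (50TiB, then 100TiB), not the authoritative Fil+ rules:

```python
# Sketch of the subsequent-allocation schedule, assuming each tranche
# doubles (capped at twice the weekly rate) and the next tranche is
# triggered once 75% of the current one is used. Numbers mirror this
# thread: 1PiB (1024TiB) total, 100TiB/week, first tranche of 50TiB.

def tranche_schedule(total_tib, weekly_tib, first_tib):
    tranches = []
    granted = 0
    size = first_tib
    while granted < total_tib:
        size = min(size, total_tib - granted)  # never over-grant the total
        tranches.append(size)
        granted += size
        size = min(size * 2, 2 * weekly_tib)   # assumed doubling, capped
    return tranches

schedule = tranche_schedule(1024, 100, 50)
print(schedule)
print(len(schedule))  # 7 tranches to reach the full 1PiB
```

Under these assumptions the client would indeed go through seven approval rounds, which is the "continuous noise" concern raised above.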

herrehesse commented 1 year ago

@amughal Looking forward to the CID report once you have sealed the data properly. Also looking forward to the distribution!

Awesome that you are working with Aligned, they are respectable community members.

If everything is correct on your first few batches, your allocation will increase rapidly.

amughal commented 1 year ago

Excellent, @herrehesse. How many copies usually need to be distributed to the other SPs? And would those copies count against my allocation? Sorry for these stupid questions.

herrehesse commented 1 year ago

@amughal There is no such thing as a stupid question; let me assist you. To ensure proper distribution of data and prevent DataCap from solely benefiting one's own growth, the minimum number of copies currently required is 4-5, and no more than 20% of the allocation should go to any one miner ID. These copies consume the same amount of DataCap as the original data.

The main objective behind this requirement is to promote fairness and decentralisation within the network. By mandating a minimum number of copies and limiting the amount of allocation to one miner ID, we aim to prevent any single entity from monopolising the storage capacity and using it solely for their own gain. Instead, it encourages the distribution of valuable data across multiple nodes, ensuring the proper and efficient functioning of the network.
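These two constraints are easy to check mechanically. A minimal sketch of such a distribution check, where the deal-list format is an assumption for illustration and the thresholds are the 4-copy minimum and 20% cap quoted above:

```python
# Sketch: validate a client's deal distribution against the guidelines
# quoted above (>= 4 replicas per piece, <= 20% of the DataCap to any
# single miner ID). The (piece_cid, miner_id, size_tib) tuple format
# is a hypothetical representation, not the checker bot's actual input.

from collections import defaultdict

def check_distribution(deals, min_copies=4, max_share=0.20):
    """deals: list of (piece_cid, miner_id, size_tib) tuples."""
    per_piece = defaultdict(set)     # piece -> miners storing it
    per_miner = defaultdict(float)   # miner -> TiB received
    total = 0.0
    for cid, miner, size in deals:
        per_piece[cid].add(miner)
        per_miner[miner] += size
        total += size
    under_replicated = [c for c, m in per_piece.items() if len(m) < min_copies]
    over_allocated = [m for m, s in per_miner.items() if s > max_share * total]
    return under_replicated, over_allocated

# Toy example: one piece stored by 4 miners at 25% each.
deals = [("piece-1", f"f0{i}", 10.0) for i in range(1, 5)]
under, over = check_distribution(deals)
print(under)  # [] -- piece-1 meets the 4-copy minimum
print(over)   # all four miners exceed the 20% cap in this toy example
```

The toy example shows the two rules pulling in different directions: with only one piece and four miners, replication passes while every miner holds 25% of the total and trips the per-miner cap.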

herrehesse commented 1 year ago

You can read more about our proposal on setting stricter guidelines here: https://github.com/filecoin-project/notary-governance/issues/813

amughal commented 1 year ago

Thank you, appreciated.

Sunnyiscoming commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the full report.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 92.43% of deals are for data replicated across less than 2 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval report.

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

DataCap allocation requested

100TiB

Id

e359760c-f45d-4bfc-a35f-c17726332d4f

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f01858410

Client address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

Rule to calculate the allocation request amount

100% of weekly dc amount requested

DataCap allocation requested

100TiB

Total DataCap granted for client so far

50TiB

Datacap to be granted to reach the total amount requested by the client (1PiB)

974TiB

Stats

Number of deals: 1107
Number of storage providers: 5
Previous DC Allocated: 50TiB
Top provider: 46.94
Remaining DC: 11.20TiB

jamerduhgamer commented 1 year ago

Hi @amughal, are there any plans to spread the datacap allocation across more SPs?

amughal commented 1 year ago

Hi @jamerduhgamer, there are two more SPs in the pipeline waiting for this tranche approval. The next round of bot checks should show 7 SPs. Thank you.

jamerduhgamer commented 1 year ago

Okay, sounds good. Looking forward to seeing those 7 SPs in the next round. I will approve this next DataCap tranche.

jamerduhgamer commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebybbm4xuy46gbvpmcr764u23kii3xdiiwa7oyb5zfza7vpys4giq

Address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

Datacap Allocated

100.00TiB

Signer Address

f1kqdiokoeubyse4qpihf7yrpl7czx4qgupx3eyzi

Id

e359760c-f45d-4bfc-a35f-c17726332d4f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebybbm4xuy46gbvpmcr764u23kii3xdiiwa7oyb5zfza7vpys4giq

amughal commented 1 year ago

Hello @ipollo00,

Requesting next tranche. Could you please help?

Thanks

herrehesse commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 99.30% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 99.30% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

zcfil commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 99.30% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

zcfil commented 1 year ago

What is the reason for the data being replicated across fewer than 3 SPs and for the CID sharing observed in the deals, and how will you handle them?

amughal commented 1 year ago

@zcfil, thank you for the feedback. https://github.com/data-preservation-programs/filplus-checker-assets/blob/main/filecoin-project/filecoin-plus-large-datasets/issues/1563/1690254735817.md

The report above shows 5 SPs.

As I mentioned earlier on Slack, my scripts initially had issues using the same directories as paths, which created the CID sharing. I have since fixed that, and there has been no more sharing since it was last observed.

Thanks again.

zcfil commented 1 year ago

Please pay attention to duplicate CID issues. This round of review has passed, and we will continue to monitor whether the data remains normal in the future.

zcfil commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceca3ldde7tix77qb3simubntpdtm3dtwdrkz3tj2ow4ubj2t66bzg

Address

f1vurpgwsgteoi5ipdjf5akpvxrz7zvs7zm2oplyi

Datacap Allocated

100.00TiB

Signer Address

f1cjzbiy5xd4ehera4wmbz63pd5ku4oo7g52cldga

Id

e359760c-f45d-4bfc-a35f-c17726332d4f

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceca3ldde7tix77qb3simubntpdtm3dtwdrkz3tj2ow4ubj2t66bzg

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with Stale label and will be closed in 4 days. Comment if you want to keep this application open.

amughal commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

⚠️ 99.31% of deals are for data replicated across less than 3 storage providers.

Deal Data Shared with other Clients[^3]

⚠️ CID sharing has been observed. (Top 3)

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

Full report

Click here to view the CID Checker report. Click here to view the Retrieval Dashboard. Click here to view the Retrieval report.

cryptowhizzard commented 1 year ago

Hello,

I have done a retrieval on your data.

It can be viewed here.

http://datasetcreators.com/downloadedcarfiles/1563-f02020784-f032824-49643528-baga6ea4seaqeaa6vn2owfp5ffncwbou75pcxxtc3o7hbujx4ajlfagygr4ma4jq/

The tar file does not seem to be a tar file at all. I downloaded it and tried unpacking it, but that fails.

This does not appear to be CERN data at all. Can you explain?

amughal commented 1 year ago

CERN's datasets are organized by ID and were downloaded into individual numerical directories. These are mostly ROOT files, which are binary. A single large master tar file was created for all those datasets, and singularity was used to create the CAR files (930 CAR files). To verify that it is CERN's dataset, you can run "strings cern1.car > strings.out" against the file at your URL, and patterns like this will appear:

TBasket>recoTracks_generalTracksRECO.obj.hitPattern.hitPattern[25]
TBasket>recoTracks_generalTracks_RECO.obj.hitPattern.hitPattern_[25]
TBasket>recoTracks_generalTracksRECO.obj.hitPattern.hitPattern[25]

Further, running "grep root strings.out", I get the following:

root@storage:/bigdata-stor4# grep root strings.out
6048/98D1088B-866F-E211-864A-00304867908C.root
root
root
root
Merged.root
Merged.root
6048/DE30D494-B76E-E211-AD3A-0025905938D4.root
root
root
root
Merged.root
Merged.root
6048/221B16C0-076F-E211-98BA-003048FFD7A2.root
root
root
root
Merged.root
Merged.root
rroot
6048/56048C11-A16E-E211-B93D-00248C0BE014.root
root
root
root
Merged.root
Merged.root
6048/9A1B967B-C26E-E211-9CF5-0026189438C1.root
root
root
root
Merged.root
Merged.root
6048/6C7A2302-7E6E-E211-954A-0026189438A2.root
root
root
root
Merged.root
Merged.root
6048/D00F585C-E16E-E211-B8A8-002618943970.root
root
root
root
Merged.root
Merged.root
6048/8CAD6065-F36E-E211-A0DE-002618943894.root
root
root
root
Merged.root
Merged.root
rootYoOt
root@storage:/bigdata-stor4#
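The strings/grep verification above can be approximated in one small script. A minimal sketch that scans a binary file for printable ASCII runs (like the `strings` utility does) and filters for embedded ROOT filenames; the file names used in the demonstration are placeholders, not actual CAR contents:

```python
# Sketch: a minimal equivalent of `strings file | grep root`, scanning
# a binary blob for printable ASCII runs that look like ROOT file
# names embedded in a CAR file.

import re

def root_strings(path, min_len=4):
    with open(path, "rb") as f:
        data = f.read()
    # printable ASCII runs of at least min_len chars, as `strings` extracts
    runs = re.findall(rb"[ -~]{%d,}" % min_len, data)
    return [r.decode("ascii") for r in runs if b".root" in r]

# Toy demonstration: an embedded ROOT path surrounded by binary padding.
with open("demo.bin", "wb") as f:
    f.write(b"\x00\x01" + b"6048/98D1088B.root" + b"\xff\xfe")
print(root_strings("demo.bin"))  # ['6048/98D1088B.root']
```

Hitting paths like the ones in the grep output above is consistent with CMS open data, where events are distributed as `.root` files, though a content-level check (e.g. unpacking a CAR and opening a ROOT file) would be stronger evidence than string matching.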