filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] [DEPRECATED] Internet Archive (pilot project) #22

Closed parkan closed 3 years ago

parkan commented 3 years ago

Large Dataset Notary Application

To apply for a DataCap allocation for your dataset, please fill out the following information.

Core Information

Please respond to the questions below in paragraph form, replacing the text saying "Please answer here". Include as much detail as you can in your answer!

Project details

Share a brief history of your project and organization.

The Internet Archive, a 501(c)(3) non-profit, is building a digital library of Internet sites and other cultural artifacts in digital form. Like a physical library, we provide free access to researchers, historians, scholars, the print disabled, and the general public. Our mission is to provide Universal Access to All Knowledge. See more at https://archive.org/about/

This project aims to explore the role of decentralized storage in this long-term mission.

What is the primary source of funding for this project?

We are funded through donations, grants, and by providing web archiving and book digitization services for our partners. 

What other projects/ecosystem stakeholders is this project associated with?

The dataset was compiled in collaboration with The Library of Congress, California Digital Library, University of North Texas Libraries, Internet Archive, George Washington University Libraries, Stanford University Libraries, and the U.S. Government Publishing Office.

Use-case details

Describe the data being stored onto Filecoin

The End-of-Term Web Archive captures and saves U.S. Government websites at the end of presidential administrations. This dataset represents a comprehensive crawl of the .gov domain between September 2016 and January 20, 2017, at the end of the Obama Administration and just before the beginning of the Trump Administration.

Where was the data in this dataset sourced from?

Federal Government websites (.gov) in the Legislative, Executive, or Judicial branches of government, and related social media accounts. Also in scope are Federal Government Websites on other domains, such as .mil, .edu, and .com

Can you share a sample of what is in the dataset? A link to a file, an image, a table, etc., are good examples of this.

The dataset contains WARC files containing crawl data (and associated metadata) of the aforementioned sites. Their contents, when opened with a compatible viewer, are similar to https://web.archive.org/web/20170126033350/http:/globalchange.epa.gov/

The raw files look like this: https://archive.org/download/LOC-QUARTERLY-006-20161225070227072-13019-13025-wbgrp-crawl202

Confirm that this is a public dataset that can be retrieved by anyone on the Network (i.e., no specific permissions or access rights are required to view the data).

Yes, data is archived in the public interest. Archive is currently available at http://eotarchive.cdlib.org/search?f1-administration=2016

What is the expected retrieval frequency for this data?

This effort is intended primarily as an archival and exploratory use case. Data may be accessed by researchers, by periodic integrity checks, and by interactive-use prototypes (similar to Estuary).

For how long do you plan to keep this dataset stored on Filecoin? Will this be a permanent archival or a one-time storage deal?

The dataset is intended for long-term archival storage, depending on the outcomes of this trial.

DataCap allocation plan

In which geographies do you plan on making storage deals?

We're looking for a wide geographic distribution to model global resiliency. Miners in NA and EU geos will initially be considered.

What is your expected data onboarding rate? How many deals can you make in a day, in a week? How much DataCap do you plan on using per day, per week?

We have extensive interconnects to high-bandwidth networks and robust processing capacity. Once we get through the testing phase, we expect to be able to onboard 50-100TiB/week.

How will you be distributing your data to miners? Is there an offline data transfer process?

Offline data transfer over the internet, using standard HTTP or a purpose-made protocol like Tachyon.

How do you plan on choosing the miners with whom you will be making deals? This should include a plan to ensure the data is retrievable in the future both by you and others.

Miners that are in the right geographies and have high reputation scores on public indices like filrep.io. The initial set of storage providers for testing will likely be from the MinerX Fellowship.

How will you be distributing data and DataCap across miners storing data?

We will likely be structuring our files into 32GiB chunks that will be evenly distributed in deals with the selected set of storage providers.
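
One possible way to picture the "evenly distributed" plan is a simple round-robin assignment of 32GiB chunks to providers. This is only a sketch under stated assumptions: the provider names and the `assign_chunks` helper are hypothetical, and the application does not say which assignment scheme will actually be used.

```python
from collections import defaultdict

CHUNK_GIB = 32  # chunk size mentioned in the application


def assign_chunks(num_chunks, providers):
    """Assign chunk indices to providers in round-robin order (hypothetical helper)."""
    plan = defaultdict(list)
    for chunk_index in range(num_chunks):
        provider = providers[chunk_index % len(providers)]
        plan[provider].append(chunk_index)
    return dict(plan)


# Hypothetical provider names, for illustration only.
providers = ["provider-a", "provider-b", "provider-c"]
plan = assign_chunks(10, providers)

for provider, chunks in plan.items():
    print(provider, len(chunks), "chunks,", len(chunks) * CHUNK_GIB, "GiB")
```

With round-robin, no provider ever holds more than one chunk above any other, which is one reasonable reading of "evenly distributed".
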
large-datacap-requests[bot] commented 3 years ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

flyworker commented 3 years ago

Filswan would like to support this application.

dannyob commented 3 years ago

FF will support this application.

MegTei commented 3 years ago

Count me in

Destore2023 commented 3 years ago

ByteBase will support this application.

Fenbushi-Filecoin commented 3 years ago

Count me in

dkkapur commented 3 years ago

In the notary governance call today, we had @s0nik42 voice support for this application. We are now at 5 notaries.

steven004 commented 3 years ago

count me in

dkkapur commented 3 years ago

Actually looks like we're at 7 notaries across Oceania, NA, EU, GCR:

  1. @flyworker
  2. @dannyob
  3. @MegTei
  4. @swatchliu
  5. @Fenbushi-Filecoin
  6. @s0nik42
  7. @steven004

This is good to go! Thanks to @MegTei for agreeing to take the lead on this one as well (per message in Slack).

dkkapur commented 3 years ago

Multisig Notary requested

Notary addresses

f1hlubjsdkv4wmsdadihloxgwrz3j3ernf6i3cbpy
f1k6wwevxvp466ybil7y2scqlhtnrz5atjkkyvm4a
f1ystxl2ootvpirpa7ebgwl7vlhwkbx2r4zjxwe5i
f1yh6q3nmsg7i2sys7f7dexcuajgoweudcqj2chfi
f1yqydpmqb5en262jpottko2kd65msajax7fi4rmq
f1wxhnytjmklj2czezaqcfl7eb4nkgmaxysnegwii
f1qoxqy3npwcvoqy7gpstm65lejcy7pkd3hqqekna

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

100TiB

large-datacap-requests[bot] commented 3 years ago

Multisig created and sent to RKH f01157865

Destore2023 commented 3 years ago

Multisig Notary requested

Notary addresses

f1hlubjsdkv4wmsdadihloxgwrz3j3ernf6i3cbpy
f1k6wwevxvp466ybil7y2scqlhtnrz5atjkkyvm4a
f1ystxl2ootvpirpa7ebgwl7vlhwkbx2r4zjxwe5i
f1yh6q3nmsg7i2sys7f7dexcuajgoweudcqj2chfi
f1yqydpmqb5en262jpottko2kd65msajax7fi4rmq
f1wxhnytjmklj2czezaqcfl7eb4nkgmaxysnegwii
f1qoxqy3npwcvoqy7gpstm65lejcy7pkd3hqqekna

Total DataCap requested

5PiB

Expected weekly DataCap usage rate

100TiB

Confirm F1 address

parkan commented 3 years ago

assuming the above is for me, I'm confirming that I am f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

flyworker commented 3 years ago

confirmed f1 address f1hlubjsdkv4wmsdadihloxgwrz3j3ernf6i3cbpy

MegTei commented 3 years ago

Hi @parkan, your application is a great fit and use case. A couple of questions: (1) You have a number of users that will need to retrieve data. Can you break down the per-user retrieval frequency and the tools you plan to use for retrieval (they may differ according to user needs)? (2) What would you consider success measures for this pilot to progress beyond a trial? This last question may have a few dimensions; feel free to add them here, or we could have a brief chat if easier.

parkan commented 3 years ago

good questions @MegTei! replies inline

Hi @parkan, your application is a great fit and use case. A couple of questions: (1) You have a number of users that will need to retrieve data. Can you break down the per-user retrieval frequency and the tools you plan to use for retrieval (they may differ according to user needs)?

this has a few potential tiers, depending on performance and stability:

I realize that this is a pretty broad range of possibilities -- there are just too many unknowns at this point to say for sure what will be practical in the near-to-medium term, so I am sharing the best-guess range based on what we know so far

we are also open to non-0 cost retrieval deals, though for the first point above it would be ideal to keep it at or near 0

(2) What would you consider success measures for this pilot to progress beyond a trial? This last question may have a few dimensions; feel free to add them here, or we could have a brief chat if easier.

I think the main criteria are:

MegTei commented 3 years ago

(Previous request, before LDN complete. Please repost and correct allocation amount)

DataCap Allocation requested

Multisig Notary address

f01157865

Client address

f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

DataCap allocation requested

100TiB

large-datacap-requests[bot] commented 3 years ago

The allocation of the requested datacap cannot be performed. The requested amount exceeds 50% of the expected weekly datacap usage rate.
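
The rule in the bot's message can be sketched as a simple comparison. This is an illustration only, not the bot's actual code: the function name is hypothetical, and treating exactly 50% as acceptable is an assumption that happens to match the 50TiB request accepted later in this thread.

```python
TIB = 2**40  # bytes per TiB


def exceeds_first_allocation_limit(requested_bytes, weekly_rate_bytes):
    """Return True when the request is more than 50% of the weekly usage rate.

    Hypothetical helper; the inclusive boundary at exactly 50% is an assumption.
    """
    return requested_bytes * 2 > weekly_rate_bytes


weekly_rate = 100 * TIB  # expected weekly DataCap usage rate from the application

print(exceeds_first_allocation_limit(100 * TIB, weekly_rate))  # True: the rejected request
print(exceeds_first_allocation_limit(50 * TIB, weekly_rate))   # False: the later 50TiB request
```
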

galen-mcandrew commented 3 years ago

This specific notary (f01157865) has not been completed yet, so we cannot allocate any datacap yet. You can see here where the notary creation issue in the notary governance repo is not closed yet: https://github.com/filecoin-project/notary-governance/issues/214

You can also check inside the Fil+ app to see the pending notaries to be created. Anyone with a Ledger can sign in as an RKH for transparency; you just will not be able to sign messages.

MegTei commented 3 years ago

Hi @parkan, thanks for your responses above. We'll be happy to work with you on your data storage profiles, making this both a positive experience and a good outcome. I'll follow up on timeframes and should have an update for you next week. Meg

galen-mcandrew commented 3 years ago

Hi @parkan thanks for your patience! At this time we are seeing the large dataset notary successfully created (f01157865)

@MegTei I think you volunteered to take the lead here. Can you kick off the first allocation request, and tag the other notaries to approve?

Total DataCap requested: 5PiB
Expected weekly DataCap usage rate: 100TiB
First Allocation: 50TiB

MegTei commented 3 years ago

DataCap Allocation requested

Multisig Notary address

f01157865

Client address

f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

DataCap allocation requested

50TiB

MegTei commented 3 years ago

HI Notaries pls approve in FIL+ app @flyworker @dannyob @swatchliu @Fenbushi-Filecoin @s0nik42 @steven004

steven004 commented 3 years ago

Where is the request message? @MegTei

MegTei commented 3 years ago

@steven004 Is it not assigned to you in FIL+ app?

MegTei commented 3 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacea6q6kckaifzcf3vsuu5kekcojlhpwrclbln5zxpd234elfzb6rxc

Address

f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

Datacap Allocated

54975581388800

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea6q6kckaifzcf3vsuu5kekcojlhpwrclbln5zxpd234elfzb6rxc

MegTei commented 3 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceaohzzol62h6q3zmuwwsp7o4dqxudy4apvfvkdqcgnbdvgscej7gi

Address

f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

Datacap Allocated

54975581388800

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceaohzzol62h6q3zmuwwsp7o4dqxudy4apvfvkdqcgnbdvgscej7gi

dannyob commented 3 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacea42hotnn64qdcpj5lgfpizcydv6adwkkafkryk5ztyzroilsycpg

Address

f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

Datacap Allocated

54975581388800

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacea42hotnn64qdcpj5lgfpizcydv6adwkkafkryk5ztyzroilsycpg

neogeweb3 commented 3 years ago

Actually looks like we're at 7 notaries across Oceania, NA, EU, GCR:

  1. @flyworker
  2. @dannyob
  3. @MegTei
  4. @swatchliu
  5. @Fenbushi-Filecoin
  6. @s0nik42
  7. @steven004

I tried a couple of times without success, then realized I am not on the list, hahaha

flyworker commented 3 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebr5uviybji4xoiwv3fnrjnhh4gq6ok2sc7ehzhmvwunnpque5lca

Address

f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy

Datacap Allocated

54975581388800

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebr5uviybji4xoiwv3fnrjnhh4gq6ok2sc7ehzhmvwunnpque5lca

steven004 commented 3 years ago

@steven004 Is it not assigned to you in FIL+ app?

I see it now. I think the tool posted the message on GitHub before sending it to the Filecoin network. That's why I could not find the message in the Filecoin browser.

steven004 commented 3 years ago

There are 3 pending proposals. Why? And which one is valid?

$ ./lotus msig inspect f01157865
Balance: 0.00562949953421312 FIL
Spendable: 0.000000000001077488 FIL
Threshold: 4 / 7
Signers:
ID         Address
f01096275  f1hlubjsdkv4wmsdadihloxgwrz3j3ernf6i3cbpy
f0107408   f1k6wwevxvp466ybil7y2scqlhtnrz5atjkkyvm4a
f01103007  f1ystxl2ootvpirpa7ebgwl7vlhwkbx2r4zjxwe5i
f01105814  f1yh6q3nmsg7i2sys7f7dexcuajgoweudcqj2chfi
f0128939   f1yqydpmqb5en262jpottko2kd65msajax7fi4rmq
f0128874   f1wxhnytjmklj2czezaqcfl7eb4nkgmaxysnegwii
f0108586   f1qoxqy3npwcvoqy7gpstm65lejcy7pkd3hqqekna
Transactions:  3
ID      State    Approvals  To      Value   Method                Params
0       pending  3          f06     0 FIL   AddVerifiedClient(4)  825501b3fd975d3f9606d8e2d5cbf5ff5f66060c747e3e4700320000000000
1       pending  1          f06     0 FIL   AddVerifiedClient(4)  825501b3fd975d3f9606d8e2d5cbf5ff5f66060c747e3e4700320000000000
2       pending  1          f06     0 FIL   AddVerifiedClient(4)  82583103809e3e53011036cae1d5b2450ad6885201ba419d6cc9e5865d1b266cee830b8f2531db5b1c09412c9c05a8325853bf204700190000000000
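
The Params column above can be decoded by hand to see what each pending proposal would do. The following is a sketch under stated assumptions: it assumes the AddVerifiedClient parameters are a CBOR array of two byte strings (address, amount), that the amount is a sign byte followed by a big-endian magnitude, and that the address string is base32 of the payload plus a 4-byte blake2b checksum. All function names are hypothetical, and only the short CBOR length forms seen in this output are handled.

```python
import base64
import hashlib

# Params of transaction ID 0, copied from the msig inspect output above.
PARAMS_TX0 = "825501b3fd975d3f9606d8e2d5cbf5ff5f66060c747e3e4700320000000000"


def read_bytes(buf, pos):
    """Read one CBOR byte string (major type 2) starting at pos."""
    head = buf[pos]
    if head >> 5 != 2:
        raise ValueError("expected a CBOR byte string")
    info = head & 0x1F
    pos += 1
    if info < 24:          # length stored directly in the head byte
        length = info
    elif info == 24:       # length stored in the following byte
        length = buf[pos]
        pos += 1
    else:
        raise ValueError("length encoding not handled in this sketch")
    return buf[pos:pos + length], pos + length


def encode_address(raw, network="f"):
    """Render raw address bytes as a string (assumed format; protocols 1-3 only)."""
    protocol, payload = raw[0], raw[1:]
    checksum = hashlib.blake2b(raw, digest_size=4).digest()
    body = base64.b32encode(payload + checksum).decode().lower().rstrip("=")
    return f"{network}{protocol}{body}"


def decode_add_verified_client(params_hex):
    """Return (address string, amount in bytes) from hex-encoded params."""
    buf = bytes.fromhex(params_hex)
    if buf[0] != 0x82:
        raise ValueError("expected a 2-element CBOR array")
    raw_addr, pos = read_bytes(buf, 1)
    raw_amount, _ = read_bytes(buf, pos)
    # Assumed big-int layout: first byte is the sign, the rest is the magnitude.
    amount = int.from_bytes(raw_amount[1:], "big") if raw_amount else 0
    return encode_address(raw_addr), amount


address, amount = decode_add_verified_client(PARAMS_TX0)
print(address)
print(amount / 2**40, "TiB")
```

Read this way, transactions 0 and 1 carry identical params (an f1 address and 50TiB), while transaction 2 targets a different, f3-type address with an amount of 25TiB.
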
galen-mcandrew commented 3 years ago

Hi @steven004 ! Are you able to see the message and sign using the FilPlus app? We've had some issues previously with some messages being signed in the app versus directly through Lotus, and transactions getting out of alignment.

steven004 commented 3 years ago

@galen-mcandrew, no, I do not use a Ledger wallet but an offline wallet, and I just build the message myself. It works well and is much more flexible.

Now I know that there is only one proposal shown in the FilPlus app, which is actually txID 0, though there are actually 3 pending ones on the Filecoin network. Let's move on with this one.

steven004 commented 3 years ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacebcb7i4yaxmejhgdo34bmpqpai3vygnazsk4afiwaaa7cxrdz66r6

Address

f1qoxqy3npwcvoqy7gpstm65lejcy7pkd3hqqekna

Datacap Allocated

54975581388800

You can check the status of the message here: https://filscan.io/tipset/message-detail?cid=bafy2bzacebcb7i4yaxmejhgdo34bmpqpai3vygnazsk4afiwaaa7cxrdz66r6

4 out of 7 notaries approved the proposal, so the request has been applied. This client, f1wp6zoxj7sydnrywvzp276x3gayghi7r6le4tcwy, now has datacap: 55035710930944
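
For readers checking the raw figures, a short conversion helps, assuming both numbers are byte counts. The allocated amount quoted in the approval messages works out to exactly 50TiB, and the reported client balance is 56GiB above that. The thread does not say where the extra 56GiB comes from.

```python
TIB = 2**40  # bytes per TiB
GIB = 2**30  # bytes per GiB

allocated = 54975581388800       # "Datacap Allocated" from the approval messages
client_balance = 55035710930944  # client DataCap reported after the 4th approval

print(allocated / TIB)                     # 50.0
print((client_balance - allocated) / GIB)  # 56.0
```
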

Go, Internet Archive ...

MegTei commented 3 years ago

Hi @parkan, we have allocated DataCap, which you can now start to use. Referring back to your mid-August responses above: let us know if you need any support or guidance with your data storage profiles from this point. At this early stage of the FIL+ program, co-creating with you will help us deliver a positive experience and meet your success measures. Let us know how we can help. Meg

parkan commented 3 years ago

@MegTei thank you so much (+to everyone else!) we are scaling up dealmaking capacity and onboarding miners now

galen-mcandrew commented 3 years ago

rolling to https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/52

large-datacap-requests[bot] commented 3 years ago

Thanks for your request! :exclamation: We have found some problems in the information provided. We could not find any Expected weekly DataCap usage rate in the information provided

Please, take a look at the request and edit the body of the issue providing all the required information.
filplus-checker commented 1 year ago

DataCap and CID Checker Report[^1]

If this is the first time a provider takes a verified deal, it will be marked as new.

For most DataCap applications, the restrictions below should apply.

⚠️ f01611097 has sealed 32.00% of total datacap.

⚠️ f020378 has unknown IP location.

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
| --- | --- | --- | --- | --- | --- |
| f01611097 | San Clemente, California, US | 73.78 TiB | 32.00% | 71.66 TiB | 2.87% |
| f01910202 | Philadelphia, Pennsylvania, US | 44.68 TiB | 19.38% | 44.25 TiB | 0.96% |
| f01873432 | Las Vegas, Nevada, US | 27.92 TiB | 12.11% | 26.62 TiB | 4.67% |
| f01904630 | Las Vegas, Nevada, US | 23.47 TiB | 10.18% | 20.62 TiB | 12.16% |
| f01882184 | Singapore, Singapore, SG | 18.81 TiB | 8.16% | 18.81 TiB | 0.00% |
| f01826669 | Philadelphia, Pennsylvania, US | 18.29 TiB | 7.93% | 18.29 TiB | 0.00% |
| f01851683 (new) | Las Vegas, Nevada, US | 11.52 TiB | 4.99% | 11.52 TiB | 0.00% |
| f01883179 (new) | Philadelphia, Pennsylvania, US | 7.69 TiB | 3.33% | 7.69 TiB | 0.00% |
| f01606675 | Montréal, Quebec, CA | 3.30 TiB | 1.43% | 3.30 TiB | 0.00% |
| f066596 | San Diego, California, US | 274.00 GiB | 0.12% | 274.00 GiB | 0.00% |
| f01091840 | Montréal, Quebec, CA | 176.00 GiB | 0.07% | 160.00 GiB | 9.09% |
| f01199442 | Heerhugowaard, North Holland, NL | 146.00 GiB | 0.06% | 130.00 GiB | 10.96% |
| f02576 | Copenhagen, Capital Region, DK | 96.00 GiB | 0.04% | 96.00 GiB | 0.00% |
| f0157535 | Montréal, Quebec, CA | 80.00 GiB | 0.03% | 80.00 GiB | 0.00% |
| f0104671 | Kawasaki, Kanagawa, JP | 64.00 GiB | 0.03% | 64.00 GiB | 0.00% |
| f019104 | Montréal, Quebec, CA | 53.00 GiB | 0.02% | 53.00 GiB | 0.00% |
| f09848 | Rancho Santa Margarita, California, US | 48.00 GiB | 0.02% | 48.00 GiB | 0.00% |
| f0165400 | Montréal, Quebec, CA | 48.00 GiB | 0.02% | 48.00 GiB | 0.00% |
| f01207045 | Heerhugowaard, North Holland, NL | 32.00 GiB | 0.01% | 32.00 GiB | 0.00% |
| f058369 | Boston, Massachusetts, US | 32.00 GiB | 0.01% | 32.00 GiB | 0.00% |
| f010088 | Everett, Washington, US | 18.00 GiB | 0.01% | 18.00 GiB | 0.00% |
| f0694396 | Birmingham, England, GB | 17.00 GiB | 0.01% | 17.00 GiB | 0.00% |
| f019551 | Birmingham, England, GB | 16.00 GiB | 0.01% | 16.00 GiB | 0.00% |
| f030379 | Seoul, Seoul, KR | 16.00 GiB | 0.01% | 16.00 GiB | 0.00% |
| f010446 | Zaventem, Flanders, BE | 16.00 GiB | 0.01% | 16.00 GiB | 0.00% |
| f01199430 | Heerhugowaard, North Holland, NL | 3.00 GiB | 0.00% | 3.00 GiB | 0.00% |
| f024184 | Seoul, Seoul, KR | 2.00 GiB | 0.00% | 2.00 GiB | 0.00% |
| f020378 | Unknown | 2.00 GiB | 0.00% | 2.00 GiB | 0.00% |
| f01345523 | Antwerpen, Flanders, BE | 2.00 GiB | 0.00% | 2.00 GiB | 0.00% |
| f01784458 | Oslo, Oslo, NO | 1.00 GiB | 0.00% | 1.00 GiB | 0.00% |
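
The Duplicate Deals column in the provider table appears consistent with (total sealed minus unique data) divided by total sealed. The sketch below checks that reading against two rows; the formula is an inference from the numbers, not something the report states, and `duplicate_pct` is a hypothetical helper. Rows reported in TiB may differ in the last digit because the displayed sizes are already rounded.

```python
def duplicate_pct(total_sealed, unique_data):
    """Share of sealed data that is not unique, as a percentage (assumed formula)."""
    return round(100 * (total_sealed - unique_data) / total_sealed, 2)


# (total sealed GiB, unique GiB) taken from the rows for f01091840 and f01199442.
print(duplicate_pct(176.00, 160.00))  # 9.09
print(duplicate_pct(146.00, 130.00))  # 10.96
```
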

Provider Distribution

Deal Data Replication

The table below shows how much unique data is replicated across storage providers.

⚠️ 83.14% of deals are for data replicated across fewer than 4 storage providers.

| Unique Data Size | Total Deals Made | Number of Providers | Deal Percentage |
| --- | --- | --- | --- |
| 26.40 TiB | 26.41 TiB | 1 | 11.45% |
| 28.93 TiB | 58.10 TiB | 2 | 25.20% |
| 35.01 TiB | 107.20 TiB | 3 | 46.49% |
| 8.64 TiB | 38.87 TiB | 4 | 16.86% |
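
The 83.14% warning can be reproduced from the replication table by summing the deal volume for rows with fewer than 4 providers. This is a sketch that assumes the percentage is computed over the Total Deals Made column; the report does not spell out its method.

```python
# (unique data TiB, total deals made TiB, number of providers) from the table above.
rows = [
    (26.40, 26.41, 1),
    (28.93, 58.10, 2),
    (35.01, 107.20, 3),
    (8.64, 38.87, 4),
]

total_deals = sum(deals for _, deals, _ in rows)
under_replicated = sum(deals for _, deals, providers in rows if providers < 4)
share = 100 * under_replicated / total_deals

print(f"{share:.2f}%")  # 83.14%
```
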

Replication Distribution

Deal Data Shared with other Clients

The table below shows how much unique data is shared with other clients. Usually different applications own different data, which should not resolve to the same CID.

⚠️ CID sharing has been observed.

| Other Client | Application | Total Deals Affected | Unique CIDs | Verifier |
| --- | --- | --- | --- | --- |
| f1sw5zjcyo4mff5cbvgsgmm7uoko6gcr4tptvtkhy | Glif auto verified | 208.00 GiB | 1 | Jonathan Schwartz |
| f3wkp4blevjsrtbc6vwgjf2sedzjwsqmj3wsh4uexbp4k7dggs72kbvuv7xivsnz7cnmfazpmqp3qmchmzms6a | Unknown | 208.00 GiB | 1 | Unknown |
| f3u5dehxxe2uvehitioxhwjp27wpv72hsnuqhtz6sce2wzqv2skhguivnsvwbkwgczcc5x4qf6eeao34tejqdq | Glif auto verified | 16.00 GiB | 1 | Jonathan Schwartz |

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger