filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] EOT Dataset #2078

Closed maxDi84 closed 9 months ago

maxDi84 commented 1 year ago

Data Owner Name

The End of Term Web Archive

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

United States

Data Owner Industry

Government

Website

https://eotarchive.org/

Social Media

https://twitter.com/eotarchive

Total amount of DataCap being requested

6PiB

Expected size of single dataset (one copy)

630.1TiB

Number of replicas to store

10

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

We are an independent team from Web3 and want to spread beneficial data around the world. The dataset is from EOT. The End of Term Web Archive (EOT) captures and saves U.S. Government websites at the end of presidential administrations. The EOT has thus far preserved websites from administration changes in 2008, 2012, 2016, and 2020. Data from these web crawls have been made openly available in several formats in this dataset.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

Web Archive Crawl Data (WARC and ARC formats). The End of Term Web Archive contains federal government websites (.gov, .mil, etc.) in the Legislative, Executive, or Judicial branches of the government. Websites that were at risk of changing (e.g., whitehouse.gov) or disappearing altogether during government transitions were captured. Local government websites, or any other sites not part of the federal government domain, were out of scope.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

IPFS, lotus, singularity

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

aws s3 ls --no-sign-request s3://eotarchive/

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

2 to 3 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America, South America, Europe

How will you be distributing your data to storage providers

HTTP or FTP server, IPFS, Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

SPs we'll work with:
f01323699
f01396576
More SPs will join this application.

How do you plan to make deals to your storage providers

Lotus client, Singularity

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

Sunnyiscoming commented 1 year ago

What is your role at the company that is behind this project? How are you connected to the dataset? The website isn't yours. Do you work for the listed organization? How are you finding SPs? Please list a detailed plan.

maxDi84 commented 1 year ago

@Sunnyiscoming I'm an individual data preparer. There is no company behind this application. This dataset is public data that is stored on AWS.

There are no restrictions on the use, access, and/or download of data from the End of Term Web Archive Dataset.

These words are from the dataset itself, and the data is public to everyone. The EOT has thus far preserved websites from administration changes. I think it is useful to the public. The two SPs we'll work with were recommended by a friend. I'll contact other SPs via Slack or other channels.

Sunnyiscoming commented 1 year ago

Datacap Request Trigger

Total DataCap requested

6PiB

Expected weekly DataCap usage rate

1PiB

Client address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Multisig Notary address

f02049625

Client address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

DataCap allocation requested

307.19TiB

Id

b5c2bc7e-d1c2-47f4-bfc2-7cf6b0512b39

spaceT9 commented 1 year ago

Can you explain your detailed SP distribution plan?

ipollo00 commented 1 year ago

Hi @maxDi84. Of the two SPs you mentioned, the first one seems to be CC (committed capacity) and the second one is a new node. How do you plan to transfer the data to them? Are they reachable? Besides, I remember that many SPs have already sealed "Web Archive Crawl Data". Have you done any research on whether another data preparer has already selected this data?

maxDi84 commented 1 year ago

@spaceT9 Our distribution plan will follow the rule 'storage providers should not be storing more than 20% of the duplicated data'. We hope to find 4-5 SPs for the first-round allocation, and then expand the group of cooperating SPs.

@ipollo00 Considering the distance between us and them, we'll ship hard drives. When we began the cooperation, they said they would be ready to be reachable. I don't know who stored this data before.
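
For readers who want to check a plan against the 20% rule quoted above, here is a minimal sketch. It assumes the rule is read as a cap on each provider's share of the total stored data; the helper names `provider_shares` and `over_limit`, the provider IDs, and the deal sizes are hypothetical and do not come from this application.

```python
from collections import Counter


def provider_shares(deals):
    """Map each provider to its fraction of the total stored size.

    `deals` is an iterable of (provider_id, size_tib) pairs.
    """
    totals = Counter()
    for provider, size_tib in deals:
        totals[provider] += size_tib
    grand_total = sum(totals.values())
    return {provider: size / grand_total for provider, size in totals.items()}


def over_limit(shares, limit=0.20):
    """Return the providers whose share is strictly above `limit`."""
    return sorted(p for p, share in shares.items() if share > limit)


# Hypothetical deal sizes in TiB, for illustration only.
deals = [("sp-a", 50), ("sp-b", 20), ("sp-c", 15), ("sp-d", 15)]
shares = provider_shares(deals)
flagged = over_limit(shares)
print(flagged)
```

With these made-up numbers only `sp-a` holds more than 20% of the total, so it is the only provider flagged.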

spaceT9 commented 1 year ago

Can you list the SPs and how about their retrieval success rates?

herrehesse commented 1 year ago

The initial request for 6 PiB using a newly created account and without establishing trust within the community is concerning. I highly recommend closing the application and reconsidering by starting with a smaller amount to build credibility and trust before making larger requests.

Advising notaries not to engage.

zcfil commented 1 year ago

> The initial request for 6 PiB using a newly created account and without establishing trust within the community is concerning. I highly recommend closing the application and reconsidering by starting with a smaller amount to build credibility and trust before making larger requests.
>
> Advising notaries not to engage.

@herrehesse Hi, I don't know why new accounts would not be allowed to apply for large datasets. I haven't seen this rule in Fil+. Maybe you can propose adding it.

zcfil commented 1 year ago

@maxDi84 I'm glad you are providing such rich data to be included in Filecoin. There are requirements for the number of SP copies, including their geographical location. This review has been approved. If you have any questions during the storage process, please feel free to contact me. Thank you.

zcfil commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzaceajfc3ov4ne472jeqmejc5amaeoa7giusmyjox4xiw5xcc5nevdds

Address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Datacap Allocated

307.19TiB

Signer Address

f1cjzbiy5xd4ehera4wmbz63pd5ku4oo7g52cldga

Id

b5c2bc7e-d1c2-47f4-bfc2-7cf6b0512b39

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceajfc3ov4ne472jeqmejc5amaeoa7giusmyjox4xiw5xcc5nevdds

Bennyyangpu commented 1 year ago

Welcome more new friends to experience the fil+ program!

Bennyyangpu commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzaceavgnatw4hvwafh65pbervmjio522cjetzzguep2frkgj2akp5ine

Address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Datacap Allocated

307.19TiB

Signer Address

f174fg3bqbln3zjnkxtyf6s54txqkr7yqkj6cig7y

Id

b5c2bc7e-d1c2-47f4-bfc2-7cf6b0512b39

You can check the status of the message here: https://filfox.info/en/message/bafy2bzaceavgnatw4hvwafh65pbervmjio522cjetzzguep2frkgj2akp5ine

herrehesse commented 1 year ago

Zero DD by both @Aifabot-Cloud & @zcfil committing to 300T of datacap.

@raghavrmadya @Kevin-FF-USA Please immediately act on this behaviour.

ghost commented 1 year ago

We acknowledge that @maxDi84 has not yet answered the question above about the SP entities they are working with (LINK), yet the application was still signed by @Aifabot-Cloud and @zcfil.

Notaries, please note that applications can no longer be signed without clear proof of a distributed storage plan from the client. Max said "they will find 4-5 SPs for the first round". They need to actually find and list the SPs before allocations are made.

ghost commented 1 year ago

Hello @maxDi84 per the new guidelines https://github.com/filecoin-project/notary-governance/issues/922 for Open Dataset applicants, please complete the following Fil+ registration form to identify yourself as the applicant and also please add the contact information of the SP entities you are working with to store copies of the data.

This information will be reviewed by Fil+ Governance team to confirm validity toward the Fil+ guideline of a distributed storage plan and then the application will be approved for additional notary review. Let us know if you have any questions.

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 2

Multisig Notary address

f02049625

Client address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

DataCap allocation requested

512TiB

Id

c2a53ce9-0f48-4c7d-b9f3-9bbe216ba454

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Rule to calculate the allocation request amount

100% weekly > 0.5PiB, requesting 0.5PiB

DataCap allocation requested

512TiB

Total DataCap granted for client so far

307.18TiB

Datacap to be granted to reach the total amount requested by the client (6PiB)

5.70PiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 3393 | 2 | 307.18TiB | 50.19 | 78.09TiB |

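
The figures in this report can be reproduced with a little arithmetic. The sketch below assumes the rule string "100% weekly > 0.5PiB, requesting 0.5PiB" means "take 100% of the weekly rate, capped at 0.5PiB", and that 1PiB = 1024TiB; the function names are hypothetical and this is not the bot's actual code.

```python
TIB_PER_PIB = 1024  # binary units: 1 PiB = 1024 TiB


def capped_request_tib(weekly_pib, weekly_factor, cap_pib):
    """Assumed reading of the rule: factor x weekly rate, never above the cap."""
    uncapped_tib = weekly_pib * weekly_factor * TIB_PER_PIB
    return min(uncapped_tib, cap_pib * TIB_PER_PIB)


def remaining_pib(total_pib, granted_tib):
    """DataCap still to be granted, in PiB, after `granted_tib` has been issued."""
    return (total_pib * TIB_PER_PIB - granted_tib) / TIB_PER_PIB


# Values taken from the application and the report above.
request_tib = capped_request_tib(weekly_pib=1, weekly_factor=1.0, cap_pib=0.5)
left_pib = remaining_pib(total_pib=6, granted_tib=307.18)

print(f"{request_tib:.0f}TiB")  # 512TiB
print(f"{left_pib:.2f}PiB")     # 5.70PiB
```

Both results match the "DataCap allocation requested" and "Datacap to be granted" values shown in the report.
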
MRJAVAZHAO commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

maxDi84 commented 1 year ago

@Filplus-govteam I have completed the form.

MRJAVAZHAO commented 1 year ago

Can you provide a distributed storage plan here? There are only 3 SPs in the first DataCap distribution.

maxDi84 commented 1 year ago

@MRJAVAZHAO Our DataCap for the first round was only a small amount. We'll cooperate with more SPs who support retrieval to make the distribution much more decentralized. We are communicating with them, and there may be 4-6 SPs in the next round.

MRJAVAZHAO commented 1 year ago

Can you list the name, node, location, contact information?

maxDi84 commented 1 year ago

Yes, here are the SPs we are cooperating with now:
f02232088 | Leo | Chengdu, Sichuan
f02232007 | Martin | Chengdu, Sichuan
f01955030 | Nico | Shanghai

ghost commented 1 year ago

We can confirm this is what was submitted on the registration form as well. No contact information included.

Holiday507 commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacecqtmiynibxilgtnxzz4a53eqavhifzekzvgldjkhi5finpkob5ls

Address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Datacap Allocated

512.00TiB

Signer Address

f1sa3dp3a7fwirrsxjdthvzneo7rnjcrrfllsnjpq

Id

c2a53ce9-0f48-4c7d-b9f3-9bbe216ba454

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecqtmiynibxilgtnxzz4a53eqavhifzekzvgldjkhi5finpkob5ls

ghost commented 1 year ago

@maxDi84 would like to see a list of SPs you plan to work with on the next allocation BEFORE it's allocated.

maxDi84 commented 1 year ago

@Filplus-govteam We plan to continue working with the SPs listed above. We have also contacted f02029743 (huangjiang, zhejiang) and f02233608 (Nick, Hongkong) about cooperating on the next allocation. Notaries are welcome to help sign our application!

Destore2023 commented 1 year ago

Dear @Filplus-govteam,

Based on this, @maxDi84 submitted the form, and the CID bot and retrieval bot metrics look good. I am willing to sign this.

Besides, pseudonymity is in crypto’s DNA. Have you seen any BTC miner or (previously) ETH miner who was asked for contact information? I am sure no contact information was required when Filecoin started in 2020.

Thanks

> We can confirm this is what was submitted on the registration form as well. No contact information included.

Destore2023 commented 1 year ago

Request Approved

Your Datacap Allocation Request has been approved by the Notary

Message sent to Filecoin Network

bafy2bzacedendh63sqmjrsffmnqekyljk6gccghvao2qrjenluzgpoqg426v2

Address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Datacap Allocated

512.00TiB

Signer Address

f1yh6q3nmsg7i2sys7f7dexcuajgoweudcqj2chfi

Id

c2a53ce9-0f48-4c7d-b9f3-9bbe216ba454

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedendh63sqmjrsffmnqekyljk6gccghvao2qrjenluzgpoqg426v2

large-datacap-requests[bot] commented 1 year ago

DataCap Allocation requested

Request number 3

Multisig Notary address

f02049625

Client address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

DataCap allocation requested

1PiB

Id

db0415c5-226f-483a-b0fa-6aa6afa95ae7

large-datacap-requests[bot] commented 1 year ago

Stats & Info for DataCap Allocation

Multisig Notary address

f02049625

Client address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Rule to calculate the allocation request amount

200% weekly > 1PiB, requesting 1PiB

DataCap allocation requested

1PiB

Total DataCap granted for client so far

465661.3YiB

Datacap to be granted to reach the total amount requested by the client (6PiB)

465661.3YiB

Stats

| Number of deals | Number of storage providers | Previous DC Allocated | Top provider | Remaining DC |
| --- | --- | --- | --- | --- |
| 15890 | 6 | 512TiB | 25.75 | 130.53TiB |

AthSmith commented 1 year ago

checker:manualTrigger

filplus-checker-app[bot] commented 1 year ago

DataCap and CID Checker Report Summary[^1]

Retrieval Statistics

Storage Provider Distribution

✔️ Storage provider distribution looks healthy.

Deal Data Replication

✔️ Data replication looks healthy.

Deal Data Shared with other Clients[^3]

✔️ No CID sharing has been observed.

[^1]: To manually trigger this report, add a comment with text checker:manualTrigger

[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger

[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...

AthSmith commented 1 year ago

Good report, willing to support!

AthSmith commented 1 year ago

Request Proposed

Your Datacap Allocation Request has been proposed by the Notary

Message sent to Filecoin Network

bafy2bzacebpi2ccha2tndw6fdquxsjxlebez5manghhbmuo4m5oqczfaxdg3e

Address

f1aqhgcbhuxhyu2mtiecsrdizz5sx7ptlwa3tgrly

Datacap Allocated

1.00PiB

Signer Address

f1vxbqrf7rfum3n6m5u6eb4re6xj7amvsaqnzu64y

Id

db0415c5-226f-483a-b0fa-6aa6afa95ae7

You can check the status of the message here: https://filfox.info/en/message/bafy2bzacebpi2ccha2tndw6fdquxsjxlebez5manghhbmuo4m5oqczfaxdg3e

cryptowhizzard commented 1 year ago

This client is actively stalling HTTP retrievals and has blocked HTTP range requests with a reverse proxy to prevent its data from being investigated.

It works as follows:

One sets a bandwidth limit with NGINX on the HTTP retrieval. After a random amount of data, the limit is set to zero, which makes the transfer time out. Because range retrieval is disabled in NGINX, one cannot pick up where one left off and needs to start all over again.

Log can be found at http://datasetcreators.com/downloadedcarfiles/logs/2078.log
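
Whether an endpoint honors byte-range requests can be checked independently of any particular download tool. The sketch below is a generic probe built on Python's standard library, not the tooling used for the log above; `classify_range_response` and `probe_range` are hypothetical helper names, and no URL from this thread is assumed. It relies only on standard HTTP semantics: a server that supports ranges answers a `Range` request with `206 Partial Content` and a `Content-Range` header, while a server that ignores ranges answers `200` with the full body.

```python
import urllib.error
import urllib.request


def classify_range_response(status, headers):
    """Interpret the reply to a 'Range: bytes=0-0' probe."""
    names = {name.lower() for name in headers}
    if status == 206 and "content-range" in names:
        return "range-supported"
    if status == 200:
        # Full body returned: an interrupted transfer cannot be resumed.
        return "range-ignored"
    if status == 416:
        return "range-not-satisfiable"
    return "unknown"


def probe_range(url, timeout=30):
    """Request the first byte of `url` and classify the server's answer."""
    request = urllib.request.Request(url, headers={"Range": "bytes=0-0"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_range_response(
                response.status, dict(response.headers.items())
            )
    except urllib.error.HTTPError as error:
        # urlopen raises for 4xx/5xx, so classify the error response too.
        return classify_range_response(error.code, dict(error.headers.items()))
```

A result of `range-ignored` only shows that resuming is not possible on that endpoint; it does not by itself show why, or whether a bandwidth limit is in place.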

[Screenshot, 2023-08-08 21:05:27]

maxDi84 commented 1 year ago

We don't put a bandwidth limit on retrieval. Please don't judge others so easily.

ghost commented 1 year ago

CID Report showing SPs:
f02029743 | Hong Kong, Central and Western, HK | ANLIAN NETWORK TECHNOLOGY CO., LIMITED
f01955030 | Shanghai, Shanghai, CN | China Mobile communications corporation
f01955034 | Hangzhou, Zhejiang, CN | CHINA UNICOM China169 Backbone
f02232007 | Chengdu, Sichuan, CN | CHINA UNICOM China169 Backbone
f02232088 | Chengdu, Sichuan, CN | CHINA UNICOM China169 Backbone
f02233608 | Chengdu, Sichuan, CN | CHINA UNICOM China169 Backbone
f02196792 | Hong Kong, Central and Western, HK

What was listed as the SPs:
f02232088 | Leo | Chengdu, Sichuan
f02232007 | Martin | Chengdu, Sichuan
f01955030 | Nico | Shanghai
f02029743 huangjiang, zhejiang
f02233608 Nick, Hongkong

@maxDi84 how is this considered distributed across geographic regions?
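
The comparison between the two lists can be made mechanical. The sketch below uses only the provider IDs and cities quoted in this comment; the variable names are illustrative.

```python
from collections import Counter

# Provider IDs the applicant listed in earlier comments.
listed = {"f02232088", "f02232007", "f01955030", "f02029743", "f02233608"}

# Provider IDs and cities as shown in the CID report quoted above.
observed_city = {
    "f02029743": "Hong Kong",
    "f01955030": "Shanghai",
    "f01955034": "Hangzhou",
    "f02232007": "Chengdu",
    "f02232088": "Chengdu",
    "f02233608": "Chengdu",
    "f02196792": "Hong Kong",
}

observed = set(observed_city)
unlisted = sorted(observed - listed)  # in the report but never listed
missing = sorted(listed - observed)   # listed but absent from the report
city_counts = Counter(observed_city.values())

print(unlisted)     # ['f01955034', 'f02196792']
print(missing)      # []
print(city_counts.most_common(1))  # [('Chengdu', 3)]
```

Two providers in the report were never listed by the applicant, and three of the seven observed providers resolve to Chengdu.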

maxDi84 commented 1 year ago

@Filplus-govteam Your description is basically accurate; all of these cities are thousands of kilometers away from each other, which fits perfectly with the distribution across geography. Thanks for your comments.

cryptowhizzard commented 1 year ago

> We don't put a bandwidth limit on retrieval. Please don't judge others so easily.

OK, let's make this less complicated. It is up to you to prove that you are storing real data as stated in your LDN. This is done by keeping the data readily retrievable, as set out in the rules and guidelines. I can't retrieve it (see the log) for whatever reason, so we need to fix that.

I am trying to receive a sample from you to do due diligence. As can be seen in the log, I am trying to fetch baga6ea4seaqmdrnu2e65kev4jsn3p5f5gjzby2tgsbhykwebwgkpwa5o6ntgafy. This is the sample I want to see.

Please make that piece available for download somewhere so I can retrieve it properly and unpack it to check your data.

ghost commented 1 year ago

Let's also ask other notaries to get involved here on retrieval testing. Can you verify cryptowhizzard's theory?

maxDi84 commented 1 year ago

@cryptowhizzard Ok! You can send a harddisk to me so that I can copy it to you.

PluskitOfficial commented 1 year ago

We can retrieve it successfully. Please remain and continue to follow the community rules. P

cryptowhizzard commented 1 year ago

> @cryptowhizzard Ok! You can send a harddisk to me so that I can copy it to you.

Again, it is your responsibility to have data readily retrievable. It's mine to do due diligence.

Upload the file somewhere so we all can check and download it.

maxDi84 commented 1 year ago

> > @cryptowhizzard Ok! You can send a harddisk to me so that I can copy it to you.
>
> Again, it is your responsibility to have data readily retrievable. It's mine to do due diligence.
>
> Upload the file somewhere so we all can check and download it.

Once more, we support retrieval and the data can be downloaded. You need to check your network and your program. Or you can send a hard disk so that I can copy the data for you.

cryptowhizzard commented 1 year ago

I tried from multiple sources around the globe (AWS, Google Cloud and our own network). All gave the same results. I gave you the log as evidence.

Again, you need to make it retrievable without restrictions for everyone on the network, including me.

As an alternative, I offered to let you make the 32 GB available to me at an alternate location, but if you refuse, your LDN will stay stuck as is.