filecoin-project/Allocator-Governance

Application for DataCap Refilling from NonEntropy #63

Closed joshua-ne closed 1 month ago

joshua-ne commented 3 months ago

Allocator Application: filecoin-project/notary-governance#1022

  1. Since becoming an Allocator, we have received a total of 5 PiB of DataCap and have issued about 4 PiB to date. To better serve our Clients, we are submitting this refill application.
  2. We manage the Clients' application and approval process in accordance with the DataCap governance guidelines. We inspect the Clients' data content, the locations of their SPs, and the success rate of CID retrieval. Specifically, we try to verify SP locations not only by their broadcast IPs: we also ask Clients to provide the SPs' phone numbers, and we call to verify their country/region codes.
  3. Our strategy for issuing quotas to Clients consistently follows the principle of small-scale trials in the initial stages, gradually expanding later. We keep in very close contact with our Clients.
  4. Our Clients allocate DataCap to different SPs in reasonable proportions and ensure that SPs do not store duplicate CIDs.
  5. Last but most importantly, we actively participate in community governance, providing constructive suggestions while striving to improve the effectiveness of DataCap management. We have noticed many arguments about evaluating SPs' retrieval rates with SPARK, and we have also heard plenty of complaints from our own Clients and their SPs. The current evaluation methods leave a lot of work and open issues, for example:
    a. Support for multiple retrieval methods: HTTP, Graphsync, legacy Lotus market vs. Boost, etc. (a sketch of such a multi-protocol probe follows this list).
    b. Support for different systems, such as Lotus/Venus/Curio.
    c. Support for deals made with DDO.
    d. Support for evaluating selected data_cids, since Clients should only be responsible for the retrieval of their own data_cids, not for all historical deals.

    We are actively developing new tools and processes to make this evaluation both strict and convenient for our Clients and their SPs. For now, since we are not yet ready to collect all CIDs from the network, we require our Clients to publicly report the CIDs of the fired deals, and we use those CIDs with our newly developed bots to run retrieval evaluations. We are still working to solve the problems listed above; once the tool is ready, we will open-source it to the community.
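As a rough illustration of item 5a, the sketch below probes one SP for one data_cid over two common retrieval paths: Graphsync via the `lassie` CLI, and HTTP piece retrieval with a URL shape following Boost's booster-http endpoint. The multiaddr/host values, timeouts, and helper names are illustrative assumptions, not NonEntropy's actual tooling.

```python
import subprocess
import urllib.request

def try_graphsync(provider_maddr: str, data_cid: str, timeout: int = 120) -> bool:
    """Fetch via the lassie CLI pinned to one provider; success = exit code 0.
    provider_maddr is the SP's multiaddr (including its peer ID)."""
    try:
        result = subprocess.run(
            ["lassie", "fetch", "--providers", provider_maddr,
             "--output", "/dev/null", data_cid],
            capture_output=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

def try_http(host: str, piece_cid: str, timeout: int = 120) -> bool:
    """Request the first byte of a piece from an assumed booster-http endpoint."""
    req = urllib.request.Request(
        f"http://{host}/piece/{piece_cid}", headers={"Range": "bytes=0-0"}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status in (200, 206)  # full or partial content
    except OSError:  # covers URLError/HTTPError and socket timeouts
        return False
```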

====================================================================== CURRENT ALLOCATIONS

Allocation 1

Client: TechToolbox--01

Application issue: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/1

The client added SPs to the issue several times, and the actual SPs are consistent with those updates. The DataCap allocations look good as well.

Storage report: https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/1/1715082862262.md

Note: there are some major differences between the SP list provided in the application and the SPs listed below. However, the application issue thread shows our communication with the client about the SP changes. We will ask clients to update their applications promptly in the future.

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
| --- | --- | --- | --- | --- | --- |
| f02894875 | Nanchang, Jiangxi, CN (CHINA UNICOM China169 Backbone) | 290.94 TiB | 15.93% | 290.94 TiB | 0.00% |
| f0427989 | Qingdao, Shandong, CN (CHINA UNICOM China169 Backbone) | 116.44 TiB | 6.38% | 116.44 TiB | 0.00% |
| f03035656 (new) | Nanchang, Jiangxi, CN (CHINA UNICOM China169 Backbone) | 50.38 TiB | 2.76% | 50.38 TiB | 0.00% |
| f02200472 | Chengdu, Sichuan, CN (CHINANET SiChuan Telecom Internet Data Center) | 49.47 TiB | 2.71% | 49.47 TiB | 0.00% |
| f02370792 | Shenzhen, Guangdong, CN (CHINANET-BACKBONE) | 147.31 TiB | 8.07% | 147.31 TiB | 0.00% |
| f02942808 | Jiangmen, Guangdong, CN (CHINANET-BACKBONE) | 81.00 TiB | 4.44% | 80.97 TiB | 0.04% |
| f02036170 | Hong Kong, Hong Kong, HK (HGC Global Communications Limited) | 181.38 TiB | 9.93% | 181.38 TiB | 0.00% |
| f02036171 | Hong Kong, Hong Kong, HK (HGC Global Communications Limited) | 141.44 TiB | 7.74% | 141.44 TiB | 0.00% |
| f02951064 | Hangzhou, Zhejiang, CN (JINHUA, ZHEJIANG Province, P.R.China) | 69.34 TiB | 3.80% | 69.34 TiB | 0.00% |
| f01025366 | Qingdao, Shandong, CN (Qingdao, Shandong Province, P.R.China) | 218.38 TiB | 11.96% | 218.38 TiB | 0.00% |
| f02951213 | Singapore, Singapore, SG (StarHub Ltd) | 101.31 TiB | 5.55% | 101.31 TiB | 0.00% |
| f03073961 | Văn Điển, Hanoi, VN (VNPT Corp) | 190.38 TiB | 10.42% | 190.38 TiB | 0.00% |
| f03073919 | Văn Điển, Hanoi, VN (VNPT Corp) | 188.63 TiB | 10.33% | 188.63 TiB | 0.00% |

Allocation 2

Client: Commoncrawl

Application issue: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/7

Storage report:

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals |
| --- | --- | --- | --- | --- | --- |
| f02036170 | Hong Kong, Hong Kong, HK (HGC Global Communications Limited) | 25 TiB | 25.00% | 25 TiB | 25.00% |
| f02036171 | Hong Kong, Hong Kong, HK (HGC Global Communications Limited) | 25 TiB | 25.00% | 25 TiB | 25.00% |
| f03073961 | Văn Điển, Hanoi, VN (VNPT Corp) | 25 TiB | 25.00% | 25 TiB | 25.00% |
| f03073919 | Văn Điển, Hanoi, VN (VNPT Corp) | 25 TiB | 25.00% | 25 TiB | 25.00% |

Allocation 3

Client: DataFortress

Application issue: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/12

This client has been allocated DataCap in stages. Metrics such as the allocated portion, duplicate ratio, and retrieval rate are currently performing well.

Storage report: https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/12/1718790129978.md

| Provider | Location | Total Deals Sealed | Percentage | Unique Data | Duplicate Deals | Mean Spark Retrieval Success Rate (7d) |
| --- | --- | --- | --- | --- | --- | --- |
| f01989014 | Kuala Lumpur, Kuala Lumpur, MY (Extreme Broadband - Total Broadband Experience) | 26.63 TiB | 21.04% | 26.63 TiB | 0.00% | - |
| f01989013 | Kuala Lumpur, Kuala Lumpur, MY (Extreme Broadband - Total Broadband Experience) | 26.56 TiB | 20.99% | 26.56 TiB | 0.00% | - |
| f02105010 | Kuala Lumpur, Kuala Lumpur, MY (Extreme Broadband - Total Broadband Experience) | 25.50 TiB | 20.15% | 25.50 TiB | 0.00% | - |
| f01989015 | Kuala Lumpur, Kuala Lumpur, MY (Extreme Broadband - Total Broadband Experience) | 20.38 TiB | 16.10% | 20.38 TiB | 0.00% | - |
| f02252024 | Asagaya-minami, Tokyo, JP (TOKAI Communications Corporation) | 9.56 TiB | 7.56% | 9.56 TiB | 0.00% | - |
| f02252023 | Asagaya-minami, Tokyo, JP (TOKAI Communications Corporation) | 9.31 TiB | 7.36% | 9.31 TiB | 0.00% | - |
| f01422327 | Asagaya-minami, Tokyo, JP (TOKAI Communications Corporation) | 8.63 TiB | 6.81% | 8.63 TiB | 0.00% | - |

joshua-ne commented 3 months ago

This screenshot shows that we are seeking help with collecting CIDs. Hopefully we can either get a free API to use, or we will develop and maintain our own database soon.

[screenshot]

joshua-ne commented 3 months ago

This screenshot shows the retrieval bot at work in our most recent client's application. In our allocation repo, when a comment contains the trigger keyword (trigger:run_retrieval_test) along with a link to a CSV file listing clients/miner_id/data_cids, the bot runs the retrieval test and comments back with the result.

[screenshot]
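For illustration, here is a minimal sketch of the trigger-and-CSV flow described above. The trigger keyword matches the one quoted; the regex, CSV column names, and function name are assumptions about how a bot like this might work, not its actual code.

```python
import csv
import io
import re
import urllib.request

TRIGGER = "trigger:run_retrieval_test"
CSV_LINK = re.compile(r"https?://\S+\.csv")

def parse_trigger_comment(body: str):
    """Return (miner_id, data_cid) pairs when a comment triggers a test, else None."""
    if TRIGGER not in body:
        return None  # not a trigger comment; the bot ignores it
    match = CSV_LINK.search(body)
    if match is None:
        raise ValueError("trigger keyword found, but no CSV link in the comment")
    with urllib.request.urlopen(match.group(0)) as resp:
        reader = csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8"))
        # Column names are assumed; the real bot defines its own CSV schema.
        return [(row["miner_id"], row["data_cid"]) for row in reader]
```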

joshua-ne commented 3 months ago

@Kevin-FF-USA Hi Kevin, this is our application for DC refill. Please let me know if you have any questions or if there is anything I can do. Thank you!

joshua-ne commented 3 months ago

And to update: we are improving our retrieval test bot so that anyone can use it. In the current design, besides the trigger keywords, the bot parses two items from the comment: 1. the URL of a CSV file containing the miner_id and data_cid for the current batch of deals; 2. (optional) a retrieval protocol, consisting of a retrieval command format and the output keywords that indicate a successful retrieval. If no retrieval protocol is provided, the bot tries some of the most common preset ones by default.
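A sketch of what that optional protocol item could look like, under assumed field names: a command template plus an output keyword that marks success, with preset fallbacks when none is supplied. The preset command lines (lassie, and a placeholder HTTP host) are illustrative, not the bot's real defaults.

```python
import shlex
import subprocess

# Assumed shape of a user-supplied retrieval protocol: a command template with
# {miner_id}/{data_cid} placeholders plus a keyword that marks success in stdout.
# An empty "ok" keyword means the exit status alone decides.
PRESETS = [
    {"cmd": "lassie fetch --output /dev/null {data_cid}", "ok": ""},
    {"cmd": "curl -sf http://{miner_id}.example.net/piece/{data_cid}", "ok": ""},  # placeholder host
]

def run_protocol(protocol: dict, miner_id: str, data_cid: str) -> bool:
    cmd = shlex.split(protocol["cmd"].format(miner_id=miner_id, data_cid=data_cid))
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    except subprocess.TimeoutExpired:
        return False
    if result.returncode != 0:
        return False
    return protocol["ok"] in result.stdout  # "" in s is always True

def test_retrieval(miner_id: str, data_cid: str, protocol: dict | None = None) -> bool:
    """Use the client-supplied protocol when given, else try the common presets."""
    candidates = [protocol] if protocol else PRESETS
    return any(run_protocol(p, miner_id, data_cid) for p in candidates)
```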

Also, we are developing a verification bot to check that the CSV files provided by clients indeed reflect the DataCap we have allocated to them. Ideally, though, it would be best if we could directly use some public API to do this, or to get the deal/data_cid list: say, given a client address and a miner_id, retrieve the data_cid list of deals between that client and miner_id.
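As one possible interim approach (a sketch under assumptions, not the team's design), the standard Lotus JSON-RPC method `Filecoin.StateMarketDeals` returns the full on-chain market deal map, which can then be filtered by client and provider. In practice that response is very large, which is exactly why a targeted public API as described above would be preferable. The node URL below is a placeholder.

```python
import json
import urllib.request

LOTUS_RPC = "http://127.0.0.1:1234/rpc/v0"  # placeholder Lotus node address

def deal_labels(client_addr: str, miner_id: str) -> list[str]:
    """List deal labels (typically the data/root CID) for one client-SP pair."""
    payload = json.dumps({
        "jsonrpc": "2.0", "id": 1,
        "method": "Filecoin.StateMarketDeals",
        "params": [None],  # null tipset key, i.e. the current chain head
    }).encode()
    req = urllib.request.Request(
        LOTUS_RPC, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        deals = json.load(resp)["result"]
    return [
        d["Proposal"]["Label"]
        for d in deals.values()
        if d["Proposal"]["Client"] == client_addr
        and d["Proposal"]["Provider"] == miner_id
    ]
```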

joshua-ne commented 3 months ago

@Kevin-FF-USA Hi, just checking in for any progress or updates. Our DataCap is running extremely low.

filecoin-watchdog commented 2 months ago

Allocator Compliance Report: https://compliance.allocator.tech/report/f03018029/1721399558/report.md

filecoin-watchdog commented 2 months ago

First example: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/1

This application is the ninth instance of USGS on Filecoin (link). Can you justify why this additional copy is needed on Filecoin?

Retrievals are 0% on all applications. https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/1/1719503552887.md

Why did the allocator continue to give DataCap? @joshua-ne

filecoin-watchdog commented 2 months ago

Second example: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/12

https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/12/1721400324045.md

0 retrievals

joshua-ne commented 2 months ago

> This application is the ninth copy of USGS on Filecoin. What additional value is added to deserve DataCap? How was the data prepared differently?

Hi @filecoin-watchdog, thank you for pointing this out. Indeed, we did not notice this when reviewing the application. When you say "ninth copy", do you mean the ninth client storing this dataset, which would make it at least 9 * 4 = 36 copies, or the ninth copy overall, i.e. two previous clients who each stored about four copies?

I totally agree with you that if about 30 copies already exist on the Filecoin network, it would not be appropriate to store 4-5 more. So I am wondering where you found this count: do we have a database or reference for it? If not, we might want to start building one so that allocators can check it conveniently. We might also want to set a reference upper bound on the number of copies for typical open datasets.

Anyway, thanks for pointing this out; we will definitely pay more attention to this issue in the future.

joshua-ne commented 2 months ago

And on the retrieval issue: as stated quite clearly in this application, we put a lot of effort and attention into making sure SPs' data is retrievable by the public, and we have also described how we are progressing toward a better and clearer path.

While I agree that SPARK is a good tool for testing SPs' retrieval rates, we have to admit that it has its limits, and many SPs face real costs and difficulties adjusting to the SPARK system. In our opinion:

  1. Blockchain is an open, democratic world, and it is usually a very bad sign if we rely too much on a single system; don't we all agree? That is why we are putting effort into building and promoting new tools to run in parallel.
  2. We believe that if an SP's data is retrievable, whatever method they provide, AND that method is usable by most of the public, then it should count as a pass. So instead of asking SPs to accommodate a single system, we are making our retrieval bot flexible: it can take the command arguments the SPs provide and run the test accordingly. Please watch for our upcoming releases!

filecoin-watchdog commented 2 months ago

> When you say "ninth copy", do you mean the ninth client storing this dataset, which would make it at least 9 * 4 = 36 copies, or the ninth copy overall, i.e. two previous clients who each stored about four copies?

It is the 9th instance of this dataset on Filecoin. Yes, each instance has many copies.

joshua-ne commented 2 months ago

Well, then I agree that would be too many copies. We will pay more attention to such situations in the future. Still, the community may need a better record for keeping track of such public datasets, along with guidelines on the number of active copies. A potential difficulty is checking how many previous copies are still both 'active' AND 'retrievable', instead of simply adding up all the numbers.

techgood123 commented 2 months ago

@filecoin-watchdog I am a DC applicant. I am basically the only one who discovered USGS and used it to apply for DC. Considering that many notaries might not agree to our application, we submitted USGS applications to multiple notaries. There is a lot of data on the USGS website, and it is still increasing. At present, we have not completely used the data, nor have we downloaded any data repeatedly. So, I don't quite agree with your point of view.

techgood123 commented 2 months ago

In terms of logic and action: we finally found a dataset, so we applied to 10 notaries at the same time. 50% of them may agree, but only 25% may continue to give us quota. Therefore, the issue of data duplication does not need to be considered for the time being.

joshua-ne commented 2 months ago

Hi @Kevin-FF-USA and @galen-mcandrew, as we discussed in the meeting, we will start working on the following to provide the community with a better alternative retrieval verifier:

  1. We will look into datacapstats.io to see whether we can directly pull the deals/data_cids between a client and SP pair. Hopefully that works; otherwise we may need more input for a database. Alternatively, it would be nice if existing organizations that already hold such databases (SPARK, as far as I know, is one of them) could share the data with the public.

  2. We are at the final stage of development and will open-source our retrieval test bot as soon as possible, so that everyone can review the code and run their own copy of it. There could be a slight delay, since we are expecting the Filecoin NV23 upgrade in the coming two weeks.

joshua-ne commented 2 months ago

Meanwhile, I would appreciate it if we could proceed further with my refill application. We put a lot of effort into keeping the balance between making sure the sealed datasets are publicly retrievable and making it easier and faster for both data clients and the majority of SPs to onboard new data.

galen-mcandrew commented 1 month ago

Based on an additional compliance review, it appears this allocator is attempting to work with public open dataset clients. However, the data associated with this pathway is not currently able to be retrieved at scale, and testing for retrieval is currently noncompliant. That said, the allocator is working to develop additional tooling to support alternative retrieval sampling.

We have consistently said that we would love to see multiple different bots and tools that assess different forms of compliance. We do not have the capacity to spec and build all these alternatives directly through the Foundation, which is why we look to support ecosystem partners, such as the Meridian team with Spark. I look forward to seeing more information about this NonEntropy retrieval bot, and perhaps there could be a demo at the next Governance call in August.

As a reminder, the allocator team is responsible for verifying, supporting, and intervening with their clients. If a client is NOT providing accurate deal-making info (such as incomplete or inaccurate SP details) or making deals with noncompliant unretrievable SPs, then the allocator needs to intervene and require client updates before more DataCap should be awarded.

Before we submit a request for more DataCap for this allocator, please verify that you will instruct, support, and require your clients to work with retrievable storage providers. If so, we will request an additional 5 PiB of DataCap from RKH, to allow this allocator to show increased diligence and alignment.

@joshua-ne can you verify that you will enforce retrievability requirements, such as through Spark?

Please reply here with acknowledgement and any additional details for our review.

joshua-ne commented 1 month ago

@galen-mcandrew Yes, I WILL enforce retrievability requirements, such as through Spark.

Meanwhile, we are developing a new retrieval bot for the community to use, as described here: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/22

We are almost done with development and are now handling possible unexpected errors. I would be happy to give a preview demo at the coming Governance call. Thanks!

galen-mcandrew commented 1 month ago

@joshua-ne & @Kevin-FF-USA we are really excited to see additional retrieval bot options, and look forward to a demo!