This screenshot shows that we are seeking help with the collected CIDs. Hopefully we can either get a free public API to use, or we will develop and maintain our own database soon.
This screenshot shows the retrieval bot in action on our most recent client's application. In our allocation repo, when a comment contains the trigger phrase (trigger:run_retrieval_test) along with a link to a CSV file listing client/miner_id/data_cids, the bot runs the retrieval test and comments back with the result.
@Kevin-FF-USA Hi Kevin, this is our application for DC refill. Please let me know if you have any questions or if there is anything I can do. Thank you!
And to update: we are improving our retrieval test bot so that anyone can use it. In the current design, besides the trigger keywords, the bot parses two items from the comment: 1. the URL of a CSV file containing the miner_id and data_cid for the current batch of deals; 2. (optional) a retrieval protocol, consisting of a retrieval command format and the output keywords that indicate a successful retrieval. If no retrieval protocol is provided, the bot falls back on a set of the most common preset ones.
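The comment-parsing flow described above could be sketched roughly as follows. The trigger syntax, the `csv:`/`protocol:` line format, and the preset protocol entries are all assumptions for illustration, not the bot's documented interface:

```python
import csv
import io
import re

TRIGGER = "trigger:run_retrieval_test"

# Preset protocols to fall back on when the comment names none.
# These command templates and success keywords are illustrative guesses.
PRESET_PROTOCOLS = [
    {"name": "lassie", "command": "lassie fetch {data_cid}",
     "success_keyword": "Fetched"},
    {"name": "http", "command": "curl -sf https://gateway.example/ipfs/{data_cid}",
     "success_keyword": ""},
]

def parse_trigger_comment(comment: str):
    """Extract the CSV URL and optional protocol spec from a trigger comment.

    Assumed comment shape:
        trigger:run_retrieval_test
        csv: https://example.com/deals.csv
        protocol: lassie fetch {data_cid} | Fetched
    """
    if TRIGGER not in comment:
        return None
    url_match = re.search(r"csv:\s*(\S+)", comment)
    if not url_match:
        return None
    proto_match = re.search(r"protocol:\s*(.+?)\s*\|\s*(.+)", comment)
    protocol = None
    if proto_match:
        protocol = {"name": "custom",
                    "command": proto_match.group(1),
                    "success_keyword": proto_match.group(2)}
    return {"csv_url": url_match.group(1),
            "protocols": [protocol] if protocol else PRESET_PROTOCOLS}

def parse_deals_csv(text: str):
    """Parse (miner_id, data_cid) rows from the client-provided CSV body."""
    return [(row["miner_id"], row["data_cid"])
            for row in csv.DictReader(io.StringIO(text))]
```

A comment with no `protocol:` line yields the full preset list, so the bot can try each one in turn.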
Also, we are developing a verification bot to check that the CSV files provided by clients actually reflect the DataCap we have allocated to them. Ideally, though, we could use a public API to do this directly, or at least to fetch the deal/data_cid list: given a client address and a miner_id, retrieve the data_cids of the deals between that client and miner.
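The cross-check at the heart of such a verification bot is simple set logic, independent of where the on-chain deal list comes from. A minimal sketch, assuming the deal list has already been fetched from some public source:

```python
def verify_client_csv(csv_deals, chain_deals):
    """Cross-check client-reported (miner_id, data_cid) pairs against the
    deal list recorded for that client (fetched separately, e.g. from a
    public API -- the fetching step is out of scope here).

    Returns the pairs the client reported that are NOT found among the
    recorded deals; an empty list means the CSV is consistent.
    """
    recorded = set(chain_deals)
    return [pair for pair in csv_deals if pair not in recorded]
```

An empty return value would let the bot approve the CSV; any leftover pairs would be flagged back to the client for correction.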
@Kevin-FF-USA Hi, just checking in to see whether there is any progress or updates. Our DataCap is running extremely low.
Allocator Compliance Report: https://compliance.allocator.tech/report/f03018029/1721399558/report.md
First example: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/1
This application is the ninth instance of USGS on Filecoin LINK. Can you justify why this additional copy is needed on Filecoin?
Retrievals are 0% on all applications. https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/1/1719503552887.md
Why did the allocator continue to give DataCap? @joshua-ne
This application is the ninth copy of USGS on Filecoin. What additional value is added to deserve DataCap? How was the data prepared differently?
Hi @filecoin-watchdog, thank you for pointing this out. Indeed, we did not notice this when reviewing the application. When you say "ninth copy", do you mean the ninth client storing this dataset (which would make at least 9 * 4 = 36 copies), or just two previous clients who each stored about 4 copies?
I totally agree that if about 30 copies already exist on the Filecoin network, it would not be appropriate to store 4-5 more. So I am wondering where you found this count: do we have a database or reference for it? If not, we might want to start building one for allocators' convenience, and also set a reference upper bound on the number of copies for typical open datasets.
Anyway, thanks for pointing this out; we will definitely pay more attention to this issue in the future.
As for the issue of retrieval, as stated in this application, we put a lot of effort and attention into making sure the SPs' data is retrievable by the public, and we also described how we are progressing toward a better and clearer path.
While I agree that SPARK is a good tool for testing SPs' retrieval rates, we have to admit it has its limits, and many SPs face real costs and difficulties in adapting to the SPARK system. In our opinion:
9th instance of this dataset on Filecoin. Yes each instance has many copies.
Well, then I agree that would be too many copies, and we will pay more attention to such situations in the future. Still, the community may need a better record for tracking such public datasets, along with guidelines on the number of active copies. One difficulty would be checking how many previous copies are still both 'active' AND 'retrievable', rather than simply adding up all the numbers.
@filecoin-watchdog I am a DC applicant. I am essentially the only one who discovered USGS and used it to apply for DC. Considering that many notaries might not approve our application, we submitted USGS applications to multiple notaries. There is a lot of data on the USGS website, and it is still growing. At present, we have neither fully used the data nor downloaded it repeatedly. So I do not quite agree with your point of view.
In terms of logic and action: when we finally find a dataset, we apply to 10 notaries at the same time. Perhaps 50% of them agree, but only 25% may continue to give us quota. Therefore, the issue of data duplication does not need to be considered for the time being.
Hi, @Kevin-FF-USA and @galen-mcandrew , as we discussed in the meeting, we would start working on the following things to provide a better alternative retrieval verifier to the community:
We will look into datacapstats.io to see if we can directly pull the deals/data CIDs for a given client/SP pair. Hopefully that works; otherwise we may need more input for a database. Alternatively, it would be helpful if existing organizations that already maintain such databases, SPARK being one as far as I know, could share the data with the public.
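Whatever the data source turns out to be, the client-side filtering step is straightforward. A sketch of narrowing a list of deal records down to one client/SP pair; the record field names (`client`, `provider`, `pieceCid`) are assumptions about the response shape, not any API's documented schema:

```python
def deals_between(deals, client, provider):
    """Return the piece CIDs of deals between one client and one SP.

    `deals` is a list of dicts as might be decoded from a JSON response;
    the key names used here are illustrative assumptions.
    """
    return [d["pieceCid"] for d in deals
            if d.get("client") == client and d.get("provider") == provider]
```

With this in place, the verification bot only needs a fetch step in front to obtain `deals` for the client in question.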
We are in the final stage of development and will open-source our retrieval test bot as soon as possible, so that everyone can review the code and run their own copy. There could be a slight delay, since we are expecting the Filecoin NV23 upgrade in the coming two weeks.
Meanwhile, I would appreciate it if we could move my refill application forward. We put a lot of effort into balancing two goals: making sure the sealed datasets are publicly retrievable, and making it easier and faster for data clients and the majority of SPs to onboard new data.
Based on an additional compliance review, it appears this allocator is attempting to work with public open dataset clients. However, the data associated with this pathway is not currently able to be retrieved at scale, and testing for retrieval is currently noncompliant. That said, the allocator is working to develop additional tooling to support alternative retrieval sampling.
We have consistently said that we would love to see multiple different bots and tools that assess different forms of compliance. We do not have the capacity to spec and build all these alternatives directly through the Foundation, which is why we look to support ecosystem partners, such as the Meridian team with Spark. I look forward to seeing more information about this NonEntropy retrieval bot, and perhaps there could be a demo at the next Governance call in August.
As a reminder, the allocator team is responsible for verifying, supporting, and intervening with their clients. If a client is NOT providing accurate deal-making info (such as incomplete or inaccurate SP details) or making deals with noncompliant unretrievable SPs, then the allocator needs to intervene and require client updates before more DataCap should be awarded.
Before we submit a request for more DataCap to this allocator, please verify that you will instruct, support, and require your clients to work with retrievable storage providers. If so, we will request an additional 5PiB of DataCap from RKH, to allow this allocator to show increased diligence and alignment.
@joshua-ne can you verify that you will enforce retrievability requirements, such as through Spark?
Please reply here with acknowledgement and any additional details for our review.
@galen-mcandrew Yes, I WILL enforce retrievability requirements, such as through Spark.
Meanwhile, we are developing a new retrieval bot for the community to use, as described here: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/22
We are almost done with development and are in the process of handling possible unexpected errors. I would be willing to give a preview demo at the coming Governance call. Thanks!
@joshua-ne & @Kevin-FF-USA we are really excited to see additional retrieval bot options, and look forward to a demo!
Allocator Application: filecoin-project/notary-governance#1022
b. Support for different systems, like Lotus/Venus/Curio, etc.
c. Support for deals made with DDO.
d. Support for evaluating selected data_cids, since clients should only be responsible for the retrieval of their own data_cids, not for all historical deals.
We are also actively developing new tools and processes to make this evaluation both strict and convenient for our clients and their SPs. For now, since we are not yet ready to collect all CIDs from the network, we require our clients to publicly report the CIDs of their deals, and we use those CIDs with our newly developed bots to run retrieval evaluations. We are still actively working to solve the problems listed above; once the tool is ready, we will open-source it to the community.
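The per-CID evaluation step reduces to filling in a protocol's command template and checking the output for its success keyword. A minimal sketch, reusing the assumed protocol shape (a `command` template with `{miner_id}`/`{data_cid}` placeholders plus a `success_keyword`); this is an illustration, not the bot's actual implementation:

```python
import shlex
import subprocess

def run_retrieval_test(protocol, miner_id, data_cid, timeout=60):
    """Run one retrieval attempt and decide success by exit code plus the
    protocol's output keyword. An empty keyword means exit code alone
    decides."""
    cmd = protocol["command"].format(miner_id=miner_id, data_cid=data_cid)
    try:
        proc = subprocess.run(shlex.split(cmd), capture_output=True,
                              text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Treat a hung or missing retrieval client as a failed retrieval.
        return False
    output = proc.stdout + proc.stderr
    return proc.returncode == 0 and protocol.get("success_keyword", "") in output
```

Iterating this over the (miner_id, data_cid) pairs from the client's CSV, once per candidate protocol, yields the per-SP retrieval result the bot comments back with.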
====================================================================== CURRENT ALLOCATIONS
1st Allocation
Client: TechToolbox--01
Application issue: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/1
The client added SPs several times in the issue, and the actual SPs are consistent with those updates. The DataCap allocations look good as well.
Storage report: https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/1/1715082862262.md
Note: there are some major differences between the SP list provided in the application and the SPs listed below. However, the application issue thread shows our communication with the client about the SP changes. We will ask clients to update their applications promptly in the future.
CHINA UNICOM China169 Backbone
CHINA UNICOM China169 Backbone
CHINA UNICOM China169 Backbone
CHINANET SiChuan Telecom Internet Data Center
CHINANET-BACKBONE
CHINANET-BACKBONE
HGC Global Communications Limited
HGC Global Communications Limited
JINHUA, ZHEJIANG Province, P.R.China.
Qingdao, Shandong Province, P.R.China.
StarHub Ltd
VNPT Corp
VNPT Corp
Allocation 2
Client: Commoncrawl
Application issue: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/7
HGC Global Communications Limited
HGC Global Communications Limited
VNPT Corp
VNPT Corp
Allocation 3
Client: DataFortress
Application issue: https://github.com/joshua-ne/FIL_DC_Allocator_1022/issues/12
The client has been allocated DataCap in stages. Metrics such as the allocated portion, duplicate ratio, and retrieval rate are currently performing well.
Storage report: https://check.allocator.tech/report/joshua-ne/FIL_DC_Allocator_1022/issues/12/1718790129978.md
Extreme Broadband - Total Broadband Experience
Extreme Broadband - Total Broadband Experience
Extreme Broadband - Total Broadband Experience
Extreme Broadband - Total Broadband Experience
TOKAI Communications Corporation
TOKAI Communications Corporation
TOKAI Communications Corporation