Open martapiekarska opened 4 months ago
Application is waiting for allocator review
@lyjmry regarding data preparation procedures, can you clarify:
Hi @kevzak
1. All open source data sets will be saved.
2. The data is AI pre-training files, including 3D models, videos, pictures, natural language, compressed packages, etc.
3. The data is downloaded with the openxlab CLI:
pip install openxlab  # Install
pip install -U openxlab  # Upgrade
openxlab login  # Log in and enter the corresponding AK/SK. Please view AK/SK at usercenter
openxlab dataset info --dataset-repo OpenDataLab/WanJuanCC  # View dataset information and the dataset file list
openxlab dataset get --dataset-repo OpenDataLab/WanJuanCC  # Dataset download
openxlab dataset download --dataset-repo OpenDataLab/WanJuanCC --source-path /README.md --target-path /path/to/local/folder  # Dataset file download
You can use my Access Key: 4vvwxn8abjvln190egej Secret Key: pwbmmren7dx2x6848rmdo026prqgjoqdkwyvaejk
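The CLI steps above could be scripted roughly as follows. This is only a sketch: it assumes `openxlab login` has already been run, it uses only the subcommands and flags quoted in the comment, and the helper names `build_commands` and `run_all` are hypothetical. By default it prints the commands instead of executing them.

```python
import shlex
import subprocess

# Repository name taken from the comment above.
DATASET_REPO = "OpenDataLab/WanJuanCC"


def build_commands(target_dir, source_path="/README.md"):
    """Return the openxlab CLI calls quoted above as argv lists (hypothetical helper)."""
    return [
        # View dataset information and the file list.
        ["openxlab", "dataset", "info", "--dataset-repo", DATASET_REPO],
        # Download a single file from the dataset into target_dir.
        ["openxlab", "dataset", "download", "--dataset-repo", DATASET_REPO,
         "--source-path", source_path, "--target-path", target_dir],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            # Assumes the openxlab CLI is installed and already logged in.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(build_commands("/path/to/local/folder"))
```

Keeping the commands as argv lists avoids shell quoting issues if a target path contains spaces.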
This is the first part: how you get data from the OpenDataLab repositories.
Please also include documentation on the second part: how the data is transformed into deals for Filecoin. When a deal is sampled for verification, how will we be able to confirm that it is part of this dataset? (How is it chunked into CAR files?)
Hi I have the same problem, I have explained it in https://github.com/fidlabs/Open-Data-Pathway/issues/43#issuecomment-2200588832
Hi, can this application be pushed forward?
Total DataCap requested
5PiB
Expected weekly DataCap usage rate
1000TiB
DataCap Amount - First Tranche
50TiB
Client address
f1tcdiusiow3wwraen7rnmiog2akmgkccnqqxftta
50TiB
349b75af-f622-4e63-9d80-fdbc527619bc
Application is ready to sign
Your Datacap Allocation Request has been approved by the Notary
bafy2bzacedc5slc73vnr3hybqnavakxma6koj2a27ant2sgq3ci3cwsordaa4
Address
f1tcdiusiow3wwraen7rnmiog2akmgkccnqqxftta
Datacap Allocated
50TiB
Signer Address
f1v24knjbqv5p6qrmfjj5xmlaoddzqnon2oxkzkyq
Id
349b75af-f622-4e63-9d80-fdbc527619bc
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedc5slc73vnr3hybqnavakxma6koj2a27ant2sgq3ci3cwsordaa4
Application is Granted
Thank you
Client used 75% of the allocated DataCap. Consider allocating next tranche.
Request allocation for next round
This has been received and will be reviewed over the next few days, thank you.
checker:manualTrigger
✔️ Storage provider distribution looks healthy.
✔️ Data replication looks healthy.
✔️ No CID sharing has been observed.
[^1]: To manually trigger this report, add a comment with text checker:manualTrigger
[^2]: Deals from those addresses are combined into this report as they are specified with checker:manualTrigger
[^3]: To manually trigger this report with deals from other related addresses, add a comment with text checker:manualTrigger <other_address_1> <other_address_2> ...
Click here to view the CID Checker report.
@lyjmry The retrieval rate looks good, but all SPs are in the same region, and you only use 3 SPs (while 5 were declared). Moreover, none of them match the original SPs list. Could you explain?
Hi @martplo, thank you for your attention.
1: One is in mainland China and the other is in Hong Kong, China.
2: Since the current quota is very small, spreading it across every SP would increase our costs. We plan to divide the data evenly once a larger quota is allocated, which is why we started with 3 SPs.
3: Because the approval took time, the SPs we contacted earlier had already sealed other clients' datasets, so we missed them. We will update the SP list in the next round.
Application is in Refill
Total DataCap requested
5PiB
Expected weekly DataCap usage rate
1000TiB
DataCap Amount - First Tranche
200TiB
Client address
f1tcdiusiow3wwraen7rnmiog2akmgkccnqqxftta
200TiB
198b2111-d04f-4501-b192-6250cab04bce
Application is ready to sign
Your Datacap Allocation Request has been approved by the Notary
bafy2bzacedgitvt2g2wd5aptazz6qwblhmkxnzllp733bcuhhrcxgsp6a3afe
Address
f1tcdiusiow3wwraen7rnmiog2akmgkccnqqxftta
Datacap Allocated
200TiB
Signer Address
f1msap4wvgzzv4xlzeq6kycmgx55ferfloxnt2rcy
Id
198b2111-d04f-4501-b192-6250cab04bce
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacedgitvt2g2wd5aptazz6qwblhmkxnzllp733bcuhhrcxgsp6a3afe
Application is Granted
The following is the latest SP information:
f03202108 Hong Kong
f03157910 Shenzhen, Guangdong, China
f09693 Leshan, Sichuan, China
f03078786 Hong Kong
f03215853 Portland, Oregon, United States
f03218576 Portland, Oregon, United States
Client used 75% of the allocated DataCap. Consider allocating next tranche.
checker:manualTrigger
✔️ Storage provider distribution looks healthy.
⚠️ 40.00% of Storage Providers have retrieval success rate equal to zero.
⚠️ 40.00% of Storage Providers have retrieval success rate less than 75%.
✔️ Data replication looks healthy.
✔️ No CID sharing has been observed.
@lyjmry Those two US SPs have a retrieval rate of 0%. Could you explain?
Hi @martplo, we are communicating with the SPs because there is a problem with their Boost integration.
We're still troubleshooting the issue and trying to resolve it.
@lyjmry ok, keep me posted.
checker:manualTrigger
✔️ Storage provider distribution looks healthy.
⚠️ 20.00% of Storage Providers have retrieval success rate equal to zero.
⚠️ 40.00% of Storage Providers have retrieval success rate less than 75%.
✔️ Data replication looks healthy.
✔️ No CID sharing has been observed.
Hello @martplo, it seems that the SP has completed the repair; it just takes a while for the retrieval rate to recover.
checker:manualTrigger
✔️ Storage provider distribution looks healthy.
⚠️ 20.00% of Storage Providers have retrieval success rate equal to zero.
⚠️ 40.00% of Storage Providers have retrieval success rate less than 75%.
✔️ Data replication looks healthy.
✔️ No CID sharing has been observed.
checker:manualTrigger
✔️ Storage provider distribution looks healthy.
⚠️ 40.00% of Storage Providers have retrieval success rate less than 75%.
✔️ Data replication looks healthy.
✔️ No CID sharing has been observed.
It looks like the SP has resolved the Boost service integration issue.
The growth of the retrieval rate is noticeable. I'll keep observing that. The next tranche will be 768 TiB (15%).
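As a quick check of the tranche size quoted above, the arithmetic can be reproduced in a few lines. This sketch assumes the 15% is taken from the 5 PiB total request and that binary units apply (1 PiB = 1024 TiB); the function name `tranche_tib` is hypothetical.

```python
TIB_PER_PIB = 1024  # binary units: 1 PiB = 1024 TiB


def tranche_tib(total_pib, percent):
    """Size of one tranche in TiB, as a percentage of the total request in PiB."""
    return total_pib * TIB_PER_PIB * percent / 100


# 15% of the 5 PiB total requested in this application.
print(tranche_tib(5, 15))
```

Under those assumptions the result matches the 768 TiB stated in the comment.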
Application is in Refill
Your Datacap Allocation Request has been approved by the Notary
bafy2bzacecg5uxkt5vkeayhh3yunv2vxnvy6vxjxirlnavozvsxgkaotck2de
Address
f1tcdiusiow3wwraen7rnmiog2akmgkccnqqxftta
Datacap Allocated
768TiB
Signer Address
f1msap4wvgzzv4xlzeq6kycmgx55ferfloxnt2rcy
Id
de6a353c-27b5-43ec-9e3f-1e48fe4e843f
You can check the status of the message here: https://filfox.info/en/message/bafy2bzacecg5uxkt5vkeayhh3yunv2vxnvy6vxjxirlnavozvsxgkaotck2de
Application is Granted
Client used 75% of the allocated DataCap. Consider allocating next tranche.
checker:manualTrigger
✔️ Storage provider distribution looks healthy.
⚠️ 12.50% of Storage Providers have retrieval success rate equal to zero.
⚠️ 62.50% of Storage Providers have retrieval success rate less than 75%.
✔️ Data replication looks healthy.
✔️ No CID sharing has been observed.
[f03220176] This SP is using Boost for the first time, and we are teaching them how to support retrieval. We have already seen successful retrievals; Spark detection may take some time to reflect this.
Version
2024-07-16T08:22:47.622Z
DataCap Applicant
@lyjmry
Data Owner Name
OpendataLab
Data Owner Country/Region
IT & Technology Services
Website
https://opendatalab.org.cn
Social Media Handle
oepndatalab
Social Media Type
WeChat
What is your role related to the dataset
Data Preparer
Total amount of DataCap being requested
5PiB
Expected size of single dataset (one copy)
700TiB
Number of replicas to store
7
Weekly allocation of DataCap requested
1000TiB
On-chain address for first allocation
f1tcdiusiow3wwraen7rnmiog2akmgkccnqqxftta
Data Type of Application
Public, Open Dataset (Research/Non-Profit)
Identifier
1213
Share a brief history of your project and organization
Project: OpenDataLab, established by the Shanghai AI Lab's large model database team, is the chosen platform for the Chinese Large Model Corpus Data Alliance's open data services. It offers comprehensive AI data support to developers, mitigating data processing risks and fostering AI research and applications.
About me: I joined the Filecoin ecosystem in 2019 and have been active in the community since the second half of 2020. I have contributed to Filecoin Greater China by providing resource information reports, resource matchmaking, etc.
Is this project associated with other projects/ecosystem stakeholders?
No
If answered yes, what are the other projects/ecosystem stakeholders
provides high-quality open datasets for large models
Where was the data currently stored in this dataset sourced from
Other
If you answered "Other" in the previous question, enter the details here
OpenDataLab provides storage services and CLI/SDK download methods
If you are a data preparer. What is your location (Country/Region)
China
If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?
I will coordinate closely with the storage providers, including downloading the data to my local hard disks for offline deals. For distant providers with sufficient bandwidth, the storage provider will download the data from AWS, using tools including but not limited to Boost.
If you are not preparing the data, who will prepare the data? (Provide name and business)
N/A
Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.
Yes. I found that someone has already applied with this dataset at https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/1693, but that applicant appears to have violated community rules by not completing the plan.
Please share a sample of the data
https://opendatalab.org.cn/OpenDataLab/CityScapes/tree/main/raw
https://opendatalab.org.cn/AI4Chem/ChemData700K/tree/main
https://opendatalab.org.cn/OpenXDLab/RenderMe-360/tree/main/raw
Confirm that this is a public dataset that can be retrieved by anyone on the Network
Confirm
If you chose not to confirm, what was the reason
What is the expected retrieval frequency for this data
Yearly
For how long do you plan to keep this dataset stored on Filecoin
Permanently
In which geographies do you plan on making storage deals
Greater China, Asia other than Greater China
How will you be distributing your data to storage providers
Cloud storage (i.e. S3), HTTP or FTP server, Shipping hard drives
How did you find your storage providers
Partners
If you answered "Others" in the previous question, what is the tool or platform you used
Wechat
Please list the provider IDs and location of the storage providers you will be working with.
f03144077 Hong Kong
f03136267 Hong Kong
f03035686 Shenzhen, Guangdong, China
f03068013 Hong Kong
f03148950 Singapore
other
How do you plan to make deals to your storage providers
Boost client, Lotus client
If you answered "Others/custom tool" in the previous question, enter the details here
Can you confirm that you will follow the Fil+ guideline
Yes