Open Zzbaoo opened 2 weeks ago
Application is waiting for allocator review
- The links provided are not to the data set
- It is unclear what data specifically is proposed to be stored, or how
- This application is meant for data that is already publicly available; "customer behavior data" is not typically of this type.
Samples of the dataset are as follows:
https://pan.baidu.com/s/1NjRFt8JO6GnDtIjLc8GFGA?pwd=587t
https://pan.baidu.com/s/1DZbJQUTpzj-T64MHZ_Mxmw?pwd=az56
https://pan.baidu.com/s/1Mm2R_nIm3tQZ8EFL3Xdtzg?pwd=xfb2
https://pan.baidu.com/s/1Qah8AA24XNWk37aTtdd8aA?pwd=h8p8
https://pan.baidu.com/s/1MOn4gklUjjMUIKmW6VF3Aw?pwd=b79g
This dataset sample consists of surveillance videos covering retail transactions and customer interactions, along with sensor data collected through our digital monitoring services. It can be used for machine learning model training, intelligent analysis, public safety, and more.
The data is currently stored on our internal distributed system. We plan to migrate this data to the Filecoin network with the following steps:
[Partial screenshots]
KYC has been requested. Please complete KYC at https://kyc.allocator.tech/?owner=fidlabs&repo=Open-Data-Pathway&client=f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq&issue=74
@Zzbaoo see above, we're asking you to complete a KYC humanity check. Let me know if you have any questions.
KYC completed for client address f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq
with Optimism address 0x420AdE7b30e18e4FE316954Cd7160D6DADD0014a
and passport score 20.
@Zzbaoo Thank you for completing KYC. As the next step, please answer the questions about data preparation.
Regarding data preparation procedures, could you please have the data preparer clarify:
Total DataCap requested
6PiB
Expected weekly DataCap usage rate
500TiB
DataCap Amount - First Tranche
50TiB
Client address
f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq
Client address: f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq
DataCap allocation requested: 50TiB
Id: 4a495807-277b-48ce-872f-0e2077bbee50
Application is ready to sign
Once you complete the data prep questions, let me know.
Also, one note: the max first allocation for a new user is 50 TiB, and after that we will support up to 5 PiB initially. If everything meets the guidelines, we will work toward the full amount requested.
@Zzbaoo Thank you for completing KYC. As the next step, please answer the questions about data preparation.
Regarding data preparation procedures, could you please have the data preparer clarify:
- what specifically are the datasets from that site that you are committing to store? all of it, or a specified enumeration?
- what is the transformation from the files available for download and what will be stored on filecoin?
- when we sample your deals, how will we be able to confirm that the data came from this dataset?
- how is the data transformed into deals for filecoin? when a deal is sampled for verification, how will we be able to confirm that it is part of this dataset? (how is it chunked into car files?)
I would like to add some storage nodes: f03151449 (Shenzhen, China), f03151456 (Shenzhen, China), f03179555 (Singapore), f03179570 (Singapore), f03178077 (Tokyo, Japan), and f03178144 (Tokyo, Japan). I have updated these in the application form above.
Through these CIDs, you can trace the data blocks included in each transaction
How?
(OLD vs NEW)
Please list the provider IDs and location of the storage providers you will be working with:
f03178077 US, f03178144 SG, f03178150 JP, f03178158 US, f03151449 CN, f03151456 CN, f03179555 SG, f03179570 SG, f03178077 JP, f03178144 JP
vs
f03178077 US, f03178144 SG, f03178150 JP, f03178158 US
State: Submitted vs ReadyToSign
You can obtain this data through tools such as HTTP and Lassie. We will create indexes for this data and ensure it meets the requirements for Spark and HTTP retrieval. The retrieved data can be restored using go-graphsplit, and the restored data's consistency can be ensured through checksums.
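The consistency check described here can be sketched as follows. This is an illustrative example only, not the applicant's actual tooling; the file path and expected digest in the usage are hypothetical, and SHA-256 is assumed as the checksum algorithm.

```python
import hashlib

def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large video files are never fully loaded into RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_restored_file(path: str, expected_digest: str) -> bool:
    """Compare the digest of a file restored (e.g. via go-graphsplit) against a published checksum."""
    return sha256_of_file(path) == expected_digest.lower()
```

Usage would be along the lines of `verify_restored_file("restored/cam1.mp4", published_digest)`, where the published digest would have to come from a checksum list the client makes available alongside the dataset.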
I download one of your deals. How do i know it's really part of your dataset? will i be able to get individual files in each individual deal, and match them back to what's provided in baidu pan? could it be the case that a deal would have only part of a file? how would i know what other deals i need to also get in order to reconstruct a file?
You haven't given me enough information to have any confidence yet in feeling like i can audit what you store.
There are several ways to determine if the data you downloaded is part of our dataset. The simplest method is to match the checksum. Another method is to match the payload CID, which addresses the data segments in the CAR file. Using the payload CID is the most reliable method as it directly references the content.
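For illustration of why a payload CID directly references content: a CID is just a self-describing wrapper around a hash of the bytes it addresses. The sketch below builds a raw-codec CIDv1 from a SHA-256 digest using only the standard library. Note this is an assumption-laden toy: graphsplit's payload CIDs are dag-pb/UnixFS roots, so this construction will not reproduce them, but it shows that matching a CID is equivalent to matching the underlying content hash.

```python
import base64
import hashlib

def cidv1_raw_sha256(data: bytes) -> str:
    """Build a CIDv1 (raw codec, sha2-256, base32 multibase) for a blob of bytes."""
    digest = hashlib.sha256(data).digest()
    multihash = bytes([0x12, 0x20]) + digest      # 0x12 = sha2-256, 0x20 = 32-byte digest
    cid_bytes = bytes([0x01, 0x55]) + multihash   # 0x01 = CIDv1, 0x55 = raw codec
    # multibase prefix "b" marks lowercase base32 without padding
    return "b" + base64.b32encode(cid_bytes).decode().lower().rstrip("=")
```

Two parties who independently compute this from the same bytes get the same string, which is what makes CID matching a content check rather than a label check.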
You raised a concern about needing to download all related deals to reconstruct a file from a large dataset that has been split. This can be challenging, as you may not know which specific deals contain the data needed to reconstruct the original file. We address this issue as follows: When we use go-graphsplit to split the data, we do not split the entire dataset as a whole. Instead, we ensure that individual files are pre-organized into chunks of the size required by Filecoin. Since we are working with video stream data, we can easily split it in this way. As a result, you do not need to retrieve multiple deals to reconstruct a single file. Additionally, the Graphsplit tool generates a manifest.csv that maps file names, payload CIDs, piece CIDs, and the internal file structure. This allows us to verify the consistency of the stored files using the manifest and CIDs.
Many people, for convenience, split files arbitrarily, which results in incomplete files across different deals and complicates the retrieval process. This means you might need multiple deals to reconstruct a single file, and it’s often unclear from which deals to retrieve the necessary data. This practice undermines the original intent of Filecoin Plus. In contrast, we process the data in advance to ensure that each file is stored as a complete unit within a single deal. This approach ensures that when you retrieve and reconstruct data, you are working with complete files rather than fragments.
You're just answering with ChatGPT. It's a lot of words, but it isn't answering my concern.
Could you please be more specific about what you are concerned about? For example, are you worried that we will abuse the data, or that we will not handle the data well, or that the data duplication rate is too high?
We are concerned that you are applying to our Open Public Data Allocator and the data you are storing is not open source. According to our allocator rules, we need the dataset to be public and available
Data Owner Name
Bond
Data Owner Country/Region
China
Data Owner Industry
Information, Media & Telecommunications
Website
http://www.gdysd.cn/
Social Media Handle
hpjnba@gmail.com
Social Media Type
Other
What is your role related to the dataset
Dataset Owner
Total amount of DataCap being requested
6PiB
Expected size of single dataset (one copy)
1PiB
Number of replicas to store
6
Weekly allocation of DataCap requested
500TiB
On-chain address for first allocation
f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq
Data Type of Application
Public, Open Commercial/Enterprise
Custom multisig
Identifier
No response
Share a brief history of your project and organization
Is this project associated with other projects/ecosystem stakeholders?
No
If answered yes, what are the other projects/ecosystem stakeholders
No response
Describe the data being stored onto Filecoin
Where was the data currently stored in this dataset sourced from
My Own Storage Infra
If you answered "Other" in the previous question, enter the details here
No response
If you are a data preparer. What is your location (Country/Region)
None
If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?
No response
If you are not preparing the data, who will prepare the data? (Provide name and business)
No response
Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.
No response
Please share a sample of the data
Confirm that this is a public dataset that can be retrieved by anyone on the Network
If you chose not to confirm, what was the reason
No response
What is the expected retrieval frequency for this data
Daily
For how long do you plan to keep this dataset stored on Filecoin
More than 3 years
In which geographies do you plan on making storage deals
Asia other than Greater China, North America
How will you be distributing your data to storage providers
Cloud storage (i.e. S3), Shipping hard drives
How did you find your storage providers
Partners
If you answered "Others" in the previous question, what is the tool or platform you used
No response
Please list the provider IDs and location of the storage providers you will be working with.
How do you plan to make deals to your storage providers
Boost client, Lotus client, Droplet client
If you answered "Others/custom tool" in the previous question, enter the details here
No response
Can you confirm that you will follow the Fil+ guideline
Yes