fidlabs / Open-Data-Pathway


[DataCap Application] <Cloud Era Cloud Monitoring> - <AI Intelligent Recognition> #74

Open Zzbaoo opened 2 weeks ago

Zzbaoo commented 2 weeks ago

Data Owner Name

Bond

Data Owner Country/Region

China

Data Owner Industry

Information, Media & Telecommunications

Website

http://www.gdysd.cn/

Social Media Handle

hpjnba@gmail.com

Social Media Type

Other

What is your role related to the dataset

Dataset Owner

Total amount of DataCap being requested

6PiB

Expected size of single dataset (one copy)

1PiB

Number of replicas to store

6

Weekly allocation of DataCap requested

500TiB

On-chain address for first allocation

f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq

Data Type of Application

Public, Open Commercial/Enterprise

Custom multisig

Identifier

No response

Share a brief history of your project and organization

Cloud Era Cloud Monitoring Project:
We are a professional digital-upgrade service provider and operator for the retail industry. We focus on addressing pain points in China's 5.4-trillion-yuan physical retail market, providing integrated marketing, transaction, management, and revenue-enhancement solutions for convenience stores and branded chain enterprises.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

Machine learning models and training data, image and video data, sensor data, customer behavior data, etc.

Where was the data currently stored in this dataset sourced from

My Own Storage Infra

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer, what is your location (Country/Region)?

None

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

No response

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

No response

Please share a sample of the data

http://www.gdysd.cn/
http://www.gdysd.cn/h-col-103.html
http://www.gdysd.cn/h-col-104.html
http://www.gdysd.cn/h-col-105.html

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Daily

For how long do you plan to keep this dataset stored on Filecoin

More than 3 years

In which geographies do you plan on making storage deals

Asia other than Greater China, North America

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), Shipping hard drives

How did you find your storage providers

Partners

If you answered "Others" in the previous question, what is the tool or platform you used

No response

Please list the provider IDs and location of the storage providers you will be working with.

f03178077 US
f03178144 SG
f03178150 JP
f03178158 US
f03151449 CN
f03151456 CN
f03179555 SG
f03179570 SG
f03178077 JP
f03178144 JP

How do you plan to make deals to your storage providers

Boost client, Lotus client, Droplet client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

datacap-bot[bot] commented 2 weeks ago

Application is waiting for allocator review

willscott commented 2 weeks ago

  • The links provided are not to the data set
  • It is unclear what data specifically is proposed to be stored, or how
  • This application is meant for data that is already publicly available; "customer behavior data" is not usually of this type.

Zzbaoo commented 2 weeks ago

Samples of the dataset are as follows:
https://pan.baidu.com/s/1NjRFt8JO6GnDtIjLc8GFGA?pwd=587t
https://pan.baidu.com/s/1DZbJQUTpzj-T64MHZ_Mxmw?pwd=az56
https://pan.baidu.com/s/1Mm2R_nIm3tQZ8EFL3Xdtzg?pwd=xfb2
https://pan.baidu.com/s/1Qah8AA24XNWk37aTtdd8aA?pwd=h8p8
https://pan.baidu.com/s/1MOn4gklUjjMUIKmW6VF3Aw?pwd=b79g

This dataset sample consists of surveillance videos, including retail transactions, customer interactions, and sensor data collected through our digital monitoring services. It can be used for machine learning model training, intelligent analysis, public safety, and more.

The data is currently stored on our internal distributed system. We plan to migrate this data to the Filecoin network with the following steps:

  1. Data Export and Format Conversion: Export the raw video data from the internal system and convert it into a format suitable for storage on the Filecoin network.
  2. Data Encryption and Verification: Encrypt the data end to end so that it is secure during transmission. Before and after migration, hash verification will be used to confirm data integrity (a minimal checksum sketch follows this list).
  3. Upload to Filecoin: Upload the data to the Filecoin network using client tools such as Lotus or Boost. Once the upload is complete, anyone can access the data by retrieving its CID. All data will be stored on the Filecoin network in accordance with the Fil+ program's public data storage standards and will be openly accessible to the public.
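
For illustration, the hash-verification step in item 2 could look like the minimal sketch below. This is a sketch under assumptions: SHA-256 as the digest and the file paths are placeholders, not details taken from the application.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
)

// checksum streams a file through SHA-256 and returns the hex digest,
// so large video files are never loaded into memory at once.
func checksum(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	h := sha256.New()
	if _, err := io.Copy(h, f); err != nil {
		return "", err
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	// Hypothetical paths: the file exported from the internal system,
	// and the same file retrieved back from Filecoin after migration.
	before, err := checksum("export/camera-01.mp4")
	if err != nil {
		panic(err)
	}
	after, err := checksum("retrieved/camera-01.mp4")
	if err != nil {
		panic(err)
	}

	if before == after {
		fmt.Println("integrity verified:", before)
	} else {
		fmt.Println("MISMATCH: retrieved copy differs from the export")
	}
}
```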

[Partial screenshots: 3 images attached]

datacap-bot[bot] commented 1 week ago

KYC has been requested. Please complete KYC at https://kyc.allocator.tech/?owner=fidlabs&repo=Open-Data-Pathway&client=f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq&issue=74

kevzak commented 1 week ago

@Zzbaoo see above, we're asking you to complete a KYC humanity check. Let me know if you have any questions.

datacap-bot[bot] commented 1 week ago

KYC completed for client address f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq with Optimism address 0x420AdE7b30e18e4FE316954Cd7160D6DADD0014a and passport score 20.

kevzak commented 4 days ago

@Zzbaoo Thank you for completing KYC. Next step: please answer the questions about data preparation.

Regarding data preparation procedures, can you please have the data preparer clarify:

  • what specifically are the datasets from that site that you are committing to store? all of it, or a specified enumeration?
  • what is the transformation from the files available for download and what will be stored on filecoin?
  • how, when we sample your deals, will we be able to confirm that it has come from the dataset?
  • how the data is transformed into deals for filecoin. when a deal is sampled for verification, how will we be able to confirm that it is part of this dataset? (how is it chunked into car files?)

datacap-bot[bot] commented 4 days ago

Datacap Request Trigger

Total DataCap requested

6PiB

Expected weekly DataCap usage rate

500TiB

DataCap Amount - First Tranche

50TiB

Client address

f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq

datacap-bot[bot] commented 4 days ago

DataCap Allocation requested

Multisig Notary address

Client address

f1tefyfqqxps3rbh4qc3xezxvze7lsd6yinipnhxq

DataCap allocation requested

50TiB

Id

4a495807-277b-48ce-872f-0e2077bbee50

datacap-bot[bot] commented 4 days ago

Application is ready to sign

kevzak commented 4 days ago

Once you complete the data prep questions, let me know.

Also, one note: the maximum first allocation for a new user is 50 TiB, and after that we will support up to 5 PiB initially. If everything meets the guidelines, we will work toward the full amount requested.

Zzbaoo commented 4 days ago

@Zzbaoo Thank you for completing KYC. Next step: please answer the questions about data preparation.

Regarding data preparation procedures, can you please have the data preparer clarify:

  • what specifically are the datasets from that site that you are committing to store? all of it, or a specified enumeration?
  • what is the transformation from the files available for download and what will be stored on filecoin?
  • how, when we sample your deals, will we be able to confirm that it has come from the dataset?
  • how the data is transformed into deals for filecoin. when a deal is sampled for verification, how will we be able to confirm that it is part of this dataset? (how is it chunked into car files?)

Our answers:

  1. We commit to storing specific, enumerated datasets rather than all of the data. These datasets will be filtered and organized, clearly identifying which files are to be uploaded to Filecoin for storage.
  2. We will use go-graphsplit to process the data. The downloadable files will be split and packaged into CAR (Content Addressable Archive) files. CAR files are content-addressed: their hashes, i.e., CIDs, serve as unique identifiers for the data. The files themselves are not modified during packaging, so the original data is preserved intact. (A hedged example invocation follows this list.)
  3. We use CIDs (Content Identifiers) as unique identifiers for each file, which lets you verify whether stored data comes from a specific dataset. During sampling, you can check that the CIDs match the original dataset to confirm that the data in a deal indeed originates from the specified dataset.
  4. Before being stored on Filecoin, data is split into smaller chunks and packaged into CAR files, with each chunk receiving a unique CID derived from its content. Through these CIDs, you can trace the data blocks included in each transaction. During sampling validation, you can retrieve the corresponding CIDs to confirm whether the data blocks belong to this dataset, ensuring its integrity and consistency.
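
For concreteness, a go-graphsplit invocation along the lines of answer 2 might look like the sketch below. Everything here is illustrative: the paths, graph name, parallelism, and slice size are hypothetical, and the flag names follow the filedrive-team/go-graphsplit README, so they should be confirmed against `graphsplit chunk --help` for the installed version.

```shell
# Split the prepared video files into DAGs of at most --slice-size bytes
# (16 GiB here) and package each as a CAR file; graphsplit also writes a
# manifest.csv into --car-dir mapping payload CIDs to source file names.
./graphsplit chunk \
  --car-dir=/data/car-output \
  --slice-size=17179869184 \
  --parallel=4 \
  --graph-name=cloud-monitoring-sample \
  --calc-commp=true \
  --parent-path=/data/source-videos \
  /data/source-videos
```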

I would like to add some storage nodes: f03151449 (Shenzhen, China), f03151456 (Shenzhen, China), f03179555 (Singapore), f03179570 (Singapore), f03178077 (Tokyo, Japan), and f03178144 (Tokyo, Japan). I have updated these in the application form above.

willscott commented 4 days ago

Through these CIDs, you can trace the data blocks included in each transaction

How?

datacap-bot[bot] commented 4 days ago

Issue has been modified. Changes below:

(OLD vs NEW)

Please list the provider IDs and location of the storage providers you will be working with:
OLD: f03178077 US, f03178144 SG, f03178150 JP, f03178158 US, f03151449 CN, f03151456 CN, f03179555 SG, f03179570 SG, f03178077 JP, f03178144 JP
NEW: f03178077 US, f03178144 SG, f03178150 JP, f03178158 US

State:
OLD: Submitted
NEW: ReadyToSign

Zzbaoo commented 3 days ago

Through these CIDs, you can trace the data blocks included in each transaction

How?

This data can be retrieved over HTTP and with tools such as Lassie. We will build indexes for the data and ensure it meets the requirements for Spark and HTTP retrieval. Retrieved data can be restored with go-graphsplit, and the consistency of the restored data can be verified through checksums.
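
As an illustrative sketch of that retrieval path: the payload CID below is a placeholder, and the `lassie fetch` and `graphsplit restore` invocations follow those tools' READMEs, so the exact flags should be verified against the installed versions.

```shell
# Hypothetical payload CID, taken from the manifest.csv for the deal
PAYLOAD_CID=bafy...

# Retrieve the deal's payload as a CAR file via Lassie
lassie fetch -o retrieved.car "$PAYLOAD_CID"

# Restore the original files from the CAR with go-graphsplit
./graphsplit restore \
  --car-path=retrieved.car \
  --output-dir=/data/restored \
  --parallel=2

# Check the restored files against the published checksums
sha256sum /data/restored/*
```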

willscott commented 3 days ago

I download one of your deals. How do I know it's really part of your dataset? Will I be able to get individual files in each individual deal, and match them back to what's provided in Baidu Pan? Could it be the case that a deal would have only part of a file? How would I know what other deals I need to also get in order to reconstruct a file?

You haven't given me enough information to have any confidence yet that I can audit what you store.

Zzbaoo commented 3 days ago

I download one of your deals. How do I know it's really part of your dataset? Will I be able to get individual files in each individual deal, and match them back to what's provided in Baidu Pan? Could it be the case that a deal would have only part of a file? How would I know what other deals I need to also get in order to reconstruct a file?

You haven't given me enough information to have any confidence yet that I can audit what you store.

  1. There are several ways to determine whether the data you downloaded is part of our dataset. The simplest is to match checksums. Another is to match the payload CID, which addresses the data segments in the CAR file; the payload CID is the most reliable method because it directly references the content (see the sketch after this list).

  2. You raised a concern about needing to download all related deals to reconstruct a file from a large dataset that has been split. This can be challenging, since you may not know which deals contain the data needed to reconstruct the original file. We address it as follows: when we use go-graphsplit, we do not split the entire dataset as one unit. Instead, individual files are pre-organized into chunks of the size Filecoin requires. Because we are working with video stream data, it splits naturally this way, so you do not need to retrieve multiple deals to reconstruct a single file. In addition, graphsplit generates a manifest.csv that maps file names, payload CIDs, piece CIDs, and the internal file structure, so the consistency of stored files can be verified from the manifest and CIDs.

  3. Many people split files arbitrarily for convenience, which leaves incomplete files scattered across deals and complicates retrieval: you might need multiple deals to reconstruct a single file, and it is often unclear which deals hold the necessary pieces. That practice undermines the original intent of Filecoin Plus. In contrast, we organize the data in advance so that each file is stored as a complete unit within a single deal, ensuring that retrieved data consists of complete files rather than fragments.
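
As a minimal sketch of the payload-CID check in point 1, assuming the sampled block was encoded as a raw SHA-256 block (go-graphsplit actually builds UnixFS/dag-pb graphs, so non-leaf nodes differ); the input file name is hypothetical and the expected CID is supplied by the auditor.

```go
package main

import (
	"fmt"
	"os"

	"github.com/ipfs/go-cid"
	mh "github.com/multiformats/go-multihash"
)

func main() {
	// Expected CID passed on the command line, e.g. copied from manifest.csv.
	if len(os.Args) < 2 {
		panic("usage: verify <expected-cid>")
	}
	want, err := cid.Decode(os.Args[1])
	if err != nil {
		panic(err)
	}

	// Hypothetical input: one leaf block extracted from a retrieved CAR.
	data, err := os.ReadFile("block.bin")
	if err != nil {
		panic(err)
	}

	// Hash the block and wrap the digest as a CIDv1 with the raw codec,
	// the construction used for raw leaf blocks.
	sum, err := mh.Sum(data, mh.SHA2_256, -1)
	if err != nil {
		panic(err)
	}
	got := cid.NewCidV1(cid.Raw, sum)

	fmt.Println("computed CID:", got)
	fmt.Println("matches manifest entry:", got.Equals(want))
}
```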

willscott commented 3 days ago

You're just answering with ChatGPT. It's a lot of words, but it isn't answering my concern.


Zzbaoo commented 3 days ago

You're just answering with ChatGPT. It's a lot of words, but it isn't answering my concern.

Could you please be more specific about your concern? For example, are you worried that we will misuse the data, that we will not handle it properly, or that the duplication rate is too high?

martapiekarska commented 3 days ago

We are concerned that you are applying to our Open Public Data Allocator while the data you are storing is not open source. According to our allocator rules, the dataset needs to be public and available.

Zzbaoo commented 3 days ago

We are concerned that you are applying to our Open Public Data Allocator while the data you are storing is not open source. According to our allocator rules, the dataset needs to be public and available.

  1. We are working on an AI intelligent cloud monitoring project. The data comes from a cloud monitoring platform and captures public areas.
  2. The purpose of storing the monitoring information is for AI behavior training, such as detecting theft or scanning for buying and selling activities.
  3. We have the necessary documentation, including business certification, and all videos have been registered.