filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] <RongYin> - <AI Tools-NLP> - NEW #2169

Closed datalove2 closed 1 year ago

datalove2 commented 1 year ago

Data Owner Name

RongYin

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

China

Data Owner Industry

IT & Technology Services

Website

https://www.qcc.com/firm/3380acbb3101bd58394d1ba4be51e877.html

Social Media

https://www.qcc.com/firm/3380acbb3101bd58394d1ba4be51e877.html

Total amount of DataCap being requested

7PiB

Expected size of single dataset (one copy)

1.5PiB

Number of replicas to store

10

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f15waefbgzgzjq2wlb3cqcttgfsmkldfrctiwf2jq

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

RongYin was established in 2019 in Hong Kong. We have been provided with a total storage capacity of 150 PiB. We are now planning to onboard humanity data that is useful for the network. The <RongYin Open Data Project> has successfully onboarded 10 PiB of storage capacity to the network, which corresponds to about 1.5 PiB of raw data. For the next steps, we have prepared 3 PiB of raw data with 10x backups.

Is this project associated with other projects/ecosystem stakeholders?

No

If answered yes, what are the other projects/ecosystem stakeholders

No response

Describe the data being stored onto Filecoin

We are going to onboard open <natural language processing> data from AWS. Natural language processing (NLP) is a branch of artificial intelligence (AI) that enables computers to comprehend, generate, and manipulate human language, and makes it possible to interrogate data with natural-language text or voice.
The natural language processing category covers 68 matching datasets, about 1.7 PiB in total.
These include Common Crawl, Sudachi Language Resources, Japanese Tokenizer Dictionaries, MIMIC-III ('Medical Information Mart for Intensive Care'), Common Screens, Discrete Reasoning Over the content of Paragraphs (DROP), the End of Term Web Archive Dataset, the MultiCoNER Datasets, etc.

Where was the data currently stored in this dataset sourced from

AWS Cloud

If you answered "Other" in the previous question, enter the details here

No response

If you are a data preparer, what is your location (Country/Region)

Hong Kong

If you are a data preparer, how will the data be prepared? Please include tooling used and technical details?

IPFS, lotus, singularity, graphsplit
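The tools listed above package raw data into CAR files for deal-making. For capacity planning with these tools, one relevant detail is Filecoin's Fr32 padding: payload bytes expand by a factor of 128/127, and piece sizes must be powers of two. A minimal Python sketch of estimating the padded piece size for a prepared CAR file (the helper name is hypothetical, not part of any listed tool):

```python
import math

def padded_piece_size(car_bytes: int) -> int:
    """Smallest power-of-two piece size whose Fr32-padded
    capacity (127/128 of the piece) holds the CAR payload."""
    # Fr32 padding expands the payload by 128/127.
    padded = math.ceil(car_bytes * 128 / 127)
    # Piece sizes on Filecoin are powers of two.
    return 1 << (padded - 1).bit_length()

# e.g. a ~30 GiB CAR fits in a 32 GiB piece
print(padded_piece_size(30 * 2**30))  # 34359738368 (32 GiB)
```

This is only a sizing estimate; the real piece CID (CommP) is computed by the preparation tooling itself over the padded data.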

If you are not preparing the data, who will prepare the data? (Provide name and business)

No response

Has this dataset been stored on the Filecoin network before? If so, please explain and make the case why you would like to store this dataset again to the network. Provide details on preparation and/or SP distribution.

No

Please share a sample of the data

https://registry.opendata.aws/eot-web-archive/
https://registry.opendata.aws/allenai-quoref/
https://registry.opendata.aws/comonscreens/
https://registry.opendata.aws/allenai-drop/
https://registry.opendata.aws/paracrawl/
https://registry.opendata.aws/allenai-quoref/
https://registry.opendata.aws/mmid/
...

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Yearly

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, IPFS, Shipping hard drives

How do you plan to choose storage providers

Slack, Big Data Exchange, Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

I will fill in the information disclosure form in detail

How do you plan to make deals to your storage providers

Boost client, Lotus client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

herrehesse commented 1 year ago

Based on the discussion at this link and the decisions from T&T calls a few months ago, we do not permit merged datacap requests.

@datalove2, you're welcome to request these datasets individually. However, given that these datasets have been stored on the Filecoin network several times already, it might be challenging to garner support.

We encourage you to contribute meaningfully to the Filecoin network instead of solely seeking to acquire datacap.

Sunnyiscoming commented 1 year ago
cryptowhizzard commented 1 year ago

I am not satisfied with the explanation given in the last LDN by this client here:

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2050#issuecomment-1689352475

Although that LDN is closed, the applicant blames his SP for the wrongdoing. This is a fairytale.

We all know that the data preparer sends out the deal, and the unique deal ID can only match the data preparer's original content. That is the whole point of Filecoin: immutability.

I propose closing this application, and any new ones, until a proper explanation is given. Fraud harms the community, and this behaviour must be stopped.

Sunnyiscoming commented 1 year ago

Any update here?

Sunnyiscoming commented 1 year ago

Any update here?

datalove2 commented 1 year ago

@Sunnyiscoming RG mentioned that merged datasets can no longer be submitted as a project. However, the AI project is a distinct category and does not fall under the dataset-merging scope. Additionally, we closed the original application and re-applied based on comments from the DCENT team during the Notary Node meeting. Despite our responses, they continue to press their queries. The data we store comes from publicly available datasets, and storing it is not meaningless for the network. Furthermore, we have resubmitted the certification form for your review.

Sunnyiscoming commented 1 year ago

Please list the information of your SPs here.

datalove2 commented 1 year ago

@Sunnyiscoming 7E51352E-6F6D-43D6-A3D1-4A81064CC77D

cryptowhizzard commented 1 year ago

I am not satisfied with the explanation given in the last LDN by this client here:

#2050 (comment)

Although that LDN is closed, the applicant blames his SP for the wrongdoing. This is a fairytale.

We all know that the data preparer sends out the deal, and the unique deal ID can only match the data preparer's original content. That is the whole point of Filecoin: immutability.

I propose closing this application, and any new ones, until a proper explanation is given. Fraud harms the community, and this behaviour must be stopped.

Reminder.

Sunnyiscoming commented 1 year ago

https://github.com/filecoin-project/filecoin-plus-large-datasets/issues/2169#issuecomment-1705020103 Please explain why the AI project is a distinct category that does not fall under the dataset-merging scope. You may need to provide a more convincing explanation for the previous problem.

Sunnyiscoming commented 1 year ago

Any update here?

Sunnyiscoming commented 1 year ago

It will be closed in 3 days if there is no reply here.

datalove2 commented 1 year ago

@Sunnyiscoming I apologize for the delayed response.

The community discourages applications for "merged datasets" because combining large amounts of unrelated data into a single large file can result in messy, difficult-to-use uploads that offer limited value. However, what we are applying for is a single category of data, labeled "natural language processing", which falls under one category and serves one purpose.

Sunnyiscoming commented 1 year ago

I cannot support your application. There is a clear rule in the community. Perhaps you can apply for other open datasets.

github-actions[bot] commented 1 year ago

This application has not seen any responses in the last 10 days. This issue will be marked with the Stale label and will be closed in 4 days. Comment if you want to keep this application open.

-- Commented by Stale Bot.