filecoin-project / filecoin-plus-large-datasets

Hub for client applications for DataCap at a large scale

[DataCap Application] <LaughStorage> - <Experimental Research of sentiment analysis in Turkish OpenData> #2080

Closed 26dos closed 1 year ago

26dos commented 1 year ago

Data Owner Name

LaughStorage

What is your role related to the dataset

Data Preparer

Data Owner Country/Region

China

Data Owner Industry

IT & Technology Services

Website

https://github.com/FurkanGozukara/Sentiment-Analysis

Social Media

na

Total amount of DataCap being requested

15PiB

Expected size of single dataset (one copy)

2PiB

Number of replicas to store

8

Weekly allocation of DataCap requested

1PiB

On-chain address for first allocation

f1mkfdzrhzrjecnxx2slfvoubcf4w7ug2i2jqpmza

Data Type of Application

Public, Open Dataset (Research/Non-Profit)

Custom multisig

Identifier

No response

Share a brief history of your project and organization

I joined the Filecoin network in 2021 and sealed CC (committed-capacity) sectors of FIL at the very beginning. In 2022, we converted part of our CC sectors to DC (DataCap) deals, and we now have a planned, ongoing program of data storage. In 2023, I established a technical service company, LaughStorage, to invest deeply in the distributed storage track.

Is this project associated with other projects/ecosystem stakeholders?

Yes

If answered yes, what are the other projects/ecosystem stakeholders

Some old SPs

Describe the data being stored onto Filecoin

A repository of datasets used in sentiment analysis research studies: datasets for sentiment analysis and for text classification.
Based on Turkish local networks, movies, and products, we record the corresponding emotional feedback, then record and analyze it. The dataset contains the original material, the messages (reviews) corresponding to that material, and the analysis records.
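As a rough illustration of the record structure described above (original material, its review text, and a sentiment label), here is a minimal loader sketch. The file layout it assumes — one review per line, with the label implied by the file name, as in the repository's `*_Processed.txt` samples — is an assumption, not something confirmed in this application:

```python
# Hypothetical sketch: assumes one review per line, with the sentiment
# label implied by the file name (e.g. Test_Set_Negative_Processed.txt).
def load_reviews(path: str, label: str) -> list[tuple[str, str]]:
    """Pair each non-empty review line with its sentiment label."""
    with open(path, encoding="utf-8") as f:
        return [(line.strip(), label) for line in f if line.strip()]
```

A loader like this would let the prepared `(review, label)` pairs be matched back to the original source material before packing into deals.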

Where was the data currently stored in this dataset sourced from

My Own Storage Infra

If you answered "Other" in the previous question, enter the details here

No response

How do you plan to prepare the dataset

IPFS, lotus

If you answered "other/custom tool" in the previous question, enter the details here

No response

Please share a sample of the data

https://github.com/FurkanGozukara/Sentiment-Analysis/blob/master/Turkish_Product_Reviews_by_Demirtas_and_Pechenizkiy_2013/Books%20%2B%20DVD%20%2B%20Electronics%20%2B%20Kitchen/Cross_Validation/test_1/Test_Set_For_Weka.arff
https://github.com/FurkanGozukara/Sentiment-Analysis/blob/master/Turkish_Product_Reviews_by_Demirtas_and_Pechenizkiy_2013/Books%20%2B%20DVD%20%2B%20Electronics%20%2B%20Kitchen/Cross_Validation/test_1/Test_Set_Negative_Processed.txt
https://github.com/FurkanGozukara/Sentiment-Analysis/tree/master/Turkish_Product_Reviews_by_Demirtas_and_Pechenizkiy_2013/Books%20%2B%20DVD%20%2B%20Electronics%20%2B%20Kitchen/Cross_Validation/test_10
https://github.com/FurkanGozukara/Sentiment-Analysis/tree/master/Turkish_Product_Reviews_by_Demirtas_and_Pechenizkiy_2013/Books%20%2B%20DVD%20%2B%20Electronics%20%2B%20Kitchen/Cross_Validation/test_2
https://github.com/FurkanGozukara/Sentiment-Analysis/tree/master/English_Product_Reviews_by_Blitzer_et_al_2007/Books%20%2B%20DVD%20%2B%20Electronics%20%2B%20Kitchen

Confirm that this is a public dataset that can be retrieved by anyone on the Network

If you chose not to confirm, what was the reason

No response

What is the expected retrieval frequency for this data

Sporadic

For how long do you plan to keep this dataset stored on Filecoin

1.5 to 2 years

In which geographies do you plan on making storage deals

Greater China, Asia other than Greater China, North America, Europe

How will you be distributing your data to storage providers

Cloud storage (i.e. S3), HTTP or FTP server, IPFS, Shipping hard drives, Lotus built-in data transfer

How do you plan to choose storage providers

Partners

If you answered "Others" in the previous question, what is the tool or platform you plan to use

No response

If you already have a list of storage providers to work with, fill out their names and provider IDs below

f02148382, f02115125, f01043193, f02199393, and more are connecting.
Selection is actually based on their collateral.

How do you plan to make deals to your storage providers

Boost client

If you answered "Others/custom tool" in the previous question, enter the details here

No response

Can you confirm that you will follow the Fil+ guideline

Yes

large-datacap-requests[bot] commented 1 year ago

Thanks for your request! Everything looks good. :ok_hand:

A Governance Team member will review the information provided and contact you back pretty soon.

herrehesse commented 1 year ago

I would like to see evidence on the 2 PiB expected size of single dataset (one copy), can you help me with that?

26dos commented 1 year ago

Sure. As of now, about 600 TiB of .tgz files have been downloaded; after processing there will be at least 1.7 PiB. In addition, coordinating files are still being downloaded. Looking forward to your support. Thank you.

(Attached screenshots: 微信图片_20230630140019, 微信图片_20230630140024, 微信图片_20230630135956, 微信图片_20230630140011)

herrehesse commented 1 year ago

@26dos Thank you for your insights, let me take a look at those folders and come back to you as soon as possible.

herrehesse commented 1 year ago

@26dos I can find only .txt files in the GitHub repository, and by my best analysis the total size of the folders and their files is 1.37 GB. The largest folder is the "data" folder, which contains the three CSV files. The smallest folder is the "notebooks" folder, which contains only the "Instructions.txt" file.

Can you link me to the locations of the "1.7 PiB" of files and their contents?

26dos commented 1 year ago

All the content in this GitHub repo is kept locally by us. This dataset studies sentiment analysis of reviews of movies and online products in the Turkish region. As I mentioned in the data description, for all the original material referenced in the source files, we found the corresponding material on the Internet using an automatic matching script, and then matched the original material with its corresponding comments and sentiment analysis conclusions. What you see is the corresponding original material that we found. The scripts may not be shared with everyone; thank you for understanding.

cryptowhizzard commented 1 year ago

Hi @26dos

I have a question here.

In the dataset description you mention that the data owner is in China, yet "This dataset aims to study the sentiment analysis of the reviews of movies and online products in the Turkish region."

What exactly are you going to store, then? Movies and other such materials are copyright-protected.

26dos commented 1 year ago

I'm the data preparer, not the data owner. I chose this dataset because the research looked interesting, and I think it is valuable to preserve the relevant content. The downloaded materials are all public data. Could anyone else take a look at my application? @Sunnyiscoming

cryptowhizzard commented 1 year ago

Do I understand correctly, then, that you will download movies, run analysis on them, and then store the analyzed results on the Filecoin network? So, as data preparer, you are also doing the scientific work?

26dos commented 1 year ago

Incorrect. The platform's data is already analytical data for content such as movies, but we don't think analytical data alone is particularly valuable. We build on it by matching publicly available movie resources with the analysis the platform has already done, and storing both to IPFS.

Sunnyiscoming commented 1 year ago

Best practice for storing large datasets is, ideally, to store them in 3 or more regions with 4 or more storage provider operators or owners. You should list the Miner ID, Business Entity, and Location of the SPs you will cooperate with.

herrehesse commented 1 year ago

@26dos Storing "movies" on Filecoin using the Filecoin+ program multiplier is not allowed.

cryptowhizzard commented 1 year ago

> Incorrect. The platform's data is already analytical data for content such as movies, but we don't think analytical data alone is particularly valuable. We build on it by matching publicly available movie resources with the analysis the platform has already done, and storing both to IPFS.

Hi,

Ok, thanks for explaining. I am fine with the analytical data, but movies are copyright-protected. I don't think you can store that material on Filecoin.