brightmart / text_classification

all kinds of text classification models and more with deep learning
MIT License

How to download the dataset? #96

Closed: Amanda2024 closed this issue 5 years ago

Amanda2024 commented 5 years ago

Could you please provide a download link for the datasets? Sometimes I can't understand the format of the data, and I would like to have a reference.

brightmart commented 5 years ago

Check this section of README.md:

Sample data: cached file

To help you run this repository, we have re-generated the training/validation/test data and the vocabulary/labels, and saved them as a cache file using h5py. We suggest you download it from the link above.

It contains everything you need to run this repository: the data is pre-processed, so you can start training the model within a minute.

It is a zip file of about 1.8 GB containing 3 million training examples. Although it is quite big after unzipping, with the help of HDF5 it needs only a normal amount of memory (e.g. 8 GB or less) during training.

We use the Jupyter notebook pre-processing.ipynb to pre-process the data. You can get a better understanding of this task and the data by taking a look at it. You can also generate the data yourself in whatever way you want by changing a few lines of code in this notebook.

If you want to try a model now, you can download the cached file from above, then go to the folder 'a02_TextCNN' and run python p7_TextCNN_train.py. It will use the data from the cached files to train the model, and print the loss and F1 score periodically.

Old sample data source: if you need some sample data and word embeddings pre-trained with word2vec, you can find them in closed issues, such as issue 3.

You can also find some sample data in the folder "data". It contains two files: 'sample_single_label.txt', which contains 50k examples with a single label, and 'sample_multiple_label.txt', which contains 20k examples with multiple labels. The input and the label are separated by " label".
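Since the question was about the data format, here is a minimal sketch of how one line of the sample files could be split into input tokens and labels. It is not code from the repository: parse_line and read_samples are hypothetical helper names, the separator defaults to " label" as written above (the exact token in the actual files may differ, so it is a parameter), and it assumes that tokens and multiple labels are whitespace-separated.

```python
from typing import Iterator, List, Tuple


def parse_line(line: str, separator: str = " label") -> Tuple[List[str], List[str]]:
    """Split one line into (input tokens, labels).

    Assumption: everything before the last occurrence of `separator` is the
    input text, and everything after it is one or more whitespace-separated
    labels.
    """
    text, sep, label_part = line.rstrip("\n").rpartition(separator)
    if not sep:
        # rpartition returns an empty separator when nothing was found.
        raise ValueError(f"separator {separator!r} not found in line: {line!r}")
    return text.split(), label_part.split()


def read_samples(path: str, separator: str = " label") -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (tokens, labels) pairs from a file, one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield parse_line(line, separator)


if __name__ == "__main__":
    # Made-up lines for illustration only, not taken from the real files.
    print(parse_line("w1 w2 w3 label 42"))
    print(parse_line("w1 w2 label 3 5"))
```

Reading line by line keeps memory use low even for the larger file. If the real files use a different separator token, pass it as the separator argument instead of changing the function.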

If you want to know more about text classification datasets or the tasks these models can be used for, one option is below: