Closed Amanda2024 closed 5 years ago
Check this section in README.md:
Sample data: cached file
To help you run this repository, we currently re-generate the training/validation/test data and the vocabulary/labels, and save
them as a cache file using h5py. We suggest you download it from the link above.
It contains everything you need to run this repository: the data is pre-processed, so you can start training the model within a minute.
It is a zip file of about 1.8 GB containing 3 million training examples. Although it is quite big after unzipping, with the help of
HDF5 it needs only a normal amount of memory (e.g. 8 GB or less) during training.
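As a rough illustration of why an HDF5 cache keeps memory use low, the sketch below reads a file in slices with h5py so that only one batch is in memory at a time. The dataset names `train_X` and `train_Y`, the file name `demo_cache.h5`, and the helper `iter_batches` are hypothetical; the real cache file may be laid out differently, so inspect its keys before relying on any of these names.

```python
import os
import tempfile

import h5py
import numpy as np


def iter_batches(path, x_name, y_name, batch_size):
    """Yield (inputs, labels) batches, reading one slice from disk at a time."""
    with h5py.File(path, "r") as f:
        x, y = f[x_name], f[y_name]  # dataset handles; nothing is loaded yet
        n = x.shape[0]
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            # Slicing an h5py dataset reads only these rows into a NumPy array.
            yield x[start:end], y[start:end]


# Build a tiny stand-in file so the sketch runs on its own (hypothetical layout).
demo_path = os.path.join(tempfile.mkdtemp(), "demo_cache.h5")
with h5py.File(demo_path, "w") as f:
    f.create_dataset("train_X", data=np.arange(50).reshape(10, 5))
    f.create_dataset("train_Y", data=np.arange(10))

batch_shapes = [
    (bx.shape, by.shape)
    for bx, by in iter_batches(demo_path, "train_X", "train_Y", batch_size=4)
]
print(batch_shapes)
```

To see what the downloaded cache actually contains, opening it with `h5py.File(path, "r")` and listing `f.keys()` is a reasonable first step.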
We use the Jupyter notebook pre-processing.ipynb to pre-process the data. You can get a better understanding of this task and
the data by taking a look at it. You can also generate the data yourself in whatever way you want by changing a few lines of code
in this notebook.
If you want to try a model now, you can download the cached file from above, then go to the folder 'a02_TextCNN' and run
python p7_TextCNN_train.py. It will use the data from the cached file to train the model, and print the loss and F1 score periodically.
Old sample data source: if you need some sample data and word embeddings pre-trained with word2vec, you can find them in the closed issues, such as issue 3.
You can also find some sample data in the folder "data". It contains two files: 'sample_single_label.txt', which contains 50k examples
with a single label, and 'sample_multiple_label.txt', which contains 20k examples with multiple labels. The input and the label are separated by " label".
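Since the question is about the data format, here is a minimal sketch of how one line of such a file could be split into input text and labels. It assumes each label is introduced by the separator quoted above (" label"), and that multi-label lines simply repeat it; the example lines and the helper `parse_line` are made up for illustration. The actual files may use a slightly different separator token, so check a few lines and pass the right `separator`.

```python
from typing import List, Tuple


def parse_line(line: str, separator: str = " label") -> Tuple[str, List[str]]:
    """Split one line into (input_text, labels).

    Everything before the first separator is treated as the input text;
    each following piece is treated as one label.
    """
    text, *raw_labels = line.rstrip("\n").split(separator)
    labels = [label.strip() for label in raw_labels if label.strip()]
    return text.strip(), labels


# Hypothetical lines, not taken from the real sample files.
single = parse_line("single example label 3\n")
multiple = parse_line("how to train a model label 12 label 7")
print(single)
print(multiple)
```

Note that this naive split would also break on an input that happens to contain the separator as ordinary text, which is one reason to prefer a distinctive separator token.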
If you want to know more about text classification data sets or the tasks these models can be used for, one option is below:
Could you please provide a download address for the datasets? Sometimes I can't understand the format of the data... I would like to have a reference.