This repository is my research project, and it is also a study of TensorFlow and deep learning (FastText, CNN, LSTM, etc.).
The main objective of the project is to solve the multi-label text classification problem based on deep neural networks. Thus, in line with the characteristics of such a problem, each data label takes the form of a multi-hot vector like [0, 1, 0, ..., 1, 1].
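As a minimal sketch of this label encoding (the helper name and class count are illustrative, not from the repository), a list of label indices can be turned into the multi-hot form shown above:

```python
def to_multi_hot(labels_index, num_classes):
    """Convert a list of label indices into a multi-hot vector of length num_classes."""
    vec = [0] * num_classes
    for i in labels_index:
        vec[i] = 1
    return vec

# Three active labels out of six classes.
print(to_multi_hot([1, 3, 5], 6))  # → [0, 1, 0, 1, 0, 1]
```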
The project structure is below:
.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt
The repository also includes several practical features.

Data part:
- Supports both English and Chinese text (you can use `nltk` or `jieba` for word segmentation).
- Supports custom pre-trained word vectors (e.g. via `gensim`).
- Supports embedding visualization on TensorBoard (you need to create `metadata.tsv` first).

Code part:
- `train.py` trains the model (or restores it from a checkpoint), and `test.py` evaluates it on the test set.
- `data_helpers.py` collects the data preprocessing functions.
- `logging` is used for helping to record the whole info (including parameters display, model training info, etc.).
- `checkmate.py` saves the checkpoints with the best performance, whereas `tf.train.Saver` can only save the last n checkpoints.

See the data format in the `/data` folder, which includes the data sample files. For example:
{"testid": "3935745", "features_content": ["pore", "water", "pressure", "metering", "device", "incorporating", "pressure", "meter", "force", "meter", "influenced", "pressure", "meter", "device", "includes", "power", "member", "arranged", "control", "pressure", "exerted", "pressure", "meter", "force", "meter", "applying", "overriding", "force", "pressure", "meter", "stop", "influence", "force", "meter", "removing", "overriding", "force", "pressure", "meter", "influence", "force", "meter", "resumed"], "labels_index": [526, 534, 411], "labels_num": 3}
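A small loader for this format can be sketched as follows, assuming one JSON record per line as in the example above (the helper name is illustrative, not repository code):

```python
import json

def load_samples(path):
    """Read a sample file with one JSON record per line and sanity-check each record."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            # labels_num should match the number of entries in labels_index.
            assert record["labels_num"] == len(record["labels_index"])
            records.append(record)
    return records
```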
- You can use the `nltk` package if you are going to deal with the English text data.
- You can use the `jieba` package if you are going to deal with the Chinese text data.
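The two options above can be wrapped in a single helper. This is a sketch, not repository code: it assumes `nltk` (with its `punkt` tokenizer data) and `jieba` are installed, and falls back to a crude split otherwise:

```python
def tokenize(text, lang="en"):
    """Tokenize text for English (nltk) or Chinese (jieba), with crude fallbacks."""
    if lang == "zh":
        try:
            import jieba  # Chinese word segmentation
            return list(jieba.cut(text))
        except ImportError:
            return list(text)  # fallback: character-level segmentation
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(text)  # needs the nltk 'punkt' data
    except (ImportError, LookupError):
        return text.split()  # fallback: whitespace split

print(tokenize("pore water pressure"))  # → ['pore', 'water', 'pressure']
```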
This repository can be used with other (text classification) datasets in two ways:

1. Convert your dataset into the same format as the sample data.
2. Modify the data preprocessing code in `data_helpers.py` to fit your own format.

Anyway, it should depend on what your data and task are.
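If you convert your own dataset into the sample format, each record only needs the four fields shown above. A sketch (helper names and field values are invented for illustration):

```python
import json

def make_record(testid, tokens, labels_index):
    """Build one record in the repository's sample data format."""
    return {
        "testid": str(testid),
        "features_content": tokens,
        "labels_index": labels_index,
        "labels_num": len(labels_index),
    }

def write_dataset(records, path):
    """Write records as one JSON object per line, like the sample files."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")
```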
Before you open a new issue about the data format, please check `data_sample.json` and read the other open issues first, because someone may have asked the same question already. 🤔
You can download the Word2vec model file (dim = 100). Make sure the files are unzipped and placed under the `/data` folder.
You can pre-train your own word vectors (based on your corpus) in several ways:

1. Use the `gensim` package to pre-train word vectors.
2. Use the `glove` tools to pre-train word vectors.

See Usage.
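A sketch of pre-training 100-dim vectors with the `gensim` package, assuming it is installed (the keyword for the dimension differs between gensim 3.x, `size`, and 4.x, `vector_size`, so both are tried; the function name is illustrative):

```python
def train_word2vec(corpus, dim=100, path=None):
    """Pre-train word vectors on a tokenized corpus with gensim's Word2Vec."""
    from gensim.models import Word2Vec
    try:
        model = Word2Vec(sentences=corpus, vector_size=dim, window=5, min_count=1)  # gensim 4.x
    except TypeError:
        model = Word2Vec(sentences=corpus, size=dim, window=5, min_count=1)  # gensim 3.x
    if path:
        model.save(path)  # produces files like word2vec_100.model.*
    return model
```

The corpus is a list of token lists, e.g. the `features_content` fields of your records.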
黄威 (Huang Wei), Randolph
SCU SE Bachelor; USTC CS Ph.D.
Email: chinawolfman@hotmail.com
My Blog: randolph.pro
LinkedIn: randolph's linkedin