We want to contribute knowledge and source code toward building an active Vietnamese Natural Language Processing community. Beginners do not have to rebuild basic components: they can use this code directly in their own systems, or as a baseline for research on improving performance.
Function: convert text into tokens.
Example 1: using the dictionary-based longest-matching tokenizer
from tokenization.dict_models import LongMatchingTokenizer
lm_tokenizer = LongMatchingTokenizer()
tokens = lm_tokenizer.tokenize("Thuế thu nhập cá nhân")
Output will be an array: ['Thuế_thu_nhập', 'cá_nhân']
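If you need the tokenized sentence back as a single string (for example as training text for the word-embedding step below), you can simply join the returned tokens. A minimal sketch using only the tokenize call shown above; the second sentence is an illustration only:
from tokenization.dict_models import LongMatchingTokenizer

lm_tokenizer = LongMatchingTokenizer()
sentences = ["Thuế thu nhập cá nhân", "Thuế thu nhập doanh nghiệp"]  # second sentence is illustrative only
for sent in sentences:
    tokens = lm_tokenizer.tokenize(sent)
    # Multi-syllable words are already glued with "_"; join them back into one whitespace-separated line
    print(" ".join(tokens))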
Example 2: using the CRF-based tokenizer
from tokenization.crf_tokenizer import CrfTokenizer
crf_tokenizer_obj = CrfTokenizer()
crf_tokenizer_obj.train('data/tokenized/samples/training')
# Note: if you have already trained the model, set the correct model path and skip training, e.g.:
# crf_tokenizer_obj = CrfTokenizer(model_path='models/pretrained_tokenizer.crfsuite')
test_sent = "Thuế thu nhập cá nhân"
tokenized_sent = crf_tokenizer_obj.get_tokenized(test_sent)
print(tokenized_sent)
Output will be an array: ['Thuế_thu_nhập', 'cá_nhân']
Download data from websites such as Wikipedia or Vietnamese news sites. You can use any method; here we propose one:
from word_embedding.utils import download_html
url_path = "https://dantri.com.vn/su-kien/anh-huong-bao....htm"
output_path = "data/word_embedding/real/html/html_data.txt"
download_html(url_path, output_path, should_clean=True)
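To build a larger corpus you can call download_html once per page. A minimal sketch using the same arguments as above; the URLs are placeholders that you should replace with real pages:
import os
from word_embedding.utils import download_html

urls = [
    "https://dantri.com.vn/page-1.htm",  # placeholder URL
    "https://dantri.com.vn/page-2.htm",  # placeholder URL
]
output_dir = "data/word_embedding/real/html"
os.makedirs(output_dir, exist_ok=True)
for i, url in enumerate(urls):
    # One output file per downloaded page
    download_html(url, os.path.join(output_dir, "html_data_{}.txt".format(i)), should_clean=True)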
Clean and tokenize the text in the downloaded files. We provide a method that cleans (and optionally tokenizes) every file in a directory:
input_dir = 'data/word_embedding/real/html'
output_dir = 'data/word_embedding/real/training'
from tokenization.crf_tokenizer import CrfTokenizer
from word_embedding.utils import clean_files_from_dir
crf_config_root_path = "tokenization/"
crf_model_path = "models/pretrained_tokenizer.crfsuite"
tokenizer = CrfTokenizer(config_root_path=crf_config_root_path, model_path=crf_model_path)
clean_files_from_dir(input_dir, output_dir, should_tokenize=True, tokenizer=tokenizer)
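A quick sanity check is to print the first few lines of one cleaned file; this uses only the standard library and assumes the directory above now contains output files:
import glob

cleaned_files = glob.glob("data/word_embedding/real/training/*")
with open(cleaned_files[0], encoding="utf-8") as f:
    for _ in range(3):
        print(f.readline().strip())  # cleaned, tokenized text with words joined by "_"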
Training
from word_embedding.word2vec_gensim import train
data_path="data/word_embedding/samples/training"
model_path="models/word2vec.model"
train(data_path=data_path, model_path=model_path)
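The same call works on your own corpus; a sketch assuming you produced the cleaned directory in the previous step (the output model path here is just a suggestion):
from word_embedding.word2vec_gensim import train

data_path = "data/word_embedding/real/training"  # directory produced by clean_files_from_dir above
model_path = "models/my_word2vec.model"          # hypothetical output path
train(data_path=data_path, model_path=model_path)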
Test
from word_embedding.word2vec_gensim import test
model_path = "models/word2vec.model"
test_word = "thu_nhập"
test(model_path=model_path, word=test_word)
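Because training goes through gensim (see word2vec_gensim), the saved model can also be inspected with gensim directly. A sketch assuming models/word2vec.model is a standard gensim Word2Vec file:
from gensim.models import Word2Vec

model = Word2Vec.load("models/word2vec.model")
# Nearest neighbours of a tokenized word (multi-syllable words are joined with "_")
print(model.wv.most_similar("thu_nhập", topn=5))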
Use a pre-trained word2vec model
model_path = "models/pretrained_word2vec.bin"
Other links for pre-trained word2vec models: https://github.com/sonvx/word2vecVN https://github.com/Kyubyong/wordvectors
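Externally published models are often distributed in the standard word2vec binary format rather than as a gensim .model file. If that is the case for the model you download, gensim's KeyedVectors loader is the usual way to read it; this is a sketch, so check the format documented by each repository:
from gensim.models import KeyedVectors

# Assumes the downloaded file is in the C word2vec binary format
word_vectors = KeyedVectors.load_word2vec_format("models/pretrained_word2vec.bin", binary=True)
print(word_vectors.most_similar("thu_nhập", topn=5))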
Text classification example: train a short-text sentiment classifier on positive/negative samples and classify new sentences.
from tokenization.crf_tokenizer import CrfTokenizer
from word_embedding.word2vec_gensim import Word2Vec
from text_classification.short_text_classifiers import BiDirectionalLSTMClassifier, load_synonym_dict
# Please give the correct paths
word2vec_model = Word2Vec.load('models/pretrained_word2vec.bin')
tokenizer = CrfTokenizer(config_root_path='tokenization/',
model_path='models/pretrained_tokenizer.crfsuite')
sym_dict = load_synonym_dict('data/sentiment/synonym.txt')
keras_text_classifier = BiDirectionalLSTMClassifier(tokenizer=tokenizer, word2vec=word2vec_model.wv,
model_path='models/sentiment_model.h5',
max_length=10, n_epochs=10,
sym_dict=sym_dict)
X, y = keras_text_classifier.load_data(['data/sentiment/samples/positive.txt',
'data/sentiment/samples/negative.txt'],
load_method=keras_text_classifier.load_data_from_file)
keras_text_classifier.train(X, y)
label_dict = {0: 'tích cực', 1: 'tiêu cực'}  # 0: 'positive', 1: 'negative'
test_sentences = ['Dở thế', 'Hay thế', 'phim chán thật', 'nhảm quá']  # "So bad", "So good", "really boring movie", "such nonsense"
labels = keras_text_classifier.classify(test_sentences, label_dict=label_dict)
print(labels)  # Output: ['tiêu cực', 'tích cực', 'tiêu cực', 'tiêu cực'], i.e. ['negative', 'positive', 'negative', 'negative']
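classify returns one label per input sentence, so the result can be handled as an ordinary Python list, for example to pair predictions with their sentences and count the label distribution:
from collections import Counter

for sentence, label in zip(test_sentences, labels):
    print("{} -> {}".format(sentence, label))
print(Counter(labels))  # overall label distribution for the batch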
Requirements
Please check requirements.txt.
For the tokenization modules: python-crfsuite / sklearn-crfsuite (see the note below).
For word embedding: gensim.
For text classification: Keras.
FYI: sklearn-crfsuite is a wrapper around python-crfsuite; it offers a scikit-learn-style API with grid search for hyper-parameter optimization.
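As an illustration of that point (not part of this repository), an sklearn-crfsuite CRF estimator can be plugged into scikit-learn's model-selection tools. This is only a sketch, and depending on your scikit-learn version the integration may need adjustments:
import sklearn_crfsuite
from sklearn.model_selection import GridSearchCV

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
param_grid = {"c1": [0.01, 0.1], "c2": [0.01, 0.1]}  # L1/L2 regularization strengths
search = GridSearchCV(crf, param_grid, cv=3)
# search.fit(X_train, y_train)  # X_train: list of feature-dict sequences, y_train: list of label sequences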