davidkorea / MEACHINE_LEARNING


Word2Vector, CountVectorizer, TfidfTransformer #7

Open davidkorea opened 6 years ago

davidkorea commented 6 years ago

Word2vector

0. Preparation

  1. raw text
    text = """
           稀疏矩阵是由大部分为零的矩阵组成的矩阵,
           这是和稠密矩阵有所区别的主要特点。
           """
    # each sentence must end with a newline (\n), otherwise split() cannot separate the rows
  2. split text to list
    sentence_list = text.split()
    # ['稀疏矩阵是由大部分为零的矩阵组成的矩阵,', '这是和稠密矩阵有所区别的主要特点。']
  3. jieba cut

    corpus_list = [jieba.lcut(i) for i in sentence_list]
    # [['稀疏', '矩阵', '是', '由', '大部分', '为', '零', '的', '矩阵', '组成', '的', '矩阵', ','],
    #  ['这', '是', '和', '稠密', '矩阵', '有所区别', '的', '主要', '特点', '。']]
    
    document = [' '.join(i) for i in corpus_list]
    # ['稀疏 矩阵 是 由 大部分 为 零 的 矩阵 组成 的 矩阵 ,', '这 是 和 稠密 矩阵 有所区别 的 主要 特点 。']
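The preparation steps above can be sketched end to end. jieba is a third-party tokenizer, so a naive per-character cut stands in for jieba.lcut here to keep the sketch self-contained; swap jieba.lcut(i) back in for real word segmentation.

```python
# Sketch of the preparation pipeline: raw text -> sentence list -> token
# lists -> space-joined documents. list(i) is a stand-in for jieba.lcut(i).
text = """
       稀疏矩阵是由大部分为零的矩阵组成的矩阵,
       这是和稠密矩阵有所区别的主要特点。
       """
sentence_list = text.split()                      # one entry per line
corpus_list = [list(i) for i in sentence_list]    # stand-in for jieba.lcut(i)
document = [' '.join(i) for i in corpus_list]     # space-joined tokens
print(len(document))  # 2
```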

1. Bag of words (BoW), word to vector by word frequency

from sklearn.feature_extraction.text import CountVectorizer

  1. countvector = CountVectorizer()
  2. model_fit = countvector.fit(document)
    • print(model_fit)
      CountVectorizer(analyzer='word', binary=False, decode_error='strict',
      dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
      lowercase=True, max_df=1.0, max_features=None, min_df=1,
      ngram_range=(1, 1), preprocessor=None, stop_words=None,
      strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
      tokenizer=None, vocabulary=None)
    • print(model_fit.vocabulary_), why are there no single-character words here? The default token_pattern='(?u)\\b\\w\\w+\\b' regex only matches tokens of two or more characters. Passing token_pattern=r"(?u)\b\w+\b" (a raw string, so each \b survives) makes single-character tokens count too.
      {'稀疏': 5, '矩阵': 4, '大部分': 1, '组成': 7, '稠密': 6, '有所区别': 2, '主要': 0, '特点': 3}
    • sort a dict
      sort_dict_list = sorted(model_fit.vocabulary_.items(), key=lambda x: x[1], reverse=False)
      [('主要', 0),('大部分', 1),('有所区别', 2),('特点', 3),('矩阵', 4),('稀疏', 5),('稠密', 6),('组成', 7)]
  3. model_transform = model_fit.transform(document)
    • print(model_transform), sparse-matrix representation: one (row_idx, col_idx) value entry per nonzero cell
      (0, 1)    1
      (0, 4)    3
      (0, 5)    1
      (0, 7)    1
      (1, 0)    1
      (1, 2)    1
      (1, 3)    1
      (1, 4)    1
      (1, 6)    1
    • model_transform.toarray(), dense-matrix representation: one vector per sentence, laid out in sorted-vocabulary order, with each position holding that word's frequency
      array([[0, 1, 0, 0, 3, 1, 0, 1],
           [1, 0, 1, 1, 1, 0, 1, 0]])
  4. All in one

    model = CountVectorizer()
    result = model.fit_transform(document)
    print(result)
    print(result.toarray())
    print(model.get_feature_names())   # get_feature_names_out() in scikit-learn >= 1.0
      (0, 7)    1
      (0, 1)    1
      (0, 4)    3
      (0, 5)    1
      (1, 3)    1
      (1, 0)    1
      (1, 2)    1
      (1, 6)    1
      (1, 4)    1
    
      [[0 1 0 0 3 1 0 1]
       [1 0 1 1 1 0 1 0]]
    
      ['主要', '大部分', '有所区别', '特点', '矩阵', '稀疏', '稠密', '组成']
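The single-character behavior described in step 2 comes down to the regex alone, which can be seen with plain re; passing the relaxed pattern to CountVectorizer via token_pattern has the same effect on the vocabulary.

```python
import re

sentence = '这 是 和 稠密 矩阵 有所区别 的 主要 特点'
# Default CountVectorizer pattern: requires at least two word characters,
# so single-character tokens like '是' and '的' are dropped.
default_tokens = re.findall(r"(?u)\b\w\w+\b", sentence)
# Relaxed pattern: \w+ also keeps single-character tokens.
relaxed_tokens = re.findall(r"(?u)\b\w+\b", sentence)
print(default_tokens)  # ['稠密', '矩阵', '有所区别', '主要', '特点']
print(relaxed_tokens)  # all nine tokens, including '这', '是', '和', '的'
```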

2. Word to vector by TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

  1. tfidf = TfidfVectorizer()
  2. model_fit = tfidf.fit(document)
    • print(model_fit.vocabulary_)
      {'稀疏': 5, '矩阵': 4, '大部分': 1, '组成': 7, '稠密': 6, '有所区别': 2, '主要': 0, '特点': 3}
    • sort a dict
      sort_dict = sorted([i for i in model_fit.vocabulary_.items()], key=lambda x:x[1], reverse=False)
      [('主要', 0),('大部分', 1),('有所区别', 2),('特点', 3),('矩阵', 4),('稀疏', 5),('稠密', 6),('组成', 7)]
  3. model_transform = model_fit.transform(document)
    • print(model_transform)
      (0, 7)    0.3637880261736418
      (0, 5)    0.3637880261736418
      (0, 4)    0.7765145304745155
      (0, 1)    0.3637880261736418
      (1, 6)    0.47107781233161794
      (1, 4)    0.33517574332792605
      (1, 3)    0.47107781233161794
      (1, 2)    0.47107781233161794
      (1, 0)    0.47107781233161794
    • model_transform.toarray()
      array(
      [[0.        , 0.36378803, 0.        , 0.        , 0.77651453, 0.36378803, 0.        , 0.36378803],
      [0.47107781, 0.        , 0.47107781, 0.47107781, 0.33517574, 0.        , 0.47107781, 0.        ]])
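The TfidfTransformer from the issue title never appears above; it is the counts-to-tf-idf half of TfidfVectorizer. A minimal sketch of the equivalence, assuming scikit-learn defaults (smooth_idf=True, norm='l2'):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
import numpy as np

document = ['稀疏 矩阵 是 由 大部分 为 零 的 矩阵 组成 的 矩阵 ,',
            '这 是 和 稠密 矩阵 有所区别 的 主要 特点 。']

counts = CountVectorizer().fit_transform(document)   # BoW counts (section 1)
tfidf_a = TfidfTransformer().fit_transform(counts)   # counts -> tf-idf
tfidf_b = TfidfVectorizer().fit_transform(document)  # same thing in one step

print(np.allclose(tfidf_a.toarray(), tfidf_b.toarray()))  # True
```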

Reference:

  1. Scikit-learn CountVectorizer and TfidfVectorizer
  2. sklearn: TfidfVectorizer for Chinese text and its main parameters
  3. Text preprocessing: CountVectorizer, TfidfTransformer, and TfidfVectorizer in sklearn
  4. An introduction to sparse matrices in machine learning (with Python code)
davidkorea commented 6 years ago

# coding=utf-8
from sklearn.feature_extraction.text import TfidfVectorizer

document = ["I have a pen.",
            "I have an apple."]
tfidf_model = TfidfVectorizer().fit(document)
sparse_result = tfidf_model.transform(document)     # tf-idf matrix in sparse representation
print(sparse_result)
# (0, 3)    0.814802474667
# (0, 2)    0.579738671538
# (1, 2)    0.449436416524
# (1, 1)    0.631667201738
# (1, 0)    0.631667201738
print(sparse_result.todense())                     # convert to an ordinary dense matrix
# [[ 0.          0.          0.57973867  0.81480247]
#  [ 0.6316672   0.6316672   0.44943642  0.        ]]
print(tfidf_model.vocabulary_)                      # word-to-column mapping
# e.g. {'have': 2, 'pen': 3, 'an': 0, 'apple': 1} (key order may vary)
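The 0.8148/0.5797 values in row 0 can be reproduced by hand from sklearn's default formula (smoothed idf, l2 normalization); note that 'i' and 'a' never enter the vocabulary because the default token_pattern drops single-character tokens.

```python
import math

n_docs = 2
# smoothed idf: log((1 + n_docs) / (1 + df)) + 1
idf_have = math.log((1 + n_docs) / (1 + 2)) + 1   # 'have' appears in both docs
idf_pen  = math.log((1 + n_docs) / (1 + 1)) + 1   # 'pen' appears in doc 0 only
weights = [idf_have, idf_pen]                      # tf = 1 for each term
norm = math.sqrt(sum(w * w for w in weights))      # l2 norm of the row
print([round(w / norm, 6) for w in weights])
# [0.579739, 0.814802] -- matches row 0 of the output above
```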

Official guide