AlexiaChen / AlexiaChen.github.io

My Blog https://github.com/AlexiaChen/AlexiaChen.github.io/issues
88 stars 11 forks source link

用马尔科夫链来做自动补全 #48

Open AlexiaChen opened 4 years ago

AlexiaChen commented 4 years ago

title: 用马尔科夫链来做自动补全 date: 2019-03-10 15:26:00 tags:

前言

马尔科夫链是个什么呢?如果你去网上搜索, 噼里啪啦一堆公式,可能你会看的不耐烦了。当然,你可能见到有些人总结了一句经典的话,比如:

当然,你可以在知乎这里看到更多的解释。

这样的性质就给我们带来了预测下一个状态带来了可能,比如天气预报,智能联想提示等等。

今天我们就用python来用马尔科夫链来做点有实际意义的事情,用更直观的角度理解它。

正文是我用python脚本来分析了一本英文书,把这本书的英文句子作为一个训练集,用马尔科夫链来做处理产生短文,短文句子有一定“联想”,这样也可以算作是一个自动补全的程序了。比如,一些聊天工具和输入法可能就有这样的功能,基于你以前历史的语句输入来判断你当前输入后面可能会输入的单词或语句作提示或者补全。

网络社区也有一些爱好者用马尔科夫链来生成鸡汤文, 都算是一定性质的科普文章。

正文

首先,需要随便下载一个英文小说或者书籍来作为训练集合,英文当然越正宗越地道越好,这样的数据质量比较高,之后生成的段落语句结果也更好些。这里我下载了哈利波特1-7的英文txt。哈哈,英文原版我只读了第一部魔法石,惭愧啊。

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

# python 3! due to random.choices()

import operator, random
from collections import defaultdict

def remove_empty_words(l):
    return list(filter(lambda a: a != '', l))

def update_occ(d, seq, w):
    if seq not in d:
        d[seq]=defaultdict(int)

    d[seq][w]=d[seq][w]+1

# https://docs.python.org/3/library/random.html#random.choice
def gen_random_from_table(t):
    return random.choices(list(t.keys()), weights=list(t.values()))[0]

def process(): 
    with open ("harry-potter.txt", "r") as myfile:
        data=myfile.read()

    sentences=data.lower().replace('\r',' ').replace('\n',' ').replace('?','.').replace('!','.').replace('“','.').replace('”','.').replace("\"",".").replace('‘',' ').replace('-',' ').replace('’',' ').replace('\'',' ').split(".")

    # key=list of words, as a string, delimited by space
    #     (I would use list of strings here as key, but list in not hashable)
    # val=dict, k: next word; v: occurrences
    first={}
    second={}

    for s in sentences:
        words=s.replace(',',' ').split(" ")
        words=remove_empty_words(words)
        if len(words)==0:
            continue
        for i in range(len(words)):
            if i>=1:
                update_occ(first, words[i-1], words[i])
            if i>=2:
                update_occ(second, words[i-2]+" "+words[i-1], words[i])

    text=["magic", "is"]

    text_len=len(text)

    for i in range(200):

        last_idx=text_len-1

        tmp=text[last_idx-1]+" "+text[last_idx]
        if tmp in second:
            new_word=gen_random_from_table(second[tmp])
        else:
            # fall-back to 1st order
            tmp2=text[last_idx]
            if tmp2 not in first:
                # dead-end
                break
            new_word=gen_random_from_table(first[tmp2])

        text.append(new_word)
        text_len=text_len+1

    print (" ".join(text))

# main function
process()

运行以上代码,就可以生成一个150来个单词的"联想"短文了:

magic is not having practised vanishing spells horribly difficult at first but they two thousand racing broom tampering and torture; 
an interview said harry leaping to their bulky backpacks the nine of the cottages were fewer here and make it too far; 
it rolled over toward him holding a wand to shoulder wands raised james glancing over his head pound so he went as red as ron who clambered to
 his determination to shut his eyes wanting to know anything new right before you go upstairs with ron eating their way around the walls and 
 all their shock and exhaustion that the phoenix but after a few horrible seconds he did to open the lestranges vault only once a month 
 later harry kept his eyes was twitching slightly hair whipped back off the 
 pitch but all three of my followers would give better cover they extinguished their fire then he sat down when something of a voice 
 and called out of pure venom at harry his temper getting the exit and when sounds of someone stumbling ftom a room like a venomous bubble compressing his lungs driving all other concerns fled 
 harry s head with a large fir tree blocking the top

当然这段短文有点狗屁不通,甚至有点搞笑,我们现在先不管。

我们先来看看代码中的first和second变量,它们是一个字典。

first变量表示第一个单词后出现的单词的概率统计,这是一阶马尔科夫链生成的数据结果。

我们通过python把first的结果排序打印出来好了:

print ("first table:")
for k in first:
    print (k)
    # https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
    s=sorted(first[k].items(), key=operator.itemgetter(1), reverse=True)
    print (s[:20])
    print ("")
harry
[('s', 1344), ('and', 776), ('had', 657), ('was', 475), ('could', 400), ('potter', 385), ('said', 358), ('saw', 271), ('looked', 241), ('felt', 238), ('ron', 233), ('who', 201), ('thought', 172), ('asked', 163), ('as', 160), ('knew', 156), ('did', 125), ('i', 115), ('to', 108), ('in', 106)]

potter
[('and', 63), ('s', 59), ('said', 28), ('is', 26), ('you', 23), ('has', 21), ('i', 18), ('the', 17), ('was', 13), ('he', 13), ('sir', 13), ('will', 10), ('had', 10), ('she', 10), ('in', 9), ('must', 9), ('that', 9), ('stinks', 7), ('not', 6), ('come', 5)]

and
[('the', 1225), ('hermione', 954), ('he', 907), ('harry', 662), ('a', 605), ('then', 554), ('ron', 550), ('i', 541), ('his', 363), ('george', 341), ('it', 298), ('they', 289), ('you', 253), ('she', 243), ('was', 232), ('that', 211), ('there', 190), ('saw', 187), ('said', 171), ('looked', 161)]

the
[('door', 770), ('dark', 637), ('room', 629), ('other', 548), ('ministry', 477), ('first', 455), ('floor', 408), ('same', 386), ('way', 362), ('end', 362), ('castle', 358), ('air', 334), ('ground', 326), ('only', 326), ('back', 317), ('last', 296), ('table', 292), ('rest', 284), ('front', 263), ('death', 260)]

sorcerer
[('s', 24), ('in', 3), ('of', 1), ('spoke', 1)]

好了,我们现在再来看看second变量的结果,它表示两个单词之后出现的每个单词的概率统计,这是二阶马尔科夫链的结果,依次类推,还有3阶马尔科夫链,只不过代码里面没写。

print ("second table:")
    for k in second:
        print (k)
        # https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
        s=sorted(second[k].items(), key=operator.itemgetter(1), reverse=True)
        print (s[:20])
        print ("")
which made
[('the', 3), ('it', 2), ('harry', 2), ('drills', 1), ('professor', 1), ('ginny', 1), ('his', 1), ('ron', 1), ('him', 1), ('a', 1)]

he was
[('going', 96), ('not', 88), ('a', 66), ('still', 65), ('sure', 50), ('in', 47), ('doing', 43), ('the', 42), ('looking', 40), ('wearing', 38), ('so', 32), ('about', 29), ('trying', 29), ('on', 27), ('being', 25), ('very', 22), ('lying', 22), ('standing', 22), ('too', 19), ('supposed', 19)]

was a
[('very', 48), ('loud', 39), ('pause', 35), ('great', 35), ('long', 25), ('bit', 22), ('little', 20), ('good', 19), ('large', 16), ('moment', 16), ('flash', 15), ('few', 14), ('small', 14), ('tiny', 13), ('wizard', 12), ('bang', 12), ('nasty', 11), ('knock', 11), ('sudden', 11), ('lot', 10)]

a big
[('mistake', 3), ('deal', 3), ('plastic', 2), ('supporter', 2), ('party', 2), ('man', 2), ('beefy', 1), ('city', 1), ('argument', 1), ('old', 1), ('photograph', 1), ('green', 1), ('lumpy', 1), ('black', 1), ('red', 1), ('thing', 1), ('drop', 1), ('enough', 1), ('bet', 1), ('bearded', 1)]

big beefy
[('man', 1)]

好了,这里先暂时止步探索,你会发现二阶马尔科夫链的结果更加精确,生成的数据集也更加小。这不废话嘛,就像猜谜语一样,提示给的越精确越多,那么你猜中的概率越大,解的空间范围就缩小了,second表也就更小了。哈哈,开始有点意思了。

好的,我们现在在二阶马尔科夫链的基础上加入3阶马尔科夫链,按照2阶的代码修改就行了:

    first={}
    second={}
    third={}

    for s in sentences:
        words=s.replace(',',' ').split(" ")
        words=remove_empty_words(words)
        if len(words)==0:
            continue
        for i in range(len(words)):
            if i>=1:
                update_occ(first, words[i-1], words[i])
            if i>=2:
                update_occ(second, words[i-2]+" "+words[i-1], words[i])
            if i>=3:
                update_occ(third, words[i-3]+" "+words[i-2]+" "+words[i-1],words[i])
were proud to
[('say', 1)]

proud to say
[('that', 1)]

to say that
[('he', 3), ('she', 3), ('i', 2), ('they', 1), ('s', 1), ('you', 1), ('his', 1), ('from', 1), ('your', 1), ('at', 1), ('on', 1), ('age', 1), ('dumbledore', 1), ('it', 1)]

say that they
[('were', 1), ('know', 1)]

that they were
[('not', 9), ('all', 7), ('there', 3), ('going', 3), ('alone', 2), ('being', 2), ('indeed', 2), ('in', 2), ('a', 2), ('the', 2), ('still', 2), ('supposed', 2), ('no', 2), ('really', 2), ('perfectly', 1), ('related', 1), ('proud', 1), ('witches', 1), ('fifty', 1), ('already', 1)]

从以上可以看出,3阶显然更精确,并且third表也更小了。

以下为了让各位观众更清晰的看到3阶马尔科夫链的生成结果更准确,我用更加可读的输出打印出来:


    tests=["i can tell", "who did this", "she was a", "he was a", "i did not",
        "wanted to do", "you will find"]

    for test in tests:
        test_words=test.split(" ")

        test_len=len(test_words)
        last_idx=test_len-1

        if test_len>=3:
            tmp=test_words[last_idx-2]+" "+test_words[last_idx-1]+" "+test_words[last_idx]
            if tmp in third:
                print ("* third order. for sequence:",tmp)
                print_stat(third[tmp])

        if test_len>=2:
            tmp=test_words[last_idx-1]+" "+test_words[last_idx]
            if tmp in second:
                print ("* second order. for sequence:", tmp)
                print_stat(second[tmp])

        if test_len>=1:
            tmp=test_words[last_idx]
            if tmp in first:
                print ("* first order. for word:", tmp)
                print_stat(first[tmp])
        print ("")

输出:

* third order. for sequence: i can tell
you 42%
yeh 14%
who 7%
he 7%
mum 7%
* second order. for sequence: can tell
me 25%
you 20%
us 10%
yeh 5%
him 5%
* first order. for word: tell
you 17%
me 14%
him 10%
us 7%
them 5%

* second order. for sequence: did this
mean 27%
but 9%
year 9%
weird 9%
band 9%
* first order. for word: this
is 10%
time 6%
was 5%
year 2%
one 2%

* third order. for sequence: she was a
witch 12%
bit 12%
great 8%
very 8%
freak 4%
* second order. for sequence: was a
very 3%
loud 3%
pause 2%
great 2%
long 2%
* first order. for word: a
few 2%
little 2%
bit 1%
moment 1%
very 1%

* third order. for sequence: he was a
bit 7%
good 7%
wizard 6%
death 6%
big 4%
* second order. for sequence: was a
very 3%
loud 3%
pause 2%
great 2%
long 2%
* first order. for word: a
few 2%
little 2%
bit 1%
moment 1%
very 1%

* third order. for sequence: i did not
want 12%
have 8%
know 8%
ask 4%
fasten 4%
* second order. for sequence: did not
know 8%
want 7%
seem 6%
answer 4%
look 3%
* first order. for word: not
to 7%
be 3%
a 2%
have 2%
the 1%

* third order. for sequence: wanted to do
was 33%
the 16%
but 8%
with 8%
his 8%
* second order. for sequence: to do
with 20%
it 14%
the 4%
something 3%
was 2%
* first order. for word: do
you 20%
it 12%
not 8%
with 6%
that 3%

* third order. for sequence: you will find
that 28%
a 14%
unless 7%
everything 7%
on 7%
* second order. for sequence: will find
that 23%
it 14%
me 9%
a 9%
uses 4%
* first order. for word: find
out 19%
a 8%
the 7%
it 5%
him 4%

从上面的输出可以看出,i can tell 接 you这个单词的概率42%,要比can tell 接 you这个单词的概率20% 要大。显然结果就更精确了。如果调试得当,那么生成的短文结果可能还真看起来像那么回事。

总结

最后,提示一下,first,second,third这三张表可以结合起来使用,third这张表的权重显然更高,依次类推。如果你正在做一个输入法或者聊天提示的功能,可以简单采用这样的方法,先收集用户的输入语句等信息作为训练集合,来构建这几张表,这几张表的优先级最高,优先级最低的是常用的固定成语,词语的搭配,还有流行网络用语等等。