elvinpoon / tensorflow-CWS-LSTM


What is the windowed LSTM model? #2

Open lengyuewuyazui opened 7 years ago

lengyuewuyazui commented 7 years ago

Q1: What is the windowed LSTM model? Is it the model presented in this paper?

Q2: How do I transform raw data like 迈向 充满 希望 的 新 世纪 —— 一九九八年 新年 讲话 （ 附 图片 1 张 ） into the form of your training data sets?

elvinpoon commented 7 years ago

For Q1: it's what this describes. "Windowed" means that you put the surrounding characters together with the current one as the input vector. For example, if you have the sentence '开心啊' -> [1,2,3] (the characters' indices in the vocabulary table), you transform it to [[B,1,2],[1,2,3],[2,3,E]] as input (B and E are padding indices). Finally, this turns into [[embed(B)+embed(1)+embed(2)], [embed(1)+embed(2)+embed(3)], [embed(2)+embed(3)+embed(E)]].
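
A minimal sketch of that windowing step (illustrative only, not this repo's code; it assumes "put together" means concatenating the embeddings, and uses a window of 3 to match the example above):

import numpy as np

def window_inputs(ids, embedding, B, E):
    # ids: character indices of one sentence, e.g. [1, 2, 3] for '开心啊'
    # embedding: (vocab_size, dim) lookup table; B, E: padding indices
    padded = [B] + list(ids) + [E]
    windows = [padded[i:i + 3] for i in range(len(ids))]  # [[B,1,2],[1,2,3],[2,3,E]]
    # each window becomes one input vector: the concatenated embeddings
    return np.array([np.concatenate([embedding[j] for j in w]) for w in windows])

emb = np.random.randn(10, 4)                 # toy table: vocab of 10, 4-dim vectors
x = window_inputs([1, 2, 3], emb, B=8, E=9)
print(x.shape)                               # (3, 12): one 3*4-dim vector per step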

For Q2: you can search online for 'CRF中文分词' (CRF-based Chinese word segmentation) and refer to this article. Here's the code you can use.

#!/usr/bin/env python
#-*-coding:utf-8-*-

#4-tags for character tagging: B(Begin),E(End),M(Middle),S(Single)

import codecs
import sys

def character_tagging(input_file, output_file):
    """Convert a space-segmented corpus into one character per line,
    tagged B (begin), M (middle), E (end), or S (single)."""
    input_data = codecs.open(input_file, 'r', 'utf-8')
    output_data = codecs.open(output_file, 'w', 'utf-8')
    for line in input_data.readlines():
        word_list = line.strip().split()
        for word in word_list:
            if len(word) == 1:
                output_data.write(word + "\tS\n")
            else:
                output_data.write(word[0] + "\tB\n")
                for w in word[1:-1]:
                    output_data.write(w + "\tM\n")
                output_data.write(word[-1] + "\tE\n")
        output_data.write("\n")
    input_data.close()
    output_data.close()

if __name__ == '__main__':
    if len(sys.argv) != 3:
        print ("Usage: python " + argv[0] + " input output")
        sys.exit(-1)
    input_file = sys.argv[1]
    output_file = sys.argv[2]
    character_tagging(input_file, output_file)
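
Running this on a file containing, for example, the line 迈向 充满 希望 yields one character per line with its tab-separated tag:

迈	B
向	E
充	B
满	E
希	B
望	E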
lengyuewuyazui commented 7 years ago

Thank you for your help! I tested it and hit some bugs, using the training data you provided and the command python model_LSTM.py --train=data/trainSeg.txt --model=model --iters=50

tensorflow version: 0.11.0rc1, numpy version: 1.11.2, gensim version: 0.13.3

Here is the log:

embedding: data/char2vec_50.model
max len 150
stddev: 0.100000
hi
Vocab Size: 7008
Training Samples: 100
Valid Samples 100
Layers: 1
Hidden Size: 50
Embedding size: 50
Window Size 4
Norm 7
num of batches 5
process: 0.000 ErrorRate: 41.291848 cost 0.685807 132 wps
num of batches 0
Epoch: 1 Train accuracy: 0.606
Traceback (most recent call last):
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/model_LSTM.py", line 367, in <module>
    tf.app.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 43, in run
    sys.exit(main(sys.argv[:1] + flags_passthrough))
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/model_LSTM.py", line 360, in main
    valid_accuracy = run_epoch(session, m, valid_data, tf.no_op(), cmodel, verbose=False)
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/model_LSTM.py", line 241, in run_epoch
    model.config.left_window, model.config.right_window, num_class=model.config.num_class)):
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/utils.py", line 146, in batch_iter
    x, y, l, size = generate_batches(data, batch_size, num_steps, char_embedding, num_class, left, right)
  File "/home/king/others/tensorflow-CWS-LSTM-master/tensorflow-CWS-LSTM-master/utils.py", line 213, in generate_batches
    x[n_batch][batch_cnt][pos] = new_x
IndexError: index 0 is out of bounds for axis 0 with size 0

elvinpoon commented 7 years ago

Try lstm_build instead; the old script doesn't work, I just forgot to delete it. I've also modified the command in the README: you should use python model_lstm_build.py --train=data/trainSeg.txt --model=model --iters=50 instead. I haven't touched this repo for a long time, so many things have changed...

lengyuewuyazui commented 7 years ago

There are two tiny bugs in lstm_build.py that interrupted the run:
1. an indentation error at line 108
2. a decorator error on lazy_property at line 84

Maybe you could fix them the next time you push.

elvinpoon commented 7 years ago

I've fixed them in my new commit last night; you can try pulling it. Some version-control issues had kept me from uploading the bug-free code...

lengyuewuyazui commented 7 years ago

Thanks! I infer that you used the Word2Vec model from the gensim package to produce the .model file. I'd like to know the parameters you passed to this class: class gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)
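
Here's my guess, a minimal sketch of how char2vec_50.model might have been trained (gensim 0.13.x API; size=50 matches the "Embedding size: 50" in the log above, while the corpus file name and the other parameters are just assumptions):

import codecs
from gensim.models import Word2Vec

def char_sentences(path):
    # treat each line of the corpus as one "sentence" of single characters
    with codecs.open(path, 'r', 'utf-8') as f:
        for line in f:
            chars = [c for c in line.strip() if not c.isspace()]
            if chars:
                yield chars

sentences = list(char_sentences('corpus.txt'))  # hypothetical corpus file
model = Word2Vec(sentences, size=50, window=5, min_count=1, workers=4)
model.save('char2vec_50.model')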

If it's convenient for you to share your actual code, it would be very helpful.

Thank you very much!