chen0040 / keras-text-summarization

Text summarization using seq2seq in Keras
MIT License
290 stars 128 forks source link

Similar task has some problem #3

Open Tigeryang93 opened 6 years ago

Tigeryang93 commented 6 years ago

Hi, I am working on a task very similar to yours, except that its inputs and outputs are Chinese. The framework is also seq2seq, and I wrote my code the same way as yours. When I run it, the training accuracy gets very high, but when I test it, the decoded output is always "的的的的的的的的的的的的的的" or "哪哪哪哪哪哪哪哪哪哪哪哪" or "PADPADPADPADPADPADPADPADPADPADPAD". I have no idea why. The model code looks like this:

encoder model

```python
embedding_size = 50
encoder_inputs = Input(shape=(None,))
en_x = Embedding(vocab_size, embedding_size)(encoder_inputs)

encoder = LSTM(50, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
encoder_states = [state_h, state_c]
```

decoder model

```python
decoder_inputs = Input(shape=(None,))
dex = Embedding(vocab_size, embedding_size)
final_dex = dex(decoder_inputs)
decoder_lstm = LSTM(50, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
```

model

```python
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
```

and the batch data is generated like this:

```python
def mygenerator(batch_size):
    max_batch_index = len(trainx) // batch_size
    i = 0
    while 1:
        batch_trainy_categ = to_categorical(
            trainy[i*batch_size:(i+1)*batch_size].reshape(batch_size*max_sentB_len),
            num_classes=vocab_size)
        batch_trainy_categ = np.array(batch_trainy_categ).reshape(-1, max_sentB_len, vocab_size)
        batch_trainx = trainx[i*batch_size:(i+1)*batch_size]
        batch_trainy = trainy[i*batch_size:(i+1)*batch_size]
        i += 1
        i = i % max_batch_index

        print('batch data:')
        # print(batch_trainx[:1])
        # print(batch_trainy[:1])
        # print(batch_trainy_categ[:1])
        yield ([batch_trainx, batch_trainy], batch_trainy_categ)
```

```python
model.fit_generator(mygenerator(128), steps_per_epoch=len(trainx) // 128, epochs=1,
                    verbose=1, validation_data=([testx, testy], testy_catey))
```

Can you give me some advice on how to debug this, or on what the cause might be? Thank you.
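One thing worth checking in the generator above: with teacher forcing, the target at step t should be the decoder input at step t+1, i.e. the target sequence shifted left by one position. Here the same `trainy` array is fed as both the decoder input and the one-hot target, so the model can score high accuracy by learning to copy its input. A minimal numpy sketch of the shift, using a hypothetical padded sequence (0 = PAD, 1 = SOS, 2 = EOS):

```python
import numpy as np

PAD, SOS, EOS = 0, 1, 2

# A hypothetical padded target sequence: SOS w w w EOS PAD
seq = np.array([SOS, 5, 9, 7, EOS, PAD])

# Teacher forcing: the decoder reads the sequence up to (but not
# including) the final position, and must predict the same sequence
# shifted left by one token.
decoder_input = seq[:-1]    # starts with SOS
decoder_target = seq[1:]    # ends with EOS (then padding)

print(decoder_input.tolist())   # [1, 5, 9, 7, 2]
print(decoder_target.tolist())  # [5, 9, 7, 2, 0]
```

The one-hot `batch_trainy_categ` would then be built from `decoder_target`, not from the same array that goes into the decoder input.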

chen0040 commented 6 years ago

@babyhuzi111 One possibility may be the tokenizer. If you can share your training Chinese text file with me, I can try it with my models and let you know.

Tigeryang93 commented 6 years ago

The data looks like this: each line has two sentences, where the left is the raw sentence and the right is the target sentence. Thank you.


Large attachment sent via QQ Mail:

data_clean.raw (133.01M, link expires 2018-05-17 09:43). Download page: http://mail.qq.com/cgi-bin/ftnExs_download?k=276133357cd5cbc755df400a4530561f0157040d510253064900550c061d500500041e0d0254061d54525153065150005252520c633e64540515526a005c01510a4f4154143059&t=exs_ftn_download&code=da35c0d0

Tigeryang93 commented 6 years ago

I forgot to mention that some lines have only one sentence; just ignore those.


fdujuan commented 6 years ago

@babyhuzi111 At test time I also get a single word repeated over and over. Have you solved this problem?

kevin369ml commented 5 years ago

@babyhuzi111 At test time I also get a single repeated word. Have you solved this? How are you running prediction? I think your prediction step may be the problem.
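For anyone hitting this: a common cause of the repeated-token symptom is reusing the training model for prediction instead of decoding step by step with separate inference models built from the trained layers. A minimal sketch of greedy step-by-step decoding, assuming the same layer structure as in the issue and hypothetical `sos_id`/`eos_id` special tokens (with untrained weights the output is arbitrary; the point is only the decoding loop):

```python
import numpy as np
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense
from tensorflow.keras.models import Model

vocab_size, embedding_size, units = 20, 50, 50
sos_id, eos_id = 1, 2  # hypothetical special-token ids

# Training-time layers, same structure as in the issue.
encoder_inputs = Input(shape=(None,))
en_x = Embedding(vocab_size, embedding_size)(encoder_inputs)
encoder = LSTM(units, return_state=True)
_, state_h, state_c = encoder(en_x)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None,))
dex = Embedding(vocab_size, embedding_size)
decoder_lstm = LSTM(units, return_sequences=True, return_state=True)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs, _, _ = decoder_lstm(dex(decoder_inputs), initial_state=encoder_states)
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Inference models reuse the (trained) layers.
encoder_model = Model(encoder_inputs, encoder_states)

state_h_in = Input(shape=(units,))
state_c_in = Input(shape=(units,))
dec_out, h_out, c_out = decoder_lstm(dex(decoder_inputs),
                                     initial_state=[state_h_in, state_c_in])
dec_out = decoder_dense(dec_out)
decoder_model = Model([decoder_inputs, state_h_in, state_c_in],
                      [dec_out, h_out, c_out])

def decode_sequence(input_seq, max_len=10):
    # Encode the source once, then generate one token at a time,
    # feeding each sampled token (and the LSTM states) back in.
    h, c = encoder_model.predict(input_seq, verbose=0)
    target = np.array([[sos_id]])
    decoded = []
    for _ in range(max_len):
        out, h, c = decoder_model.predict([target, h, c], verbose=0)
        token = int(np.argmax(out[0, -1, :]))
        if token == eos_id:
            break
        decoded.append(token)
        target = np.array([[token]])
    return decoded

print(decode_sequence(np.array([[3, 4, 5]])))
```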