Open Tigeryang93 opened 6 years ago
@babyhuzi111 One possibility may be the tokenizer. If you can share your training Chinese text file with me, I can try it with my models and let you know.
The data looks like this: each line has two sentences, where the left is the raw sentence and the right is the target sentence. Thank you.
Oversized attachment sent from QQ Mail:
data_clean.raw (133.01 MB, available until 2018-05-17 09:43). Download page: http://mail.qq.com/cgi-bin/ftnExs_download?k=276133357cd5cbc755df400a4530561f0157040d510253064900550c061d500500041e0d0254061d54525153065150005252520c633e64540515526a005c01510a4f4154143059&t=exs_ftn_download&code=da35c0d0
I forgot to mention that some lines have only one sentence; just skip those.
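For reference, loading sentence pairs while skipping single-sentence lines could be sketched like this. The tab separator is an assumption; adjust it to whatever delimiter data_clean.raw actually uses:

```python
# Sketch: parse lines of "raw<SEP>target"; the separator is an assumption.
def parse_pairs(lines, sep="\t"):
    pairs = []
    for line in lines:
        parts = line.rstrip("\n").split(sep)
        if len(parts) == 2:  # lines with only one sentence are skipped
            pairs.append((parts[0], parts[1]))
    return pairs
```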
@babyhuzi111 My test results also produce a single repeated word. Have you solved this problem?
@babyhuzi111 My test results also give a repeated word. How are you doing prediction? I think your prediction code may have a problem.
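Repeated output tokens at test time often mean the inference step differs from training: during training the decoder sees ground-truth previous tokens (teacher forcing), but at inference it must feed back its own predictions step by step. A minimal sketch of greedy decoding, where `decode_step` is a hypothetical stand-in for one call to the decoder (it takes the previous token id and the recurrent state and returns next-token probabilities plus the new state):

```python
import numpy as np

def greedy_decode(decode_step, state, start_id, end_id, max_len=20):
    # Feed the model's own previous prediction back in at each step.
    token, out = start_id, []
    for _ in range(max_len):
        probs, state = decode_step(token, state)
        token = int(np.argmax(probs))
        if token == end_id:
            break
        out.append(token)
    return out
```

If instead the training model is called once on PAD-filled decoder inputs, every step conditions on PAD and the argmax tends to collapse to one frequent token, which matches the "的的的…" / "PADPAD…" symptom.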
Hi, I am working on a task very similar to yours, except that its inputs and outputs are Chinese. The framework is also seq2seq, and I wrote my code the same way as yours. When I run it, the training accuracy gets very high, but when I test it, the decoded output is always "的的的的的的的的的的的的的的" or "哪哪哪哪哪哪哪哪哪哪哪哪" or "PADPADPADPADPADPADPADPADPADPADPAD". I have no idea why. The model code looks like this:
```python
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense

# encoder model
embedding_size = 50
encoder_inputs = Input(shape=(None,))
en_x = Embedding(vocab_size, embedding_size)(encoder_inputs)
encoder = LSTM(50, return_state=True)
encoder_outputs, state_h, state_c = encoder(en_x)
encoder_states = [state_h, state_c]

# decoder model
decoder_inputs = Input(shape=(None,))
dex = Embedding(vocab_size, embedding_size)
final_dex = dex(decoder_inputs)
decoder_lstm = LSTM(50, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(final_dex, initial_state=encoder_states)
decoder_dense = Dense(vocab_size, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# model
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
```
and the batch data is generated like this:

```python
import numpy as np
from keras.utils import to_categorical

def mygenerator(batch_size):
    max_batch_index = len(trainx) // batch_size
    i = 0
    while 1:
        batch_trainy_categ = to_categorical(
            trainy[i*batch_size:(i+1)*batch_size].reshape(batch_size*max_sentB_len),
            num_classes=vocab_size)
        batch_trainy_categ = np.array(batch_trainy_categ).reshape(
            -1, max_sentB_len, vocab_size)
        batch_trainx = trainx[i*batch_size:(i+1)*batch_size]
        batch_trainy = trainy[i*batch_size:(i+1)*batch_size]
        i += 1
        i = i % max_batch_index
        # yield encoder/decoder inputs and one-hot targets for this batch
        yield [batch_trainx, batch_trainy], batch_trainy_categ

print('batch data:')
model.fit_generator(mygenerator(128), steps_per_epoch=len(trainx) // 128,
                    epochs=1, verbose=1,
                    validation_data=([testx, testy], testy_catey))
```

Can you give me some advice about how to debug this or what the cause might be? Thank you.
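For what it's worth, the target reshaping that `to_categorical` performs in the generator above is equivalent to this plain-NumPy sketch, which makes the expected label shape `(batch, max_len, vocab)` explicit (the names `batch_trainy` and `vocab_size` are from the snippet above):

```python
import numpy as np

def one_hot_targets(batch_trainy, vocab_size):
    # Integer ids of shape (batch, max_len) -> one-hot (batch, max_len, vocab).
    return np.eye(vocab_size, dtype=np.float32)[batch_trainy]
```

Checking that the generator really emits this shape (and that the target ids are right-shifted relative to the decoder inputs) is a common first debugging step for seq2seq models.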