Interesting. What is the size of these two corpora? Didn't you specify any training batch_size? When you read the output, do you see good fluency?
Very neat. What is the current state of the art on these two tasks?
Also, you might try copy attention and a shared vocabulary as well. Those help for English.
Results from this paper look similar: Topic Sensitive Neural Headline Generation https://arxiv.org/pdf/1608.05777.pdf
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| Baseline | 34.7 | 22.9 | 32.5 |
| CopyNet | 34.4 | 21.6 | 31.3 |
| Topic-5 | 38.4 | 26.6 | 36.1 |
Huh, if you believe this other paper, you are already near the state of the art: A Semantic Relevance Based Neural Network for Text Summarization and Text Simplification https://arxiv.org/pdf/1710.02318.pdf
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
| --- | --- | --- | --- |
| Seq2seq (W) (Hu, Chen, and Zhu 2015) | 26.8 | 16.1 | 24.1 |
| Seq2seq (C) (Hu, Chen, and Zhu 2015) | 29.9 | 17.4 | 27.2 |
| Seq2seq-Attention (W) (Hu, Chen, and Zhu 2015) | 26.8 | 16.1 | 24.1 |
| Seq2seq-Attention (C) (Hu, Chen, and Zhu 2015) | 29.9 | 17.4 | 27.2 |
| COPYNET (C) (Gu et al. 2016) | 35.0 | 22.3 | 32.0 |
| SRB (C) (our proposal) | 33.3 | 20.0 | 30.1 |
@vince62s
Chinese Gigaword dataset
Training data: 2,233,820
Testing data: 54,669
Details: https://catalog.ldc.upenn.edu/LDC2003T09
LCSTS dataset
Training data: 2,400,591
Testing data: 725
Details: https://arxiv.org/abs/1506.05865v2
batch_size: I use the default value, 64.
Most output sentences are fluent and have meanings similar to the reference. However, because of how the evaluation metrics work, some sentences fail to get high scores, or even score zero, when their wording differs from the ground truth.
@srush
This is the state of the art for the LCSTS dataset, but the list is not complete.
@srush
I have run into trouble. When I change rnn_size from 300 to 500, the training loss does not always decrease; it rises at epoch 6. Do you know the potential reason?
Hard to say. What about trying another optimizer with an adaptive learning rate (Adam, Adadelta)?
Hmm, sometimes that happens if -max_grad_norm is too high. For English summarization we have been using -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1, maybe try that?
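For concreteness, here is a rough sketch of a complete training command with those flags, assuming the train.py interface from around this time; the data and model paths and the -max_grad_norm value below are placeholders for illustration, not settings taken from this thread:

python train.py -data data/cnndm -save_model models/cnndm_sum -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1 -max_grad_norm 2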
Are you working with words or characters?
If you want to try CopyNet, use -dynamic_dict -share_vocab during preprocessing and -copy_attn -global_attention mlp during training. You can also use a smaller vocabulary.
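In case it helps, a minimal end-to-end sketch of those two steps, assuming the preprocess.py / train.py options of the version discussed here; the file names, save paths, and vocabulary size below are placeholders:

python preprocess.py -train_src train.src.char.txt -train_tgt train.tgt.char.txt -valid_src valid.src.char.txt -valid_tgt valid.tgt.char.txt -save_data data/lcsts -dynamic_dict -share_vocab -src_vocab_size 8000

python train.py -data data/lcsts -save_model models/lcsts_copy -copy_attn -global_attention mlp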
Would love to try some of the other methods, particularly DRGN. I think minimum risk training will be slow, but we could try adding it.
Also, there is a decoding trick people sometimes try that we could add: it boosts the score when you copy unique bigrams from the source. I think it would help here. We can just add that to decoding (it would be a global scorer similar to the GNMT scorer we have).
@helson73 Thank you. I will try it.
@srush I will try it. Character-level input always beats word-level input in Chinese summarization (except for CopyNet).
I am interested in minimum risk training. Although it doesn't solve the practical problem, it is effective for pushing benchmark numbers.
Thank you for building this wonderful tool; it helps me a lot. If I find anything good, I will publish it.
Sounds great. A lot of my students are interested in summarization. So happy to see these benchmarks being added.
@srush I also have some questions. Does OpenNMT-py provide momentum or other good methods for adaptive learning rates? I think the learning_rate_decay schedule is not suitable.
In addition to sgd, we also implemented adagrad, adadelta and adam (http://opennmt.net/OpenNMT-py/options/train.html#optimization-type). It should be easy to add momentum by modifying https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Optim.py#L59, but my personal experience is that SGD with proper learning rate decay gives the best result.
@da03 Thank you !
My problem is that the validation loss does not decrease at epoch 3, so -learning_rate_decay kicks in at epoch 4 and training becomes very slow. Maybe I should modify the -learning_rate_decay value?
Hmm, I think that's fine. You might try a smaller initial learning rate to avoid this issue, but it shouldn't affect the final performance too much.
Feel free to change this line if you don't want it to decay too early:
https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Optim.py#L104
We can add other options if you think they make sense.
We could also have a "start_decay_min_epoch" option
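In the meantime, if the version you are running already exposes a -start_decay_at training option (an assumption worth checking with python train.py -h), you can push the decay back without editing the code, for example:

python train.py -data data/lcsts -save_model models/lcsts -optim sgd -learning_rate 1.0 -learning_rate_decay 0.5 -start_decay_at 8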
@da03 SGD with learning rate decay beats Adagrad and gives the best result?
That's my experience, in terms of final performance (maybe Ada methods converge faster?)
Adagrad can get better results than SGD in summarization with proper hyperparameters. You need to set adagrad_accumulator_init to ~0.1 and the learning rate to ~0.15 on the CNN/DM corpus; I assume similar parameters could work for the Chinese corpus.
Do you have a comparison against an extractive baseline?
@pltrdy No, I don't 😢
I used a different optimizer and got a better result on LCSTS: -optim adagrad -adagrad_accumulator_init 0.1 -learning_rate 0.15
LCSTS dataset
ROUGE-1 / ROUGE-2 / ROUGE-L: 35.67 / 23.06 / 33.14
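For anyone reproducing this, the run boils down to a command along these lines, assuming character-level LCSTS data; the paths below are placeholders rather than my exact setup:

python train.py -data data/lcsts_char -save_model models/lcsts_adagrad -optim adagrad -adagrad_accumulator_init 0.1 -learning_rate 0.15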
Hi @playma can we post your model here? http://opennmt.net/Models-py/
We are putting up baselines for a bunch of tasks.
@srush Sure, I am glad to share it.
Fantastic. Can you email @da03 with a link (maybe Google Drive) and he'll post it.
This model was not trained with the latest version of OpenNMT-py. Should I train a model on the latest version?
Hmm, the old version is fine. Can you send it to dengyuntian@g.harvard.edu when you have a chance? I can check whether it works with the current version. Thanks!
@playma
@da03 I just sent it to you.
@playma Do I need to do word segmentation? Which of the following input file formats is expected: "多家 在线 医疗 平台 近日 收到 苹果公司 要求 , 应用 内 购买 需要 使用 IAP 服务 , 并 缴纳 30% 的 交易所 得 。" or "多家在线医疗平台近日收到苹果公司要求,应用内购买需要使用IAP服务,并缴纳30%的交易所得。"?
Using the model you released, is the command line below correct? I keep getting bad output; summ.pt is the model you sent. python translate.py -model summ.pt -src input.jieba.txt -output output.txt -verbose -batch_size 1 -replace_unk -beam_size 10
@5118Python
The input should look like this:
多 家 在 线 医 疗 平 ...
Characters are split by a space.
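If it helps, a one-liner to turn raw (unsegmented) text into space-separated characters, assuming Python 3 with UTF-8 input and output; the file names are placeholders:

python -c "import sys; sys.stdout.writelines(' '.join(line.strip()) + '\n' for line in sys.stdin)" < input.txt > input.char.txt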
Hello @playma, thanks for the model. Have you trained an updated model using the Transformer on LCSTS? I tried that, but the performance is not good. By the way, can you share the model trained on the Chinese Gigaword dataset? Thanks so much.
This is the report on Chinese abstractive summarization performance. Welcome to discuss.
Result
Preprocessing script
Training script
Generating script