OpenNMT / OpenNMT-py

Open Source Neural Machine Translation and (Large) Language Models in PyTorch
https://opennmt.net/
MIT License

Report for Chinese Abstractive summarization performance #547

Closed playma closed 6 years ago

playma commented 6 years ago

This is the report on Chinese abstractive summarization performance. Discussion is welcome.

Result

LCSTS dataset
ROUGE-1 / ROUGE-2 / ROUGE-L
34.8 / 22.5 / 32.3

Gigaword Chinese dataset
ROUGE-1 / ROUGE-2 / ROUGE-L
51.92 / 38.39 / 49.12

Preprocessing script

python3 preprocess.py \
      -train_src $ORIGIN_DIR/train.source \
      -train_tgt $ORIGIN_DIR/train.target \
      -valid_src $ORIGIN_DIR/valid.source \
      -valid_tgt $ORIGIN_DIR/valid.target \
      -src_vocab_size 8000 \
      -tgt_vocab_size 8000 \
      -src_seq_length 400 \
      -tgt_seq_length 30 \
      -src_seq_length_trunc 400 \
      -tgt_seq_length_trunc 100 \
      -max_shard_size 20000000 \
      -save_data $DATA_DIR/processed

Training script

 python3 train.py \
      -data $DATA_DIR/processed \
      -word_vec_size 500 \
      -encoder_type brnn \
      -epochs 30 \
      -enc_layers 1 \
      -dec_layers 1 \
      -rnn_size 300 \
      -gpuid 0 \
      -save_model $MODEL_DIR/ \
      > $MODEL_DIR/log.txt

Generating script

 python3 translate.py \
      -model $MODEL_DIR/$BEST_MODEL \
      -beam_size 5 \
      -verbose \
      -batch_size 1 \
      -tgt $GOLD \
      -output $MODEL_DIR/$PRED \
      -src $TEST
vince62s commented 6 years ago

Interesting. What is the size of these two corpora? Didn't you specify a training batch_size? When you read the output, do you see good fluency?

srush commented 6 years ago

Very neat. What is the current state of the art on these two tasks?

You might also try copy attention and a shared vocabulary. Those help for English.

srush commented 6 years ago

Results from this paper look similar: Topic Sensitive Neural Headline Generation https://arxiv.org/pdf/1608.05777.pdf

Model     ROUGE-1  ROUGE-2  ROUGE-L
Baseline  34.7     22.9     32.5
CopyNet   34.4     21.6     31.3
Topic-5   38.4     26.6     36.1

srush commented 6 years ago

Huh, if you believe this other paper you are already near state of the art: A Semantic Relevance Based Neural Network for Text Summarization and Text Simplification https://arxiv.org/pdf/1710.02318.pdf

Model                                           ROUGE-1  ROUGE-2  ROUGE-L
Seq2seq (W) (Hu, Chen, and Zhu 2015)            26.8     16.1     24.1
Seq2seq (C) (Hu, Chen, and Zhu 2015)            29.9     17.4     27.2
Seq2seq-Attention (W) (Hu, Chen, and Zhu 2015)  26.8     16.1     24.1
Seq2seq-Attention (C) (Hu, Chen, and Zhu 2015)  29.9     17.4     27.2
COPYNET (C) (Gu et al. 2016)                    35.0     22.3     32.0
SRB (C) (our proposal)                          33.3     20.0     30.1

playma commented 6 years ago

@vince62s

Chinese Gigaword dataset: training data 2,233,820, testing data 54,669. Details: https://catalog.ldc.upenn.edu/LDC2003T09

LCSTS dataset: training data 2,400,591, testing data 725. Details: https://arxiv.org/abs/1506.05865v2

For batch_size, I use the default value of 64.

Most output sentences are fluent and have meanings similar to the reference. However, because of how the evaluation metric works, some sentences get low or even zero scores when their wording differs from the ground truth.

playma commented 6 years ago

@srush

This is the state of the art for the LCSTS dataset, but it is not complete. (Screenshot of the state-of-the-art results table.)

playma commented 6 years ago

@srush

I have run into trouble. When I change rnn_size from 300 to 500, the training loss no longer decreases steadily; it rises at epoch 6.

Do you know the potential reason?

(Screenshot of the training loss curve.)
helson73 commented 6 years ago

Hard to say. What about other optimizers based on adaptive learning rates (Adam, Adadelta)?

srush commented 6 years ago

Hmm, sometimes that happens if -max_grad_norm is too high. For English summarization we have been using -optim adagrad -learning_rate 0.15 -adagrad_accumulator_init 0.1, maybe try that?
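
Concretely, on top of your training script that would look roughly like this (a sketch only; every flag except the optimizer ones is taken from your script above, and the lower -max_grad_norm value is just an illustrative example):

 python3 train.py \
      -data $DATA_DIR/processed \
      -word_vec_size 500 \
      -encoder_type brnn \
      -enc_layers 1 \
      -dec_layers 1 \
      -rnn_size 300 \
      -optim adagrad \
      -learning_rate 0.15 \
      -adagrad_accumulator_init 0.1 \
      -max_grad_norm 2 \
      -gpuid 0 \
      -save_model $MODEL_DIR/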

Are you using word-level or character-level input?

If you want to try CopyNet, pass -dynamic_dict -share_vocab during preprocessing and -copy_attn -global_attention mlp during training. You can also use a smaller vocabulary.
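
As a rough sketch (the "_copy" suffixes on the output paths are only illustrative names), the CopyNet-style setup would add these flags to your preprocessing and training commands:

 python3 preprocess.py \
      -train_src $ORIGIN_DIR/train.source \
      -train_tgt $ORIGIN_DIR/train.target \
      -valid_src $ORIGIN_DIR/valid.source \
      -valid_tgt $ORIGIN_DIR/valid.target \
      -dynamic_dict \
      -share_vocab \
      -save_data $DATA_DIR/processed_copy

 python3 train.py \
      -data $DATA_DIR/processed_copy \
      -copy_attn \
      -global_attention mlp \
      -encoder_type brnn \
      -rnn_size 300 \
      -gpuid 0 \
      -save_model $MODEL_DIR/copy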

I would love to try some of the other methods, particularly DRGN. I think minimum risk training will be slow, but we could try adding it.

There is also a decoding trick people sometimes try that we could add. It gives a boost when you copy unique bigrams from the source, and I think it would help here. We can just add that to decoding (it would be a global scorer similar to the GNMT scorer we have).

playma commented 6 years ago

@helson73 Thank you. I will try it.

playma commented 6 years ago

@srush I will try it. Using characters always beats using words in Chinese summarization (except for CopyNet).

I am interested in minimum risk training. Although it doesn't solve the practical problem, it is effective for competing on benchmark performance.

Thank you for building this wonderful tool; it helps me a lot. If I have any good findings, I will publish them.

srush commented 6 years ago

Sounds great. A lot of my students are interested in summarization. So happy to see these benchmarks being added.

playma commented 6 years ago

@srush I also have some questions. Does OpenNMT-py provide momentum or other good adaptive learning-rate methods?

I think the learning_rate_decay schedule alone is not suitable here.

da03 commented 6 years ago

In addition to SGD, we have also implemented Adagrad, Adadelta, and Adam (http://opennmt.net/OpenNMT-py/options/train.html#optimization-type). It should be easy to add momentum by modifying https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Optim.py#L59, but my personal experience is that SGD with a proper learning rate decay gives the best result.
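
For example, switching the optimizer is just a matter of the train.py flags (a rough sketch; the Adam learning rate here is only an illustrative value, not a tuned one):

 python3 train.py \
      -data $DATA_DIR/processed \
      -encoder_type brnn \
      -rnn_size 300 \
      -optim adam \
      -learning_rate 0.001 \
      -gpuid 0 \
      -save_model $MODEL_DIR/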

playma commented 6 years ago

@da03 Thank you! My problem is that the validation loss does not decrease at epoch 3, so learning rate decay starts at epoch 4 and training becomes very slow.

Maybe I should modify the -learning_rate_decay value?

(Screenshot of the training/validation loss log.)
da03 commented 6 years ago

Hmm, I think that's fine. You might try a smaller initial learning rate to avoid this issue, but it shouldn't affect the final performance too much.
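
For example (a sketch only; the values are illustrative, and -learning_rate_decay / -start_decay_at are assumed to behave as described in that version's option docs), you could lower the starting rate and soften or delay the decay:

 python3 train.py \
      -data $DATA_DIR/processed \
      -encoder_type brnn \
      -rnn_size 500 \
      -learning_rate 0.5 \
      -learning_rate_decay 0.7 \
      -start_decay_at 10 \
      -gpuid 0 \
      -save_model $MODEL_DIR/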

srush commented 6 years ago

Feel free to change this line if you don't want it to decay too early:

https://github.com/OpenNMT/OpenNMT-py/blob/master/onmt/Optim.py#L104

We can add other options if you think they make sense.

srush commented 6 years ago

We could also have a "start_decay_min_epoch" option

playma commented 6 years ago

@da03 SGD with learning rate decay beats Adagrad and gives the best result?

da03 commented 6 years ago

That's my experience in terms of final performance (maybe the Ada* methods converge faster?).

sebastianGehrmann commented 6 years ago

Adagrad can get better results than SGD on summarization with proper hyperparameters. You need to set adagrad_accumulator_init to ~0.1 and the learning rate to ~0.15 on the CNN/DM corpus; I assume similar parameters could work for the Chinese corpora.

pltrdy commented 6 years ago

Do you have a comparison against an extractive baseline?

playma commented 6 years ago

@pltrdy No, I don't 😢

playma commented 6 years ago

I used a different optimizer and got a better result on LCSTS: -optim adagrad -adagrad_accumulator_init 0.1 -learning_rate 0.15

LCSTS dataset
ROUGE-1 / ROUGE-2 / ROUGE-L
35.67 / 23.06 / 33.14
srush commented 6 years ago

Hi @playma, can we post your model here? http://opennmt.net/Models-py/

We are putting up baselines for a bunch of tasks.

playma commented 6 years ago

@srush Sure, I am glad to share it.

srush commented 6 years ago

Fantastic. Can you email @da03 a link (maybe Google Drive) and he'll post it?

playma commented 6 years ago

This model was not trained on the latest version of OpenNMT-py. Should I train a model on the latest version?

da03 commented 6 years ago

Hmm, the old version is fine. Can you send it to dengyuntian@g.harvard.edu when you have a chance? I can check whether it works with the current version. Thanks!

da03 commented 6 years ago

@playma

playma commented 6 years ago

@da03 I just sent it to you.

5118Python commented 5 years ago

@playma Do I need to do word segmentation? Which of the following formats should the input file use: "多家 在线 医疗 平台 近日 收到 苹果公司 要求 , 应用 内 购买 需要 使用 IAP 服务 , 并 缴纳 30% 的 交易所 得 。" (word-segmented) or "多家在线医疗平台近日收到苹果公司要求，应用内购买需要使用IAP服务，并缴纳30%的交易所得。" (unsegmented)?

Using the model you released, is the command line below correct? My generated output keeps coming out wrong; summ.pt is the model you shared. python translate.py -model summ.pt -src input.jieba.txt -output output.txt -verbose -batch_size 1 -replace_unk -beam_size 10

playma commented 5 years ago

@5118Python The input should look like this: 多 家 在 线 医 疗 平 ... Characters are separated by spaces.
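
For example, a word-segmented file such as input.jieba.txt can be converted to character-level input with a small one-off script like this (a sketch; the output filename input.char.txt is just an example):

 python3 -c "
import sys
for line in sys.stdin:
    # drop the existing word-boundary spaces, then put a space between every character
    print(' '.join(line.strip().replace(' ', '')))
" < input.jieba.txt > input.char.txt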

Jiramew commented 5 years ago

Hello @playma, thanks for the model. Have you updated the model using the Transformer on LCSTS? I tried that, but the performance was not good. By the way, can you share the model trained on the Gigaword Chinese dataset?

Thanks so much.