facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

summarization help #154

Closed bittik closed 6 years ago

bittik commented 6 years ago

How do I run summarization using fairseq, and in which format is the data required?

huihuifan commented 6 years ago

For summarization, you should follow the IWSLT data preprocessing and training instructions, but with a summarization dataset.
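
For reference, a minimal sketch of the data layout fairseq-preprocess expects for this setup: parallel plain-text files, with one article per line on the source side and the aligned summary on the same line number on the target side. The data/ directory and the src/trg extensions are placeholder names, matching the commands later in this thread.

data/
  train.src  train.trg   # article / summary pairs, aligned by line number
  valid.src  valid.trg
  test.src   test.trg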

lexmen318 commented 5 years ago

Can you tell me which kind of dataset I could choose for text summarization? I found that the IWSLT data preprocessing generates a LanguagePairDataset, so I doubt whether this kind of dataset is suitable for a summarization task. Thanks.

huihuifan commented 5 years ago

Commonly used datasets include Gigaword and CNN/DailyMail.

lexmen318 commented 5 years ago

Sorry, maybe I was not clear.

Following the advice above, I traced the IWSLT data preprocessing and training instructions and found that fairseq uses the "translation" task and LanguagePairDataset. So I am confused: if I use another dataset for text summarization, will it still use the "translation" task and LanguagePairDataset?

And another question: should I implement a new tokenizer to support Chinese text summarization? (I guess it is necessary.) Thanks a lot!

tagucci commented 5 years ago

Yes, "translation" is the way you can train summarization models on fairseq. In summarization, your model is trained to "translate" src (article) to trg (summary). The process of preprocess and train is same as this examples/translation/README.md.

On the CLI, the process looks like the commands below.

# preprocess
$ fairseq-preprocess --source-lang src --target-lang trg \
  --trainpref data/train --validpref data/valid --testpref data/test \
  --destdir data-bin/summarization-dataset

# train
$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/summarization-dataset \
  -a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s src -t trg \
  --label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
  --min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
  --criterion label_smoothed_cross_entropy --max-update 50000 \
  --warmup-updates 4000 --warmup-init-lr '1e-07' \
  --adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer

Before preprocessing, you should prepare a "tokenized" dataset with jieba, as below.

这是 一个 测试 。
...
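
A hedged sketch of that segmentation step, assuming the raw (untokenized) text sits in placeholder files such as data/raw.src and data/raw.trg, and using jieba's command-line interface; the exact flags may differ by jieba version:

# segment raw Chinese text into space-separated tokens
$ pip install jieba
$ python -m jieba -d ' ' data/raw.src > data/train.src
$ python -m jieba -d ' ' data/raw.trg > data/train.trg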

However, you should consider BPE or SentencePiece to tokenize the sentences, because a standard tokenizer produces a large vocabulary.
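
As an illustration of the subword option, a hedged sketch with the SentencePiece command-line tools; the joint model, vocabulary size, and file names below are placeholder choices, not values from this thread:

# learn a joint BPE model on the training articles and summaries
$ spm_train --input=data/train.src,data/train.trg \
    --model_prefix=summ_bpe --vocab_size=32000 --model_type=bpe

# apply it to every split before running fairseq-preprocess
$ spm_encode --model=summ_bpe.model < data/train.src > data/train.bpe.src
$ spm_encode --model=summ_bpe.model < data/train.trg > data/train.bpe.trg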

lexmen318 commented 5 years ago

Thank you very much for replying over the weekend.

Do I need to use beam search?


tagucci commented 5 years ago

Yes, beam search is better than greedy search in most cases.
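
For completeness, a hedged sketch of decoding with beam search via fairseq-generate; the beam size, length penalty, and checkpoint path below are illustrative choices, not values from this thread:

# generate summaries with beam search
$ fairseq-generate data-bin/summarization-dataset \
    --path checkpoints/transformer/checkpoint_best.pt \
    --beam 5 --lenpen 1.0 --batch-size 32 --remove-bpe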

lexmen318 commented 5 years ago

OK. Thank you very much!