For summarization, you should follow the IWSLT data preprocessing and training instructions, but with a summarization dataset.
Can you tell me which kind of dataset I could choose for text summarization? I found that the IWSLT data preprocessing generates a LanguagePairDataset, so I am not sure whether this kind of dataset is suitable for a summarization task. Thx
Commonly used datasets include Gigaword and CNN/DailyMail.
Sorry, maybe I was not clear.
Following the advice above, I traced the IWSLT data preprocessing and training instructions and found that fairseq uses the "translation" task and LanguagePairDataset. So I am confused: if I use another dataset for text summarization, will it still use the "translation" task and LanguagePairDataset?
And another question: should I implement a new tokenizer to support Chinese text summarization? (I guess it is necessary.) Thanks a lot!
Yes, "translation" is the way you can train summarization models on fairseq. In summarization, your model is trained to "translate" src (article) to trg (summary). The process of preprocess and train is same as this examples/translation/README.md.
In CLI command, the process is like blow.
# preprocess
$ fairseq-preprocess --source-lang src --target-lang trg \
--trainpref data/train --validpref data/valid --testpref data/test \
--destdir data-bin/summarization-dataset
# train
$ CUDA_VISIBLE_DEVICES=0 fairseq-train data-bin/summarization-dataset \
-a transformer_iwslt_de_en --optimizer adam --lr 0.0005 -s src -t trg \
--label-smoothing 0.1 --dropout 0.3 --max-tokens 4000 \
--min-lr '1e-09' --lr-scheduler inverse_sqrt --weight-decay 0.0001 \
--criterion label_smoothed_cross_entropy --max-update 50000 \
--warmup-updates 4000 --warmup-init-lr '1e-07' \
--adam-betas '(0.9, 0.98)' --save-dir checkpoints/transformer
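For reference, fairseq-preprocess with the prefixes above expects plain-text parallel files, one example per line, where line N of the .src file is an article and line N of the .trg file is its summary. With the --trainpref/--validpref/--testpref values used above, the expected layout would be (file names follow from those prefixes):
data/
  train.src  train.trg   # tokenized articles / summaries for training
  valid.src  valid.trg   # validation split
  test.src   test.trg    # test split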
Before preprocessing, you should prepare a "tokenized" dataset with jieba, for example:
这是 一个 测试 。
...
However, you should consider BPE or SentencePiece to tokenize sentences, because a standard word-level tokenizer leads to a large vocabulary.
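For example, a minimal sketch of both options, assuming the jieba and sentencepiece command-line tools are installed (the raw/ and data/ paths and the zh_bpe prefix are placeholders):
# word-level tokenization with jieba (space-delimited output)
$ python -m jieba -d ' ' raw/train.src > data/train.src
# or: learn a subword model with SentencePiece and apply it
$ spm_train --input=raw/train.src --model_prefix=zh_bpe \
  --vocab_size=32000 --model_type=bpe
$ spm_encode --model=zh_bpe.model --output_format=piece \
  < raw/train.src > data/train.src
Repeat the same tokenization for the .trg files and the valid/test splits so source and target share consistent preprocessing.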
Thank you very much for replying on the weekend.
Do I need to do beam search?
Yes, beam search is better than greedy search in most cases.
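As a sketch, once training finishes you can generate summaries with beam search using fairseq-generate; the checkpoint path below assumes the --save-dir used in the training command above:
# generate summaries with beam search (beam size 5)
$ fairseq-generate data-bin/summarization-dataset \
  --path checkpoints/transformer/checkpoint_best.pt \
  -s src -t trg --beam 5 --batch-size 32 --remove-bpe
Drop --remove-bpe if you did not apply BPE/SentencePiece during preprocessing.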
OK. Thank you very much !
How do I run summarization using fairseq, and in which format is the data required?