facebookresearch / fairseq-lua

Facebook AI Research Sequence-to-Sequence Toolkit

How to use this for summarization? #16

Closed. shahbazsyed closed this issue 6 years ago.

shahbazsyed commented 7 years ago

Hi, I am trying to test this model on a summarization task (en->en). When I preprocess my articles and summaries using fairseq preprocess, I get the following error:

[screenshot of the preprocessing error]

The command I use for preprocessing is

fairseq preprocess -sourcelang articles -targetlang summaries -trainpref train -validpref valid -testpref test -destdir data-bin/summarize

I have the following tokenized files in my directory: train.articles, train.summaries, valid.articles, valid.summaries, test.articles, test.summaries, each containing one sentence per line.

Can someone kindly let me know what I am missing here?

jgehring commented 7 years ago

Hi, are you using Lua 5.2? This looks like the following issue in tds: https://github.com/torch/tds/issues/25. As there is no fix for this yet, I suggest switching to luajit for the time being.
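For reference, a minimal sketch of how to check which interpreter is active and rebuild Torch against LuaJIT. The ~/torch path, clean.sh/install.sh scripts, and the TORCH_LUA_VERSION variable refer to the standard torch/distro setup and are assumptions here, not fairseq-specific steps:

```sh
# Check which interpreter is on the PATH (LuaJIT vs. plain Lua 5.2).
which luajit lua
luajit -v || echo "LuaJIT not found"

# Rebuild the Torch distribution against LuaJIT (its default interpreter).
# Paths assume the standard ~/torch checkout of torch/distro.
cd ~/torch
./clean.sh
./install.sh   # builds LuaJIT unless TORCH_LUA_VERSION=LUA52 was exported
```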

shahbazsyed commented 7 years ago

@jgehring I switched back to luajit and I don't get this problem anymore. However, the command is not able to find the files for -trainpref. Is the trainpref argument a path to the train folder containing the two files train.articles and train.summaries? What is the purpose of this prefix?

michaelauli commented 7 years ago

-{train,valid,test}pref are prefixes of the files which end with the arguments of -sourcelang and -targetlang. For example, -trainpref /home/$USER/data/train -sourcelang articles -targetlang summaries will look for the two files /home/$USER/data/train.articles and /home/$USER/data/train.summaries.
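To make that concrete, a minimal sketch of the expected layout and the matching command for the files described in this thread (the /home/$USER/data path and the destdir are placeholders):

```sh
# One tokenized sentence per line in each file:
#   /home/$USER/data/train.articles   /home/$USER/data/train.summaries
#   /home/$USER/data/valid.articles   /home/$USER/data/valid.summaries
#   /home/$USER/data/test.articles    /home/$USER/data/test.summaries

fairseq preprocess \
  -sourcelang articles -targetlang summaries \
  -trainpref /home/$USER/data/train \
  -validpref /home/$USER/data/valid \
  -testpref /home/$USER/data/test \
  -destdir data-bin/summarize
```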

loretoparisi commented 7 years ago

@jgehring Would it be possible to provide a text summarization example in the README, starting from the provided dataset? Thanks.

michaelauli commented 7 years ago

You can pre-process abstractive summarization data in the same way as machine translation data. Just follow the steps for building the IWSLT example model in the README (https://github.com/facebookresearch/fairseq#training-a-new-model).
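As a rough sketch, the training step from that README example adapted to the binarized summarization data; the hyperparameters below are copied from the IWSLT example as-is and are assumptions, not values tuned for summarization:

```sh
# Train a convolutional model on the binarized article/summary data.
# Flags follow the README's IWSLT example; adjust model size, dropout and
# learning rate for your own corpus.
fairseq train \
  -sourcelang articles -targetlang summaries \
  -datadir data-bin/summarize \
  -model fconv -nenclayer 4 -nlayer 3 -dropout 0.2 \
  -optim nag -lr 0.25 -clip 0.1 -momentum 0.99 -timeavg -bptt 0 \
  -savedir trainings/summarize
```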

loretoparisi commented 7 years ago

@michaelauli Thank you. In that case I would have something like -sourcelang en -targetlang en, but there is no example dataset for text summarization (like the Gigaword, Daily Mail, or CNN datasets) available at this point to run a working example, right?

michaelauli commented 7 years ago

Sure there is, see the data provided by https://github.com/facebookarchive/NAMAS

-sourcelang and -targetlang refer to file extensions; see my comment above (https://github.com/facebookresearch/fairseq/issues/16#issuecomment-301140815).
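For example, if the Gigaword preprocessing yields files named train.article, train.title, and so on (names assumed here purely for illustration), the extensions take the place of the language codes:

```sh
# Hypothetical file names for Gigaword-style data; substitute whatever the
# NAMAS preprocessing actually produces:
#   train.article  train.title  valid.article  valid.title  test.article  test.title
fairseq preprocess \
  -sourcelang article -targetlang title \
  -trainpref train -validpref valid -testpref test \
  -destdir data-bin/gigaword
```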

loretoparisi commented 7 years ago

@michaelauli So, to be clear, in the case of the Neural Attention Model for Abstractive Summarization, it was trained with the

LDC2012T21 Annotated English Gigaword

dataset. So to reach a comparable BLEU score, Gigaword should be used, I guess. Why is there no pre-trained model for this (e.g., because of licensing issues)? Thank you very much for your help.

michaelauli commented 7 years ago

Yes, following the pre-processing in the NAMAS GitHub project.

jgehring commented 6 years ago

Closing due to inactivity; please re-open if necessary.