abisee / pointer-generator

Code for the ACL 2017 paper "Get To The Point: Summarization with Pointer-Generator Networks"

Tried running it on random internet news articles. Results look more extractive than abstractive? #49

Open anubhavmax opened 6 years ago

anubhavmax commented 6 years ago

Hi Abigail... I was trying to run the code using the pretrained model that was uploaded, since I do not have a powerful enough machine to train. I believe the vocab size is set to 50,000 in the code. After running it on multiple news articles from the internet, I found the results to be extractive. I didn't really encounter any case where a new word was generated for the summary. Am I missing something in the settings? Could you please let me know where the gap in my understanding lies?

abisee commented 6 years ago

Hi @anubhavmax, the same question has been asked here.

Yes - the pointer-generator model produces mostly extractive summaries. This is discussed in section 7.2 of the paper. It is the main area for future work!

Sharathnasa commented 6 years ago

@anubhavmax Hi, how did you manage to run it on your own data? Could you please shed some light?

Thanks, Sharath

alkanen commented 6 years ago

@Sharathnasa, you need to run the text through the Stanford tokenizer Java program first in order to create a token list file to feed to the network.

Basically, on Linux, you run:

    cat normal_text.txt | java edu.stanford.nlp.process.PTBTokenizer -preserveLines

It will print a tokenized version of the text, which you need to save to a new file. That file is then fed into the pointer-generator network with the "--data_path=" argument and "--mode=decode".
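
For reference, here is a minimal sketch of the same step driven from Python rather than the shell (the tokenize_file name and file paths are illustrative; it assumes the Stanford CoreNLP jar is already on your Java classpath):

    import subprocess

    def tokenize_file(in_path, out_path):
        """Run the Stanford PTBTokenizer over in_path and save the output to out_path."""
        with open(in_path) as fin, open(out_path, 'w') as fout:
            subprocess.run(
                ['java', 'edu.stanford.nlp.process.PTBTokenizer', '-preserveLines'],
                stdin=fin, stdout=fout, check=True)

    tokenize_file('normal_text.txt', 'tokenized_text.txt')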

Sharathnasa commented 6 years ago

@alkanen Thanks a lot, man! I will give it a try. By "text", do you mean that if I only pass the entire article without an abstract, it will work fine?

Or should I process it into .bin and vocab files as explained in the cnn-dailymail repo? And one more thing: how is the one-to-one mapping between URLs and stories done? If I need to do that myself, how should I proceed?

alkanen commented 6 years ago

@Sharathnasa Text as in the entire article without an abstract, yes. That will create a bin file with a single article in it. Use the vocab file you already have from the CNN training set; it doesn't make much sense to create a new one from a single article, and unless I misremember, doing so would also break everything, because the network was trained on a particular vocab and that one needs to be used.
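
To see why the vocab must be reused, here is a rough sketch (paraphrasing data.py in this repo from memory, so treat the details as an assumption) of how the vocab file is read: one "word count" pair per line, truncated at the first --vocab_size entries, with special tokens prepended. The checkpoint's embedding matrix has shape [vocab_size, emb_dim], so decoding with a different vocab file or size triggers a shape mismatch when the checkpoint is restored:

    def read_vocab(vocab_path, max_size=50000):
        """Build the word -> id mapping roughly the way data.py does."""
        words = ['[UNK]', '[PAD]', '[START]', '[STOP]']  # special tokens get the first ids
        with open(vocab_path) as f:
            for line in f:
                words.append(line.split()[0])  # each line is "word count"
                if len(words) >= max_size:
                    break
        return {w: i for i, w in enumerate(words)}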

I'm afraid I never looked into the URL/stories mapping since that wasn't relevant for the work I did, so I can't help you there.

Sharathnasa commented 6 years ago

@alkanen Thanks once again, man. When I try to run it as you mentioned, I get the error below:

    vi womendriver.text | java edu.stanford.nlp.process.PTBTokenizer -preserveLines
    Vim: Warning: Output is not to a terminal
    Untokenizable: (U+1B, decimal: 27)

Would you please pass on the script if you have one?

alkanen commented 6 years ago

@Sharathnasa You can't pipe vi into java; use cat to pipe the contents of the text file into java.

Sharathnasa commented 6 years ago

@alkanen OK, my bad. Thanks once again. After performing tokenization (and saving the output), do I need to run the make_datafiles.py code to generate .bin files?

alkanen commented 6 years ago

Nope. Just use the old vocab file from training, plus the file created by tokenization, as input to the model:

    python pointer-generator/run_summarization.py \
        --log_root=<some path with trained models in it> \
        --exp_name=<the name of your trained model> \
        --vocab_path=<your old vocab file> \
        --mode=decode \
        --data_path=<the file generated by the tokenizer>

Sharathnasa commented 6 years ago

@alkanen Did you take a look at this? https://github.com/abisee/pointer-generator/issues/51

alkanen commented 6 years ago

No. Is there anything in particular there you think I should be aware of?

I never had the need to summarize multiple texts at once, so I haven't looked into that use case at all.

Sharathnasa commented 6 years ago

@alkanen Nothing in particular; I just wanted to let you know about the command he suggested running.

Two more queries I have:

  1. The repo says the input should be in the form of .bin files, but the file we produced by tokenization is not in .bin format. Will the network still run?
  2. Does what you suggested work only for a single article?

Sharathnasa commented 6 years ago

Hi @alkanen, when I run the command below:

    python3 pointer-generator/run_summarization.py \
        --mode=decode \
        --data_path=/Users/setup/text_abstraction/cnn-dailymail/finishedfiles/chunked/train* \
        --vocab_path=/Users/setup/text_abstraction/finished_files/vocab \
        --log_root=/Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train \
        --exp_name="model-238410.data-00000-of-00001" \
        --coverage=1 --single_pass=1 \
        --max_enc_steps=500 --max_dec_steps=200 --min_dec_steps=100

I get the logs below; the same line just repeats every 10 seconds:

    INFO:tensorflow:Failed to load checkpoint from /Users/setup/text_abstraction/pointer-generator/pretrained_model_tf1.2.1/train/model-238410.data-00000-of-00001/train. Sleeping for 10 secs...

Where did it go wrong?

dondon2475848 commented 6 years ago

Hi @Sharathnasa, you can clone the repository below:

https://github.com/dondon2475848/make_datafiles_for_pgn

Then run:

python make_datafiles.py  ./stories  ./output

It processes your test data into the binary format the model expects.
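
For reference, here is a minimal sketch of the binary format that script produces, modeled on make_datafiles.py from the cnn-dailymail repo (write_example is an illustrative name, not a function in either repo): each example is a serialized tf.Example with article and abstract features, prefixed with an 8-byte length.

    import struct
    from tensorflow.core.example import example_pb2

    def write_example(article, abstract_sentences, out_file):
        """Append one length-prefixed tf.Example record to an open binary file."""
        ex = example_pb2.Example()
        ex.features.feature['article'].bytes_list.value.extend([article.encode()])
        # Abstract sentences are wrapped in <s> ... </s> tags and joined.
        abstract = ' '.join('<s> %s </s>' % s for s in abstract_sentences)
        ex.features.feature['abstract'].bytes_list.value.extend([abstract.encode()])
        serialized = ex.SerializeToString()
        out_file.write(struct.pack('q', len(serialized)))  # 8-byte length prefix
        out_file.write(struct.pack('%ds' % len(serialized), serialized))

    with open('output/test.bin', 'wb') as f:
        write_example('tokenized article text ...', ['summary sentence one .'], f)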

glalwani2 commented 6 years ago

@dondon2475848 I tried your repo with a sample .txt file under the stories folder, and the .bin files didn't get created; only the tokenized file did. I am not sure why.

dondon2475848 commented 6 years ago

Did you put xxx.txt under the stories folder? Maybe you can try xxx.story, formatted like below:

test1.story

MOSCOW, Russia (CNN) -- Russian space officials say the crew of the Soyuz space ship is resting after a rough ride back to Earth.

A South Korean bioengineer was one of three people on board the Soyuz capsule.

The craft carrying South Korea's first astronaut landed in northern Kazakhstan on Saturday, 260 miles (418 kilometers) off its mark, they said.

Mission Control spokesman Valery Lyndin said the condition of the crew -- South Korean bioengineer Yi So-yeon, American astronaut Peggy Whitson and Russian flight engineer Yuri Malenchenko -- was satisfactory, though the three had been subjected to severe G-forces during the re-entry.

Search helicopters took 25 minutes to find the capsule and determine that the crew was unharmed.

Officials said the craft followed a very steep trajectory that subjects the crew to gravitational forces of up to 10 times those on Earth.

Interfax reported that the spacecraft's landing was rough.

This is not the first time a spacecraft veered from its planned trajectory during landing.

In October, the Soyuz capsule landed 70 kilometers from the planned area because of a damaged control cable. The capsule was carrying two Russian cosmonauts and the first Malaysian astronaut. E-mail to a friend

@highlight

Soyuz capsule lands hundreds of kilometers off-target

@highlight

Capsule was carrying South Korea's first astronaut

@highlight

Landing is second time Soyuz capsule has gone awry
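
For reference, a rough sketch of how make_datafiles.py in the cnn-dailymail repo splits a .story file like the one above into the article text and its highlights (everything after the first @highlight marker is treated as summary material):

    def get_art_abs(story_path):
        """Split a .story file into (article text, list of highlight sentences)."""
        article_lines, highlights = [], []
        next_is_highlight = False
        with open(story_path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                if line.startswith('@highlight'):
                    next_is_highlight = True
                elif next_is_highlight:
                    highlights.append(line)
                else:
                    article_lines.append(line)
        return ' '.join(article_lines), highlights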

victorherbemontagne commented 6 years ago

@Sharathnasa I don't know if you still have this issue, but I think I figured it out. I had the same issue with the traceback. Are you running TensorFlow 1.5? You can check my repo: I forked @becxer's Python 3 port of the code and modified it for TensorFlow 1.5 (still loading the TF 1.2.1 model presented in @abisee's repo). It was not much work; TF 1.5 has really bad support for tf.flags, so I modified the code to make it work.

If you look at your error, go to util.py and print the exception in the load_checkpoint() function. For me it came from the fact that 4 words in vocab_meta.tsv were not added to the vocab, so I had a shape issue. I made a small correction in the code to format the words in question and add them to the vocab, and it worked like a charm. You can check my code and tell me if there is a bug or anything; I will work it out!
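
For anyone hitting the same silent retry loop: the checkpoint loader in this repo's util.py catches all exceptions with a bare except: and only logs "Failed to load checkpoint", so the real cause (a bad path, or a vocab/embedding shape mismatch like the one described above) never surfaces. Here is a sketch of the debugging change, adapted from util.py (here log_root is passed in explicitly instead of being read from the FLAGS global):

    import os
    import time
    import tensorflow as tf

    def load_ckpt(saver, sess, log_root, ckpt_dir='train'):
        """Restore the latest checkpoint, retrying every 10s, but log why it failed."""
        while True:
            try:
                ckpt_state = tf.train.get_checkpoint_state(os.path.join(log_root, ckpt_dir))
                tf.logging.info('Loading checkpoint %s', ckpt_state.model_checkpoint_path)
                saver.restore(sess, ckpt_state.model_checkpoint_path)
                return ckpt_state.model_checkpoint_path
            except Exception as e:
                # The original bare `except:` hides this; log the underlying error.
                tf.logging.info('Failed to load checkpoint from %s: %r. Sleeping for 10 secs...',
                                os.path.join(log_root, ckpt_dir), e)
                time.sleep(10)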