facebookresearch / UnsupervisedQA

Unsupervised Question answering via Cloze Translation

Request help #4

Closed: lx385095967 closed this issue 5 years ago

lx385095967 commented 5 years ago

When I run the sample code, the generated "question_text" comes out as a bunch of garbled tokens:

question_text='Aires Aires scoreline 璟kindergarscoreline Gemeinscoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline headers bbs scoreline headers 灘scoreline 灘scoreline 璟bim溾 brighter headers bbs Persons Persons Persons Persons Aires scoreline headers neighbourAires 溾 electric erian neighbourAires Persons Persons Persons neighbourAires Persons Persons Persons Persons Aires Persons Persons Persons Persons Persons neighbourAires Persons neighbourAires brighter brighter Persons Persons Persons Persons Persons Persons neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires Persons Persons Persons Persons neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires brighter neighbourAires neighbourAires erian erian erian erian erian erian erian erian erian erian erian neighbourAires headers Painting Mocscoreline Removal headers Painting Persons Painting gramPersons Persons Persons Persons Persons Persons TomneighbourAires erian neighbourAires erian erian erian erian erian erian erian erian erian erian neighbourAires 溾 Painting neighbourAires headers neighbourAires headers neighbourAires headers Removal headers neighbourAires brighter ayo Painting butter erian erian erian erian erian erian erian erian erian erian erian erian erian erian erian erian'

patrick-s-h-lewis commented 5 years ago

Hi lx385095967

I'll need some more info. Could you let me know exactly what command you ran to get that generated output, and what text you ran the program on? Please also paste the stdout.

Thanks, Patrick

lx385095967 commented 5 years ago

Thanks for your reply. The input file is example_input.txt, and the command is:

python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_format "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic 

I wanted to inspect the output of this function first:

# translate clozes to questions
clozes_with_questions = get_questions_for_clozes(
    clozes,
    args.use_subclause_clozes,
    args.use_named_entity_clozes,
    args.use_wh_heuristic,
    args.translation_method
)

So i got this: [Cloze(cloze_id='65198e9ef907d4d60d150370784e364f3b8d609a_0', paragraph=Paragraph(paragraph_id='1a07f8499d6d166504205af8e090e99391d43fa4', text='Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champions Denver Broncos defeated the National Football Conference (NFC) champions Carolina Panthers, 24–10. The game was played on February 7, 2016, at Levi\'s Stadium in Santa Clara, California, in the Bay Area. As this was the 50th Super Bowl game, the league emphasized the "golden anniversary" with various gold-themed initiatives during the 2015 season, as well as suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so the logo could prominently feature the Arabic numerals 50.'), source_text='to determine the champion of the National Football League (NFL) for the 2015 season', source_start=44, cloze_text='to determine the champion of IDENTITYMASK (NFL) for the 2015 season', answer_text='the National Football League', answer_start=29, constituency_parse=None, root_label=None, answer_type='ORG', question_text='FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC killed FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed killed killed killed FCC killed FCC killed FCC killed FCC killed FCC killed FCC killed killed killed killed FCC killed FCC killed FCC killed FCC killed killed killed FCC killed FCC 
killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed FCC killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed killed'),.....................

patrick-s-h-lewis commented 5 years ago

Can you paste in what gets printed to stdout?

lx385095967 commented 5 years ago

There is no other output; this is everything that gets printed:

==================================================
Dumping results
==================================================
Exported 0 instances to example_output.unsupervised_qa.jsonl
Exported 0 instances to example_output.squad.json
==================================================
Complete
==================================================
patrick-s-h-lewis commented 5 years ago

Hi, please confirm for me that you are running

$ python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_format "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic 

from the shell in the home directory of this repo.

There should be much more printed than what you just pasted, e.g. for me I get:

$ python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output  \
    --input_file_format "txt" \
    --output_file_format "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic 

==================================================
Parsed 4 paragraphs from example_input.txt
==================================================
Running Constituency Parsing: 100%|███████████████| 3/3 [00:02<00:00,  1.24it/s]
==================================================
37 Cloze questions extracted for Translation
==================================================
Tokenizer Version 1.1
Language: en
Number of threads: 16
Loading vocabulary from ./data/subclause_ne_wh_heuristic/vocab.cloze-question.60000 ...
Read 170347158 words (68672 unique) from vocabulary file.
Loading codes from ./data/subclause_ne_wh_heuristic/bpe_codes ...
Read 60000 codes from the codes file.
Loading vocabulary from /tmp/tmpuefcp91l/dev.cloze.tok ...
Read 539 words (96 unique) from text file.
Applying BPE to /tmp/tmpuefcp91l/dev.cloze.tok ...
Modified 539 words from text file.
INFO - 08/19/19 08:06:26 - 0:00:00 - Read 68686 words from the vocabulary file.

Saving the data to /tmp/tmpuefcp91l/dev.cloze.tok.bpe.pth ...
INFO - 08/19/19 08:06:26 - 0:00:00 - 546 words (68686 unique) in 37 sentences.
INFO - 08/19/19 08:06:26 - 0:00:00 - 3 unknown words (1 unique), covering 0.55% of the data.
INFO - 08/19/19 08:06:26 - 0:00:00 - '@@: 3
==================================================
Dumping results
==================================================
Exported 37 instances to example_output.unsupervised_qa.jsonl
Exported 37 instances to example_output.squad.json
==================================================
Complete
==================================================
[INFO/MainProcess] process shutting down

If the "Parsed {N} paragraphs from {input_file}" line isn't being printed, it suggests you're trying to run this code interactively, which makes it hard to diagnose your problem. What happens when you just run the normal command?

lx385095967 commented 5 years ago

Sorry to bother you again. Could the problem be in how the model is loaded? It happens when running:

trainer.reload_checkpoint()

# initialize trainer / reload checkpoint / initialize evaluator
trainer = TrainerMT(encoder, decoder, discriminator, lm, data, params)  # <src.trainer.TrainerMT object at 0x7efc52c9fa90>

trainer.reload_checkpoint()

trainer.test_sharing()  # check parameters sharing
evaluator = EvaluatorMT(trainer, data, params)

The 'checkpoint.pth' file doesn't exist, so reload_checkpoint() returns without loading anything:

def reload_checkpoint(self):
        """
        Reload a checkpoint if we find one.
        """
        # reload checkpoint
        checkpoint_path = os.path.join(self.params.dump_path, 'checkpoint.pth')
        if not os.path.isfile(checkpoint_path):
            return
        logger.warning('Reloading checkpoint from %s ...' % checkpoint_path)
        checkpoint_data = torch.load(checkpoint_path)
        self.encoder = checkpoint_data['encoder']
        self.decoder = checkpoint_data['decoder']
        self.discriminator = checkpoint_data['discriminator']
        self.lm = checkpoint_data['lm']
        self.enc_optimizer = checkpoint_data['enc_optimizer']
        self.dec_optimizer = checkpoint_data['dec_optimizer']
        self.dis_optimizer = checkpoint_data['dis_optimizer']
        self.lm_optimizer = checkpoint_data['lm_optimizer']
        self.epoch = checkpoint_data['epoch']
        self.n_total_iter = checkpoint_data['n_total_iter']
        self.best_metrics = checkpoint_data['best_metrics']
        self.best_stopping_criterion = checkpoint_data['best_stopping_criterion']
        self.model_opt = {
            'enc': (self.encoder, self.enc_optimizer),
            'dec': (self.decoder, self.dec_optimizer),
            'dis': (self.discriminator, self.dis_optimizer),
            'lm': (self.lm, self.lm_optimizer),
        }
        logger.warning('Checkpoint reloaded. Resuming at epoch %i ...' % self.epoch)

Then I changed checkpoint_path so that it reads subclause_ne_wh_heuristic/periodic-20.pth instead:

def reload_checkpoint(self):
        """
        Reload a checkpoint if we find one.
        """
        # reload checkpoint
        # checkpoint_path = os.path.join(self.params.dump_path, 'checkpoint.pth')
        checkpoint_path = self.params.reload_model
        if not os.path.isfile(checkpoint_path):
            return
        logger.warning('Reloading checkpoint from %s ...' % checkpoint_path)
        # checkpoint_data = torch.load(checkpoint_path)
        checkpoint_data = torch.load(checkpoint_path,map_location='cpu')
        self.encoder = checkpoint_data['encoder']
        self.decoder = checkpoint_data['decoder']
        self.discriminator = checkpoint_data['discriminator']
        self.lm = checkpoint_data['lm']
        self.enc_optimizer = checkpoint_data['enc_optimizer']
        self.dec_optimizer = checkpoint_data['dec_optimizer']
        self.dis_optimizer = checkpoint_data['dis_optimizer']
        self.lm_optimizer = checkpoint_data['lm_optimizer']
        self.epoch = checkpoint_data['epoch']
        self.n_total_iter = checkpoint_data['n_total_iter']
        self.best_metrics = checkpoint_data['best_metrics']
        self.best_stopping_criterion = checkpoint_data['best_stopping_criterion']
        self.model_opt = {
            'enc': (self.encoder, self.enc_optimizer),
            'dec': (self.decoder, self.dec_optimizer),
            'dis': (self.discriminator, self.dis_optimizer),
            'lm': (self.lm, self.lm_optimizer),
        }
        logger.warning('Checkpoint reloaded. Resuming at epoch %i ...' % self.epoch)

The error is:

==================================================
Parsed 4 paragraphs from example_input.txt
==================================================
Running Constituency Parsing:   0%|                       | 0/3 [00:00<?, ?it/s]WARNING:allennlp.data.fields.sequence_label_field:Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary.  See documentation for `non_padded_namespaces` parameter in Vocabulary.
Running Constituency Parsing: 100%|███████████████| 3/3 [00:05<00:00,  1.93s/it]
==================================================
37 Cloze questions extracted for Translation
==================================================
Tokenizer Version 1.1
Language: en
Number of threads: 16
Loading vocabulary from /home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/../data/subclause_ne_wh_heuristic/vocab.cloze-question.60000 ...
Read 170347158 words (68672 unique) from vocabulary file.
Loading codes from /home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/../data/subclause_ne_wh_heuristic/bpe_codes ...
Read 60000 codes from the codes file.
Loading vocabulary from /tmp/tmpzk8abztn/dev.cloze.tok ...
Read 539 words (96 unique) from text file.
Applying BPE to /tmp/tmpzk8abztn/dev.cloze.tok ...
Modified 539 words from text file.
INFO - 08/26/19 10:01:12 - 0:00:00 - Read 68686 words from the vocabulary file.

Saving the data to /tmp/tmpzk8abztn/dev.cloze.tok.bpe.pth ...
INFO - 08/26/19 10:01:12 - 0:00:00 - 546 words (68686 unique) in 37 sentences.
INFO - 08/26/19 10:01:12 - 0:00:00 - 3 unknown words (1 unique), covering 0.55% of the data.
INFO - 08/26/19 10:01:12 - 0:00:00 - '@@: 3
/home/leexin/anaconda3/envs/uqa37/lib/python3.7/site-packages/torch/nn/functional.py:52: UserWarning: size_average and reduce args will be deprecated, please use reduction='elementwise_mean' instead.
  warnings.warn(warning.format(ret))
WARNING:root:Reloading checkpoint from /home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/../data/subclause_ne_wh_heuristic/periodic-20.pth ...
/home/leexin/anaconda3/envs/uqa37/lib/python3.7/site-packages/torch/serialization.py:425: SourceChangeWarning: source code of class 'src.model.transformer.TransformerEncoder' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
/home/leexin/anaconda3/envs/uqa37/lib/python3.7/site-packages/torch/serialization.py:425: SourceChangeWarning: source code of class 'src.model.transformer.TransformerDecoder' has changed. you can retrieve the original source code by accessing the object's source attribute or set `torch.nn.Module.dump_patches = True` and use the patch tool to revert the changes.
  warnings.warn(msg, SourceChangeWarning)
Traceback (most recent call last):
  File "/home/leexin/anaconda3/envs/uqa37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/leexin/anaconda3/envs/uqa37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/generate_synthetic_qa_data.py", line 168, in <module>
    generate_synthetic_training_data(args)#call function
  File "/home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/generate_synthetic_qa_data.py", line 114, in generate_synthetic_training_data
    args.translation_method
  File "/home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/generate_synthetic_qa_data.py", line 75, in get_questions_for_clozes
    clozes, subclause_clozes,  ne_answers,  wh_heuristic)
  File "/home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/unmt_translation.py", line 325, in get_unmt_questions_for_clozes
    checkpoint_path
  File "/home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/unmt_translation.py", line 276, in perform_translation
    trainer.reload_checkpoint()
  File "/home/leexin/Documents/UnsupervisedQA-master/unsupervisedqa/../UnsupervisedMT/NMT/src/trainer.py", line 822, in reload_checkpoint
    self.encoder = checkpoint_data['encoder']
KeyError: 'encoder'
[INFO/MainProcess] process shutting down
patrick-s-h-lewis commented 5 years ago

Hi,

it looks like the system isn't loading the correct checkpoint.pth file, but it does seem to find the BPE codes file and vocab fine, so perhaps something went wrong in your model data download (maybe the download was interrupted, or some data was corrupted). The checkpoint that is being loaded by your model doesn't have the correct keys in it.

Then I changed checkpoint_path so that it reads subclause_ne_wh_heuristic/periodic-20.pth instead

I don't fully understand this part - did you change the paths to the models to load? The path to the checkpoint should be {repo home directory}/data/subclause_ne_wh_heuristic/periodic-20.pth, but this should all be handled for you by the codebase. You shouldn't need to specify these paths yourself; it should work out of the box, which suggests there is an error in your installation of the repository.
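For what it's worth, torch.load on one of these checkpoints just returns a plain Python dict, so the KeyError: 'encoder' in your traceback means the file you loaded was serialized with a different set of keys than reload_checkpoint expects. Here's a stdlib-only sketch of that failure mode (pickle stands in for torch.load, and the keys are invented for illustration):

```python
import io
import pickle

# Simulate a checkpoint file that was saved with an unexpected schema
# (these keys are made up; a real UnsupervisedQA checkpoint should have
# 'encoder', 'decoder', etc.).
buf = io.BytesIO()
pickle.dump({"model": "weights go here", "epoch": 20}, buf)
buf.seek(0)

checkpoint_data = pickle.load(buf)

# First diagnostic step: see what the file actually contains.
print(sorted(checkpoint_data.keys()))  # ['epoch', 'model']

# reload_checkpoint() does the equivalent of this, which is what blows up:
try:
    encoder = checkpoint_data["encoder"]
except KeyError:
    print("KeyError: 'encoder' -- the checkpoint schema doesn't match")
```

In your environment, the equivalent diagnostic is `print(torch.load(checkpoint_path, map_location='cpu').keys())`; if that doesn't show an 'encoder' key, the file on disk isn't the checkpoint the code expects.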

Try deleting and re-downloading the model data, then running the unmodified command again.

Can you also let me know what operating system you are using?
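A quick way to rule out a truncated download is to checksum the file and compare it against a freshly downloaded copy. A minimal helper sketch (not part of this repo; the path in the comment is just an example):

```python
import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# e.g. sha256_of_file("data/subclause_ne_wh_heuristic/periodic-20.pth")
# If the digest (or even just the file size) differs between two download
# attempts, the download is being interrupted or corrupted.
```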

lx385095967 commented 5 years ago

My network status is not very good here, so I downloaded the model manually. I will try your suggestion. The operating system is Ubuntu 18.04.3 LTS.

patrick-s-h-lewis commented 5 years ago

Closing due to lack of activity.