facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Unable to train on my own document corpus #3806

Closed: lukesalamone closed this issue 3 years ago

lukesalamone commented 3 years ago

Bug description

I am following the instructions for indexing my own documents with a FAISS index: https://www.parl.ai/docs/agent_refs/rag.html#generating-your-own-faiss-index

Reproduction steps

I generated the embeddings with the following:

/usr/bin/python generate_dense_embeddings.py -mf zoo:hallucination/bart_rag_turn_dtt/model --dpr-model True --passages-file passages.tsv --outfile embeddings/out --num-shards 1 --shard-id 0 -bs 32
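
(For reference, passages.tsv is assumed here to follow the same tab-separated layout as the default zoo:hallucination/wiki_passages/psgs_w100.tsv, i.e. id, text, and title columns under a header row; a minimal sketch, with literal tab characters between columns:)

id	text	title
1	Madrid is the capital and most populous city of Spain.	Madrid
2	Chicago is the most populous city in Illinois.	Chicago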

Then I indexed the embeddings:

python index_dense_embeddings.py --retriever-embedding-size 768  --embeddings-dir embeddings --embeddings-name embeddings/out

Then I tried to run the model in interactive mode:

parlai interactive -mf zoo:hallucination/bart_rag_turn_dtt/model --path-to-index  embeddings/IVF4096_HNSW128__PQ128.index --path-to-dense-embeddings embeddings/out --path-to-dpr-passages passages.tsv --retriever-embedding-size 768

Expected behavior

I expect to be able to run the model in interactive mode without errors.

Logs

/usr/lib/python3/dist-packages/requests/__init__.py:89: RequestsDependencyWarning: urllib3 (1.26.6) or chardet (3.0.4) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({}) doesn't match a supported "
15:14:17 | Overriding opt["path_to_index"] to /root/ParlAI/parlai/agents/rag/scripts/embeddings/IVF4096_HNSW128__PQ128.index (previously: zoo:hallucination/wiki_index_compressed/compressed_pq)
15:14:17 | Overriding opt["path_to_dense_embeddings"] to /root/ParlAI/parlai/agents/rag/scripts/embeddings/out (previously: None)
15:14:17 | Overriding opt["path_to_dpr_passages"] to /root/ParlAI/parlai/agents/rag/scripts/passages.tsv (previously: zoo:hallucination/wiki_passages/psgs_w100.tsv)
15:14:17 | Using CUDA
15:14:17 | loading dictionary from /root/ParlAI/data/models/hallucination/bart_rag_turn_dtt/model.dict
15:14:17 | num words = 50264
15:14:17 | Rag: full interactive mode on.
15:14:18 | Loading index from /root/ParlAI/parlai/agents/rag/scripts/embeddings/IVF4096_HNSW128__PQ128.index
15:14:18 | Loaded index of type <faiss.swigfaiss.IndexIVFPQ; proxy of <Swig Object of type 'faiss::IndexIVFPQ *' at 0x7f2cf1af97e0> > and size 604517
15:14:18 | Reading data from: /root/ParlAI/parlai/agents/rag/scripts/passages.tsv
0it [00:00, ?it/s]
15:14:18 | Exception: invalid load
15:14:18 | Error in loading csv; loading via readlines
604518it [00:00, 1102563.53it/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
15:14:32 | Total parameters: 515,177,984 (514,784,768 trainable)
15:14:32 | Loading existing model params from /root/ParlAI/data/models/hallucination/bart_rag_turn_dtt/model
15:14:32 | Opt:
15:14:32 |     activation: gelu
15:14:32 |     adafactor_eps: '[1e-30, 0.001]'
15:14:32 |     adam_eps: 1e-08
15:14:32 |     add_p1_after_newln: False
15:14:32 |     allow_missing_init_opts: False
15:14:32 |     attention_dropout: 0.0
15:14:32 |     batchsize: 8
15:14:32 |     beam_block_full_context: True
15:14:32 |     beam_block_list_filename: None
15:14:32 |     beam_block_ngram: 3
15:14:32 |     beam_context_block_ngram: -1
15:14:32 |     beam_delay: 30
15:14:32 |     beam_length_penalty: 0.65
15:14:32 |     beam_min_length: 20
15:14:32 |     beam_size: 3
15:14:32 |     betas: '[0.9, 0.999]'
15:14:32 |     bpe_add_prefix_space: None
15:14:32 |     bpe_debug: False
15:14:32 |     bpe_dropout: None
15:14:32 |     bpe_merge: None
15:14:32 |     bpe_vocab: None
15:14:32 |     candidates: inline
15:14:32 |     cap_num_predictions: 100
15:14:32 |     codes_attention_num_heads: 4
15:14:32 |     codes_attention_type: basic
15:14:32 |     compressed_indexer_factory: None
15:14:32 |     compressed_indexer_gpu_train: False
15:14:32 |     compressed_indexer_nprobe: 10
15:14:32 |     compute_tokenized_bleu: False
15:14:32 |     data_parallel: False
15:14:32 |     datapath: /root/ParlAI/data
15:14:32 |     datatype: train
15:14:32 |     delimiter: '\n'
15:14:32 |     dict_class: parlai.core.dict:DictionaryAgent
15:14:32 |     dict_endtoken: __end__
15:14:32 |     dict_file: /root/ParlAI/data/models/hallucination/bart_rag_turn_dtt/model.dict
15:14:32 |     dict_initpath: None
15:14:32 |     dict_language: english
15:14:32 |     dict_loaded: True
15:14:32 |     dict_lower: False
15:14:32 |     dict_max_ngram_size: -1
15:14:32 |     dict_maxtokens: -1
15:14:32 |     dict_minfreq: 0
15:14:32 |     dict_nulltoken: __null__
15:14:32 |     dict_starttoken: __start__
15:14:32 |     dict_textfields: text,labels
15:14:32 |     dict_tokenizer: gpt2
15:14:32 |     dict_unktoken: __unk__
15:14:32 |     display_add_fields: 
15:14:32 |     display_examples: False
15:14:32 |     display_prettify: False
15:14:32 |     download_path: None
15:14:32 |     dpr_model_file: zoo:hallucination/multiset_dpr/hf_bert_base.cp
15:14:32 |     dpr_num_docs: 0
15:14:32 |     dropout: 0.1
15:14:32 |     dynamic_batching: None
15:14:32 |     embedding_projection: random
15:14:32 |     embedding_size: 1024
15:14:32 |     embedding_type: random
15:14:32 |     embeddings_scale: False
15:14:32 |     encode_candidate_vecs: True
15:14:32 |     encode_candidate_vecs_batchsize: 256
15:14:32 |     eval_candidates: inline
15:14:32 |     ffn_size: 4096
15:14:32 |     fixed_candidate_vecs: reuse
15:14:32 |     fixed_candidates_path: None
15:14:32 |     force_fp16_tokens: True
15:14:32 |     fp16: True
15:14:32 |     fp16_impl: mem_efficient
15:14:32 |     generation_model: bart
15:14:32 |     gold_knowledge_passage_key: checked_sentence
15:14:32 |     gold_knowledge_title_key: title
15:14:32 |     gpu: -1
15:14:32 |     gradient_clip: 0.1
15:14:32 |     hide_labels: False
15:14:32 |     history_add_global_end_token: None
15:14:32 |     history_reversed: False
15:14:32 |     history_size: -1
15:14:32 |     hnsw_ef_construction: 200
15:14:32 |     hnsw_ef_search: 128
15:14:32 |     hnsw_indexer_scalar_quantize: False
15:14:32 |     hnsw_indexer_store_n: 512
15:14:32 |     ignore_bad_candidates: False
15:14:32 |     image_cropsize: 224
15:14:32 |     image_mode: raw
15:14:32 |     image_size: 256
15:14:32 |     indexer_buffer_size: 65536
15:14:32 |     indexer_type: compressed
15:14:32 |     inference: beam
15:14:32 |     init_fairseq_model: None
15:14:32 |     init_model: /private/home/kshuster/ParlAI/data/models/bart/bart_large/model
15:14:32 |     init_opt: None
15:14:32 |     interactive_candidates: fixed
15:14:32 |     interactive_mode: True
15:14:32 |     interactive_task: True
15:14:32 |     invsqrt_lr_decay_gamma: -1
15:14:32 |     is_debug: False
15:14:32 |     label_truncate: 128
15:14:32 |     learn_embeddings: True
15:14:32 |     learn_positional_embeddings: True
15:14:32 |     learningrate: 1e-05
15:14:32 |     local_human_candidates_file: None
15:14:32 |     log_keep_fields: all
15:14:32 |     loglevel: info
15:14:32 |     lr_scheduler: reduceonplateau
15:14:32 |     lr_scheduler_decay: 0.5
15:14:32 |     lr_scheduler_patience: 1
15:14:32 |     max_doc_token_length: 256
15:14:32 |     memory_attention: sqrt
15:14:32 |     min_doc_token_length: 64
15:14:32 |     model: rag
15:14:32 |     model_file: /root/ParlAI/data/models/hallucination/bart_rag_turn_dtt/model
15:14:32 |     model_parallel: True
15:14:32 |     momentum: 0
15:14:32 |     multitask_weights: stochastic
15:14:32 |     n_decoder_layers: 12
15:14:32 |     n_docs: 5
15:14:32 |     n_encoder_layers: 12
15:14:32 |     n_extra_positions: 0
15:14:32 |     n_heads: 16
15:14:32 |     n_layers: 2
15:14:32 |     n_positions: 1024
15:14:32 |     n_segments: 0
15:14:32 |     nesterov: True
15:14:32 |     no_cuda: False
15:14:32 |     normalize_sent_emb: False
15:14:32 |     nus: [0.7]
15:14:32 |     optimizer: mem_eff_adam
15:14:32 |     outfile: 
15:14:32 |     output_conversion_path: None
15:14:32 |     output_scaling: 1.0
15:14:32 |     override: "{'path_to_index': '/root/ParlAI/parlai/agents/rag/scripts/embeddings/IVF4096_HNSW128__PQ128.index', 'path_to_dense_embeddings': '/root/ParlAI/parlai/agents/rag/scripts/embeddings/out', 'path_to_dpr_passages': '/root/ParlAI/parlai/agents/rag/scripts/passages.tsv', 'retriever_embedding_size': 768}"
15:14:32 |     parlai_home: /private/home/kshuster/ParlAI
15:14:32 |     path_to_dense_embeddings: /root/ParlAI/parlai/agents/rag/scripts/embeddings/out
15:14:32 |     path_to_dpr_passages: /root/ParlAI/parlai/agents/rag/scripts/passages.tsv
15:14:32 |     path_to_index: /root/ParlAI/parlai/agents/rag/scripts/embeddings/IVF4096_HNSW128__PQ128.index
15:14:32 |     person_tokens: False
15:14:32 |     poly_attention_num_heads: 4
15:14:32 |     poly_attention_type: basic
15:14:32 |     poly_faiss_model_file: None
15:14:32 |     poly_n_codes: 64
15:14:32 |     poly_score_initial_lambda: 1.0
15:14:32 |     polyencoder_init_model: wikito
15:14:32 |     polyencoder_type: codes
15:14:32 |     print_docs: False
15:14:32 |     query_model: bert
15:14:32 |     rag_model_type: turn
15:14:32 |     rag_query_truncate: None
15:14:32 |     rag_retriever_query: full_history
15:14:32 |     rag_retriever_type: dpr
15:14:32 |     rag_turn_discount_factor: 1.0
15:14:32 |     rag_turn_marginalize: doc_then_turn
15:14:32 |     rag_turn_n_turns: 2
15:14:32 |     rank_candidates: False
15:14:32 |     rank_top_k: -1
15:14:32 |     reduction_type: mean
15:14:32 |     regret: False
15:14:32 |     regret_intermediate_maxlen: 32
15:14:32 |     regret_model_file: None
15:14:32 |     relu_dropout: 0.0
15:14:32 |     repeat_blocking_heuristic: True
15:14:32 |     retriever_debug_index: None
15:14:32 |     retriever_embedding_size: 768
15:14:32 |     return_cand_scores: False
15:14:32 |     save_format: conversations
15:14:32 |     share_encoders: True
15:14:32 |     share_word_embeddings: True
15:14:32 |     single_turn: False
15:14:32 |     skip_generation: False
15:14:32 |     special_tok_lst: None
15:14:32 |     split_lines: False
15:14:32 |     starttime: Mar30_00-08
15:14:32 |     t5_dropout: 0.0
15:14:32 |     t5_generation_config: None
15:14:32 |     t5_model_arch: t5-base
15:14:32 |     t5_model_parallel: False
15:14:32 |     task: wizard_of_wikipedia
15:14:32 |     temperature: 1.0
15:14:32 |     text_truncate: 512
15:14:32 |     tfidf_max_doc_paragraphs: -1
15:14:32 |     tfidf_model_path: None
15:14:32 |     thorough: False
15:14:32 |     topk: 10
15:14:32 |     topp: 0.9
15:14:32 |     train_predict: False
15:14:32 |     truncate: 512
15:14:32 |     update_freq: 1
15:14:32 |     use_memories: False
15:14:32 |     use_reply: label
15:14:32 |     variant: bart
15:14:32 |     verbose: False
15:14:32 |     warmup_rate: 0.0001
15:14:32 |     warmup_updates: 0
15:14:32 |     weight_decay: None
15:14:32 |     wrap_memory_encoder: False
15:14:32 | Current ParlAI commit: 831bcc7cc11a28aa8bb58bd7a14f2d9637fd6dbb
Enter [DONE] if you want to end the episode, [EXIT] to quit.
15:14:33 | creating task(s): interactive
Enter Your Message: What is the capital of Spain?
/usr/local/lib/python3.8/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)
Traceback (most recent call last):
  File "/usr/local/bin/parlai", line 11, in <module>
    load_entry_point('parlai', 'console_scripts', 'parlai')()
  File "/root/ParlAI/parlai/__main__.py", line 14, in main
    superscript_main()
  File "/root/ParlAI/parlai/core/script.py", line 325, in superscript_main
    return SCRIPT_REGISTRY[cmd].klass._run_from_parser_and_opt(opt, parser)
  File "/root/ParlAI/parlai/core/script.py", line 108, in _run_from_parser_and_opt
    return script.run()
  File "/root/ParlAI/parlai/scripts/interactive.py", line 118, in run
    return interactive(self.opt)
  File "/root/ParlAI/parlai/scripts/interactive.py", line 93, in interactive
    world.parley()
  File "/root/ParlAI/parlai/tasks/interactive/worlds.py", line 89, in parley
    acts[1] = agents[1].act()
  File "/root/ParlAI/parlai/core/torch_agent.py", line 2139, in act
    response = self.batch_act([self.observation])[0]
  File "/root/ParlAI/parlai/core/torch_agent.py", line 2235, in batch_act
    output = self.eval_step(batch)
  File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 875, in eval_step
    beam_preds_scores, beams = self._generate(
  File "/root/ParlAI/parlai/agents/rag/rag.py", line 605, in _generate
    gen_outs = self._rag_generate(batch, beam_size, max_ts, prefix_tokens)
  File "/root/ParlAI/parlai/agents/rag/rag.py", line 644, in _rag_generate
    return self._generation_agent._generate(
  File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 1165, in _generate
    n_best_beam_preds_scores = [b.get_rescored_finished() for b in beams]
  File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 1165, in <listcomp>
    n_best_beam_preds_scores = [b.get_rescored_finished() for b in beams]
  File "/root/ParlAI/parlai/core/torch_generator_agent.py", line 1562, in get_rescored_finished
    assert (pred == self.eos).sum() == 1, (
AssertionError: TreeSearch returned a finalized hypo with multiple end tokens with score nan
klshuster commented 3 years ago

Can you share a sample of what your passages file looks like? I can run the model just fine on my end with the default settings:

$ parlai interactive -mf zoo:hallucination/bart_rag_turn_dtt/model
.
.
.
Enter Your Message: what is the capital of spain?
[Rag]: Madrid is the capital of Spain, but there are many other major urban areas like Barcelona, Valencia, Seville, Málaga and Bilbao.
lukesalamone commented 3 years ago

@klshuster Originally I was using a custom file with different data in the TSV, but even when I change my input TSV file to contain only one line (besides the header) I still get the same error (AssertionError: TreeSearch returned a finalized hypo with multiple end tokens with score nan):

x\tChicago, officially the City of Chicago, is the most populous city in the U.S. state of Illinois, and the third most populous city in the United States, following New York and Los Angeles. With an estimated population of 2,693,976 in 2019, it is also the most populous city in the Midwestern United States and the fifth most populous city in North America. Chicago is the county seat of Cook County, the second most populous county in the U.S., while a small portion of the city's O'Hare Airport also extends into DuPage County. Chicago is the principal city of the Chicago metropolitan area, defined as either the U.S. Census Bureau's metropolitan statistical area (9.4 million people) or the combined statistical area (almost 10 million residents), often called Chicagoland. It constitutes the third most populous urban area in the United States after New York City and Los Angeles and is one of the 40 largest urban areas in the world.\tChicago

This line is repeated 9999 times in my file. I re-ran the embedding script and the indexing script, and the issue persists.

Here's the file for reference. GitHub won't let me upload TSV files, so I changed the extension to CSV (just change it back to TSV): chicago.csv
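
(A minimal sketch, not from the original report, of how such a degenerate passages file can be generated; the header row and id/text/title column order are assumptions carried over from the default DPR passages file:)

# Sketch: rebuild the degenerate passages.tsv with one passage
# repeated 9999 times under a constant id.
passage = (
    "Chicago, officially the City of Chicago, is the most populous city "
    "in the U.S. state of Illinois, and the third most populous city in "
    "the United States..."
)
with open("passages.tsv", "w") as f:
    f.write("id\ttext\ttitle\n")  # header row, assumed to match psgs_w100.tsv
    for _ in range(9999):
        f.write(f"x\t{passage}\tChicago\n")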

lukesalamone commented 3 years ago

When I print out the lines it's trying to decode, I get:

11:59:39 | Decoding error: tensor([1, 3, 2], device='cuda:0')
11:59:39 | Decoding error: tensor([1, 3, 2, 3, 2], device='cuda:0')
11:59:39 | Decoding error: tensor([1, 3, 2, 3, 2, 4, 2], device='cuda:0')

I think 1 is start of sequence, 2 is end of sequence, and 3 is UNK. So it seems like the model is basically generating nonsense.
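
(A toy illustration, outside ParlAI, of why the reported hypothesis score ends up as nan: beam search accumulates log-probabilities by addition, so a single NaN among the document scores contaminates every hypothesis that attends to that document:)

import torch

# One NaN in the per-document scores propagates through the additive
# log-prob accumulation, so the affected hypothesis score becomes nan.
doc_scores = torch.tensor([float("nan"), -1.2, -0.7])
token_logprobs = torch.tensor([-0.5, -0.3, -0.9])
print(doc_scores + token_logprobs)  # tensor([nan, -1.5000, -1.6000])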

klshuster commented 3 years ago

OK, I am able to repro your issue, so I'm looking into it now.

klshuster commented 3 years ago

OK, it looks like the scores for the retrieved documents are NaNs; this probably indicates that something is going wrong with the index building (which makes sense for the provided chicago.tsv, since we're building a clustered index where all the vectors are identical...).

I'll be putting up a patch shortly that catches and handles this; in the meantime, perhaps try building an exact index with --indexer-type exact in the index_dense_embeddings script (and then setting that parameter in interactive mode as well).

Edit: I investigated the indices returned from searching the index, and they are all -1, which indicates that something is going wrong with the index building: https://github.com/facebookresearch/faiss/issues/244
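
(A minimal FAISS sketch, with assumed sizes, of why the exact-index workaround above sidesteps this failure mode: a flat index has no k-means training step, so a degenerate corpus of identical vectors cannot produce broken clusters or -1 search results:)

import numpy as np
import faiss  # assumes faiss-cpu or faiss-gpu is installed

d = 768  # retriever embedding size used in this thread
# Degenerate corpus: every embedding is identical, as with the repeated
# chicago passage above.
xb = np.tile(np.random.rand(1, d).astype("float32"), (10000, 1))

# An exact (flat) inner-product index just stores the vectors;
# there is no clustering step to degenerate.
index = faiss.IndexFlatIP(d)
index.add(xb)
D, I = index.search(xb[:1], 5)
print(I)  # valid neighbor ids (ties broken arbitrarily), never -1

Concretely, the suggested workaround would look roughly like the original indexing command with the extra flag: python index_dense_embeddings.py --indexer-type exact --retriever-embedding-size 768 --embeddings-dir embeddings --embeddings-name embeddings/out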

klshuster commented 3 years ago

The fix for this has been merged; feel free to reopen (or file a new issue and tag me) if you continue to run into problems.