How to insert gold docs for BlenderBot2

rguan1 commented 2 years ago

Hi,

I've been looking at the documentation here on how to incorporate my own documents for BlenderBot2 to use. However, it is not clear to me from the documentation how to implement gold docs, so I would like to ask some clarifying questions.

From the instructions for "Insert Gold Docs", it seems that I have to create a dataset with the fields "gold-document-key", "gold-sentence-key", and "gold-document-titles-key", along with the typical fields "text" and "labels" described in the ParlAI docs. This means that the documents in a message are specified per text and label pair, right? Also, once I've set those fields, do I simply add the flag --insert-gold-docs True, and it will just work?

klshuster commented 2 years ago

Yes, that's almost it, except you'll want to actually set the keys to whatever is specified by the --gold-document-key, --gold-sentence-key, and --gold-document-titles-key args.

Another way that we have provided for doing this is by specifying --model projects.blenderbot2.agents.blenderbot2:BlenderBot2WizIntGoldDocRetrieverFiDAgent. This agent expects the following keys in an observation:

'__retrieved-docs-titles__': list of document titles
'__retrieved-docs-urls__': list of doc urls (can be empty strings)
'__retrieved-docs__': list of retrieved documents
'__selected-sentences__': list of selected sentences from the document (can be empty)
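For illustration, an observation carrying these four keys could be assembled as in the minimal sketch below. The helper name `build_gold_observation` and the per-document dict layout (`title`, `content`, `url`, `sentences`) are hypothetical, and the sketch assumes the three document lists are aligned by index and that selected sentences are passed as one flat list; only the four key names come from the list above.

```python
def build_gold_observation(text, label, docs):
    """Build one observation dict with the gold-doc keys listed above.

    `docs` is a list of dicts with 'title' and 'content', plus optional
    'url' and 'sentences' entries (this layout is an assumption).
    """
    return {
        "text": text,
        "labels": [label],
        "__retrieved-docs-titles__": [d["title"] for d in docs],
        # URLs can be empty strings
        "__retrieved-docs-urls__": [d.get("url", "") for d in docs],
        "__retrieved-docs__": [d["content"] for d in docs],
        # Selected sentences can be empty; flattened across documents here
        "__selected-sentences__": [
            s for d in docs for s in d.get("sentences", [])
        ],
    }


observation = build_gold_observation(
    text="What is the boiling point of water?",
    label="It boils at 100 degrees Celsius at sea level.",
    docs=[
        {
            "title": "Water",
            "content": "Water boils at 100 degrees Celsius at sea level.",
            "sentences": ["Water boils at 100 degrees Celsius at sea level."],
        }
    ],
)
```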

rguan1 commented 2 years ago

"Yes, that's almost it, except you'll want to actually set the keys to whatever is specified by the --gold-document-key, --gold-sentence-key, and --gold-document-titles-key args."

Could you clarify this? What do you mean by setting the keys to what is specified by the field args?

Also, is there a similar function for inserting gold docs into the search results for interactive runs of BlenderBot2 or for self chat? Or is forcing certain documents to show up in search results something that must be overridden in the search server code?

Thank you!

klshuster commented 2 years ago

You don't want to put the literal string gold_document_key in the message; you want to use the value passed for that parameter as the key.
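A minimal sketch of that distinction: the message fields are named after the values passed to the three flags, not after the flag names themselves. The key names `my_gold_docs`, `my_gold_sentences`, and `my_gold_titles` and the helper `attach_gold_docs` are hypothetical, and the sketch assumes each field holds a list of strings.

```python
def attach_gold_docs(message, docs, sentences, titles, *,
                     document_key, sentence_key, titles_key):
    """Return a copy of `message` with gold fields stored under the
    configured key names (the values passed on the command line)."""
    out = dict(message)
    out[document_key] = list(docs)
    out[sentence_key] = list(sentences)
    out[titles_key] = list(titles)
    return out


# Hypothetical key names; they must match what is passed on the command line:
#   --gold-document-key my_gold_docs
#   --gold-sentence-key my_gold_sentences
#   --gold-document-titles-key my_gold_titles
example = attach_gold_docs(
    {"text": "Tell me about tea.", "labels": ["Tea is brewed from leaves."]},
    docs=["Tea is an aromatic beverage brewed from cured leaves."],
    sentences=["Tea is an aromatic beverage brewed from cured leaves."],
    titles=["Tea"],
    document_key="my_gold_docs",
    sentence_key="my_gold_sentences",
    titles_key="my_gold_titles",
)
```

With this layout, the message contains `my_gold_docs` but no field literally named `gold_document_key`.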

Currently we do not support inserting gold docs into search results during interactive or self chat. That will need to be overridden in the search server.

Edit: you could override the LocalHumanAgent instantiated in Interactive so that the reply returned in act includes those keys, named as specified by the args, and then specify --insert-gold-docs True.
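The idea of modifying the reply returned by act can be sketched without ParlAI as a thin wrapper around any object with an `act()` method. This is only an illustration: `GoldDocWrapper` and `StubHuman` are hypothetical names, the stub stands in for the real human agent, and the key names are placeholders for whatever is passed to the gold-doc flags. In ParlAI itself one would more likely subclass LocalHumanAgent and add the fields inside its `act`.

```python
class StubHuman:
    """Stand-in for the human agent; returns a fixed reply (hypothetical)."""

    def act(self):
        return {"text": "Tell me about tea.", "episode_done": False}


class GoldDocWrapper:
    """Wrap an agent and add gold-doc fields to every reply it returns."""

    def __init__(self, agent, gold_fields):
        self.agent = agent
        self.gold_fields = gold_fields  # maps configured key name -> list

    def act(self):
        reply = self.agent.act()
        for key, value in self.gold_fields.items():
            # These are new keys, so plain assignment is assumed to suffice.
            reply[key] = value
        return reply


wrapped = GoldDocWrapper(
    StubHuman(),
    {
        # Placeholder key names; use the values passed to the flags.
        "my_gold_docs": ["Tea is an aromatic beverage brewed from cured leaves."],
        "my_gold_sentences": [],
        "my_gold_titles": ["Tea"],
    },
)
reply = wrapped.act()
```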

rguan1 commented 2 years ago

I see. Thank you!

Regarding overriding the search server to serve gold docs, what purpose does the url parameter of the search server response serve? The docs told me to look here to understand how the url is used. However, I don't see the url parameter being used in the retrieve_api.py file, and I'm not sure how it is used, especially since the response already includes the title and content parameters alongside the url.

klshuster commented 2 years ago

The URLs are not used functionally in the code; they are there in case one would like to use them.
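For anyone overriding the search server to serve fixed documents, a minimal sketch using only the Python standard library follows. It rests on assumptions not confirmed in this thread: that the client sends a form-encoded POST with fields `q` (query) and `n` (number of documents), and that it expects a JSON body with a top-level `response` list whose items carry `url`, `title`, and `content`. The names `GOLD_DOCS`, `build_response`, and `GoldDocHandler` are hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler
from urllib.parse import parse_qs

# Hypothetical fixed documents to return for every query.
GOLD_DOCS = [
    {"title": "Tea", "content": "Tea is brewed from cured leaves."},
    {"title": "Coffee", "content": "Coffee is brewed from roasted beans."},
]


def build_response(query, n, docs=GOLD_DOCS):
    """Return up to `n` gold docs, ignoring the query text."""
    return {
        "response": [
            # The url is not used functionally, so an empty string is sent.
            {"url": d.get("url", ""), "title": d["title"], "content": d["content"]}
            for d in docs[:n]
        ]
    }


class GoldDocHandler(BaseHTTPRequestHandler):
    """Answers POST requests with the gold docs (request format assumed)."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = parse_qs(self.rfile.read(length).decode("utf-8"))
        query = form.get("q", [""])[0]
        n = int(form.get("n", ["5"])[0])
        body = json.dumps(build_response(query, n)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


payload = build_response("anything", 1)
```

To try it, the handler could be served with `http.server.HTTPServer(("0.0.0.0", 8080), GoldDocHandler).serve_forever()` and the address passed via --search-server.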

rguan1 commented 2 years ago

Ok, I think all my questions are answered. Thank you again! I'll close the issue.

rguan1 commented 2 years ago

Hi,

I just tried to use the projects.blenderbot2.agents.blenderbot2:BlenderBot2WizIntGoldDocRetrieverFiDAgent model to incorporate gold docs. I ran into a very long error message when I ran the command below.

python3 $HOME/ParlAI/parlai/scripts/train_model.py --task ec_gold --num_epochs 1 --model-file $HOME/thesis/chatbot_models/BB2/ec_gold_test/model  \
--model projects.blenderbot2.agents.blenderbot2:BlenderBot2WizIntGoldDocRetrieverFiDAgent \
--dict-file zoo:blenderbot2/blenderbot2_3B/model.dict --init-model zoo:blenderbot2/blenderbot2_3B/model \
--skip-generation True --knowledge-access-method all --memory-key full_text --search-server 0.0.0.0:8080 \
--metrics ppl --validation-metric ppl --model-parallel True --memory-decoder-model-file '' --datatype train:stream \
--generation-model transformer/generator --query-model bert_from_parlai_rag --rag-model-type token \
--rag-retriever-type dpr --max-doc-token-length 64 --dpr-model-file zoo:hallucination/bart_rag_token/model \
--beam-size 10 --beam-min-length 20 --beam-context-block-ngram 3 --beam-block-ngram 3 \
--beam-block-full-context False --inference nucleus --fp16-impl mem_efficient --optimizer adam --truncate 128 \
--text-truncate 128 --label-truncate 128 --history-add-global-end-token end --warmup-updates 100 \
--insert-gold-docs True --activation gelu --attention-dropout 0.0 --dict-tokenizer bytelevelbpe --dropout 0.1 \
--embedding-size 2560 --ffn-size 10240 --force-fp16-tokens true --fp16 true --n-decoder-layers 24 --n-encoder-layers 2 \
--n-heads 32 --n-positions 128 --variant prelayernorm --delimiter ' '

The start of the error looks like this, and then it continues listing a bunch of layers:

RuntimeError: Error(s) in loading state_dict for BlenderBot2FidModel:
    Unexpected key(s) in state_dict: "retriever.query_encoder.embeddings.weight", 
"retriever.query_encoder.position_embeddings.weight", "retriever.query_encoder.norm_embeddings.weight",
 "retriever.query_encoder.norm_embeddings.bias", "retriever.query_encoder.segment_embeddings.weight"
...

If I substitute the --model parameter with projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent, the command above works. I suspect one of the arguments is off again, but I'm not sure how to proceed. Any help would be appreciated!
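Since the full list of unexpected keys is long, one way to make it readable is to group the key names by their leading components. The sketch below does that for the five keys quoted above; `summarize_keys` is a hypothetical helper and works on plain strings, so it does not need the checkpoint to be loaded.

```python
from collections import Counter


def summarize_keys(keys, depth=2):
    """Count state_dict key names by their first `depth` dotted components."""
    return Counter(".".join(k.split(".")[:depth]) for k in keys)


# The five keys shown at the start of the error message.
unexpected = [
    "retriever.query_encoder.embeddings.weight",
    "retriever.query_encoder.position_embeddings.weight",
    "retriever.query_encoder.norm_embeddings.weight",
    "retriever.query_encoder.norm_embeddings.bias",
    "retriever.query_encoder.segment_embeddings.weight",
]

summary = summarize_keys(unexpected)
for prefix, count in summary.most_common():
    print(f"{prefix}: {count} unexpected key(s)")
```

For the keys quoted here, everything falls under `retriever.query_encoder`.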

klshuster commented 2 years ago

Are you on the latest version of ParlAI? I ran your command locally and did not run into any issues.

rguan1 commented 2 years ago

Got it. I'll try that. Thanks.

github-actions[bot] commented 2 years ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

facebookresearch / ParlAI

How to insert gold docs for BlenderBot2 #4467