facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License

Evaluation Script for BlenderBot 2.0 #4681

Closed · robinsongh381 closed 1 year ago

robinsongh381 commented 2 years ago

I was trying to reproduce the ppl metric reported here for BlenderBot 2.0, but could not find a relevant script.

I ran the following command

parlai eval_model -mf zoo:blenderbot2/blenderbot2_400M/model -t msc -v

and got the following error

ValueError: Must provide a valid server for search

Presumably, the error arises because I did not specify a search server against which queries are run.

I am not sure whether the reported ppl here was obtained with or without a specific search server.

Can you please share the exact evaluation script that generates the reported ppl?

Also, is there any way to evaluate without a search server?

Thank you

robinsongh381 commented 2 years ago

Trying the command below gives another error

parlai eval_model -mf zoo:blenderbot2/blenderbot2_400M/model -t msc -v --knowledge_access_method memory_only

[screenshot: error traceback]

Note that I added the --knowledge_access_method memory_only argument, which is the suggested solution for the above error, as in #3972.

robinsongh381 commented 2 years ago

After some debugging, it turns out that modifying the --rag_retriever_type argument disables search. I set it to dpr and ran the following command, which is working at the moment.

parlai eval_model -mf zoo:blenderbot2/blenderbot2_400M/model -t msc -v --knowledge-access-method memory_only --rag_retriever_type dpr

This is strange, given that --help shows the default value for the --rag_retriever_type argument is dpr, yet the effective value seems to be search_engine, which causes the above AttributeError.

[screenshot]

@klshuster

robinsongh381 commented 2 years ago

Having said that, I still want to know the exact eval script to reproduce the reported ppl here, in the table "Metrics Used and Evaluation Results", for both Session 4 of Multi-Session Chat and Wizard of the Internet.

Also, can you please explain what information you used as context (e.g. dialogue history, predicted dialogue summary, Bing/Google search, Wikipedia dump, etc.) during this evaluation?

Thank you

klshuster commented 2 years ago

The default argument for --rag-retriever-type is indeed dpr; however, when specifying a --model-file, we load the option from that model's opt, which is search_engine. In any case, your command should be enough to reproduce the PPL number reported in that table. Note that you're using the 400M model, whereas that table reports the 2.7B model.
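
If you want to double-check which retriever type a checkpoint will actually load, one option (assuming the usual ParlAI layout, where the checkpoint's options are stored as JSON in a sibling .opt file under your --datapath; the path below is only illustrative) is something like:

python -c "import json; print(json.load(open('<datapath>/models/blenderbot2/blenderbot2_400M/model.opt'))['rag_retriever_type'])"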

robinsongh381 commented 2 years ago

@klshuster Thank you for the reply.

I'm evaluating the 2.7B model with and without the --include_last_session True argument, and the validation ppl differs between the two runs.

Could you please explain the role of the include_last_session argument, and which value (True or False) should I use in order to reproduce the result?

Thank you

robinsongh381 commented 2 years ago

@klshuster

[screenshot: evaluation results]

This is the result of the following command

parlai eval_model -mf zoo:blenderbot2/blenderbot2_3B/model -t msc -v --knowledge-access-method none --rag_retriever_type dpr  --log_every_n_secs 60 --batchsize 32

You can see that the ppl of Session 4 is 9.976, which is slightly higher than the reported value of 9.84635.

What do you think is the reason for the gap?

klshuster commented 2 years ago

I would try --knowledge-access-method memory_only

--include-last-session is a parameter that controls whether to include the final session of Multi-Session Chat (MSC) in the model evaluation; this should be True
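
For example, folding both of those into your earlier command would give something along these lines (a sketch only, combining the flags already used in this thread):

parlai eval_model -mf zoo:blenderbot2/blenderbot2_3B/model -t msc --knowledge-access-method memory_only --include-last-session True --rag-retriever-type dpr --log-every-n-secs 60 --batchsize 32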

robinsongh381 commented 2 years ago

@klshuster Thank you for the kind reply.

After evaluating with --knowledge-access-method memory_only, I obtained the following results, which still show some gap with respect to the reported value of 9.8463.

msc_dialogue_4/exs:5904
msc_dialogue_4/accuracy:0
msc_dialogue_4/f1:0.18795490970546966
msc_dialogue_4/bleu-4:0.0068064881917066005
msc_dialogue_4/clen:1214.0984078590786
msc_dialogue_4/ctrunc:1
msc_dialogue_4/ctrunclen:1086.0984078590786
msc_dialogue_4/llen:32.38245257452574
msc_dialogue_4/ltrunc:0.0005081300813008131
msc_dialogue_4/ltrunclen:0.0016937669376693768
msc_dialogue_4/loss:2.295536981786989
msc_dialogue_4/ppl:9.929766684779013
msc_dialogue_4/token_acc:0.46994392601581786
msc_dialogue_4/token_em:0
msc_dialogue_4/gen_n_toks:26.604336043360433
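
(For reference, ppl here is just the exponential of the average loss: exp(2.2955) ≈ 9.93.)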

This is my full opt. Can you suggest anything else I could try in order to reproduce the reported value?

init_opt:null
allow_missing_init_opts:false
task:"msc"
download_path:null
loglevel:"info"
datatype:"valid"
image_mode:"raw"
hide_labels:false
multitask_weights:[1]
batchsize:16
dynamic_batching:null
verbose:true
is_debug:false
datapath:"/data/private/language_model/git/ParlAI/data"
model:null
model_file:"/data/private/language_model/git/ParlAI/data/models/blenderbot2/blenderbot2_3B/model"
init_model:null
dict_class:"parlai.core.dict:DictionaryAgent"
report_filename:"/data/private/language_model/git/large-scale-ai/blenderbot-demo/eval_result/blenderbot2_3B_knowledge_memory_only.json"
world_logs:""
save_format:"conversations"
area_under_curve_digits:-1
area_under_curve_class:null
num_examples:-1
display_examples:false
log_every_n_secs:60
metrics:"default"
aggregate_micro:false
log_keep_fields:"all"
tensorboard_log:false
tensorboard_logdir:null
image_size:256
image_cropsize:224
include_session1:true
include_last_session:false
session_id:2
previous_persona_type:"raw_history"
session_openning:false
label_speaker_id:"both"
include_time_gap:false
history_time_gaps_token:null
history_person_tokens:null
previous_session_delimiter:null
mutators:null
your_persona_first:true
max_num_turns:-1
is_convai2_session_level:false
candidates:"inline"
eval_candidates:"inline"
interactive_candidates:"fixed"
repeat_blocking_heuristic:true
fixed_candidates_path:null
fixed_candidate_vecs:"reuse"
encode_candidate_vecs:true
encode_candidate_vecs_batchsize:256
train_predict:false
cap_num_predictions:100
ignore_bad_candidates:false
rank_top_k:-1
return_cand_scores:false
use_memories:false
wrap_memory_encoder:false
memory_attention:"sqrt"
normalize_sent_emb:false
share_encoders:true
learn_embeddings:true
data_parallel:false
reduction_type:"mean"
polyencoder_type:"codes"
poly_n_codes:64
poly_attention_type:"basic"
poly_attention_num_heads:4
codes_attention_type:"basic"
codes_attention_num_heads:4
generation_model:"bart"
query_model:"bert"
rag_model_type:"token"
thorough:false
n_extra_positions:0
gold_knowledge_passage_key:"checked_sentence"
gold_knowledge_title_key:"title"
rag_retriever_query:"full_history"
rag_retriever_type:"dpr"
retriever_debug_index:null
n_docs:5
min_doc_token_length:64
max_doc_token_length:256
rag_query_truncate:512
print_docs:false
path_to_index:"zoo:hallucination/wiki_index_compressed/compressed_pq"
path_to_dense_embeddings:null
dpr_model_file:"zoo:hallucination/multiset_dpr/hf_bert_base.cp"
path_to_dpr_passages:"zoo:hallucination/wiki_passages/psgs_w100.tsv"
retriever_embedding_size:768
tfidf_max_doc_paragraphs:-1
tfidf_model_path:"zoo:wikipedia_full/tfidf_retriever/model"
dpr_num_docs:25
poly_score_initial_lambda:0.5
polyencoder_init_model:"wikito"
poly_faiss_model_file:null
regret:false
regret_intermediate_maxlen:32
regret_model_file:null
regret_dict_file:null
regret_override_index:false
indexer_type:"compressed"
indexer_buffer_size:65536
compressed_indexer_factory:"IVF4096_HNSW128,PQ128"
compressed_indexer_gpu_train:false
compressed_indexer_nprobe:64
hnsw_indexer_store_n:128
hnsw_ef_search:128
hnsw_ef_construction:200
rag_turn_n_turns:2
rag_turn_marginalize:"doc_then_turn"
rag_turn_discount_factor:1
embedding_size:1024
n_layers:2
ffn_size:4096
dropout:0.1
attention_dropout:0.1
relu_dropout:0
n_heads:16
learn_positional_embeddings:true
embeddings_scale:false
n_positions:1024
n_segments:0
variant:"bart"
activation:"gelu"
output_scaling:1
share_word_embeddings:true
n_encoder_layers:12
n_decoder_layers:12
model_parallel:false
checkpoint_activations:false
init_fairseq_model:null
output_conversion_path:null
beam_size:1
beam_min_length:1
beam_context_block_ngram:-1
beam_block_ngram:-1
beam_block_full_context:true
beam_length_penalty:0.65
skip_generation:false
inference:"greedy"
topk:10
topp:0.9
beam_delay:30
beam_block_list_filename:null
temperature:1
compute_tokenized_bleu:false
gpu_beam_blocking:false
interactive_mode:false
embedding_type:"random"
embedding_projection:"random"
fp16:false
fp16_impl:"safe"
force_fp16_tokens:false
optimizer:"adamax"
learningrate:0.0001
gradient_clip:0.1
adam_eps:1e-8
adafactor_eps:[1e-30,0.001]
momentum:0
nesterov:true
nus:[0.7]
betas:[0.9,0.999]
weight_decay:null
rank_candidates:false
truncate:1024
text_truncate:null
label_truncate:null
history_reversed:false
history_size:-1
person_tokens:false
split_lines:false
use_reply:"label"
add_p1_after_newln:false
delimiter:" "
history_add_global_end_token:null
special_tok_lst:null
gpu:-1
no_cuda:false
dict_file:null
dict_initpath:null
dict_language:"english"
dict_max_ngram_size:-1
dict_minfreq:0
dict_maxtokens:-1
dict_nulltoken:"__null__"
dict_starttoken:"__start__"
dict_endtoken:"__end__"
dict_unktoken:"__unk__"
dict_tokenizer:"gpt2"
dict_lower:false
bpe_debug:false
dict_textfields:"text,labels"
bpe_vocab:null
bpe_merge:null
bpe_add_prefix_space:null
bpe_dropout:null
lr_scheduler:"reduceonplateau"
lr_scheduler_patience:3
lr_scheduler_decay:0.5
invsqrt_lr_decay_gamma:-1
warmup_updates:-1
warmup_rate:0.0001
update_freq:1
t5_model_arch:"t5-base"
t5_model_parallel:false
t5_dropout:0
t5_generation_config:null
search_query_generator_model_file:null
search_query_generator_inference:"greedy"
search_query_generator_beam_min_length:1
search_query_generator_beam_size:1
search_query_generator_text_truncate:512
splitted_chunk_length:256
doc_chunk_split_mode:"word"
n_ranked_doc_chunks:1
doc_chunks_ranker:"head"
woi_doc_chunk_size:500
search_server:null
knowledge_access_method:"memory_only"
memory_key:"full_text"
query_generator_key:"full_text"
gold_document_key:"__selected-docs__"
gold_sentence_key:"__selected-sentences__"
gold_document_titles_key:"__select-docs-titles__"
skip_search_key:"skip_search"
insert_gold_docs:false
memory_extractor_phrase:"persona:"
retriever_ignore_phrase:"persona:"
query_generator_ignore_phrase:"persona:"
query_generator_model_file:"zoo:blenderbot2/query_generator/model"
query_generator_delimiter:" "
query_generator_inference:"beam"
query_generator_beam_size:1
query_generator_beam_min_length:2
query_generator_truncate:-1
memory_retriever_truncate:-1
retriever_delimiter:" "
share_search_and_memory_query_encoder:false
memory_reader_model:null
memory_doc_title_delimiter:" / "
memory_writer_model:"bert"
memory_writer_model_file:"zoo:hallucination/multiset_dpr/hf_bert_base.cp"
add_cleaned_reply_to_history:false
memory_decoder_key:"full_text"
memory_decoder_ignore_phrase:"persona:"
memory_decoder_model_file:""
memory_decoder_delimiter:" "
memory_decoder_beam_size:3
memory_decoder_beam_min_length:10
memory_decoder_truncate:-1
memory_decoder_one_line_memories:false
parlai_home:"/data/private/language_model/git/ParlAI"

klshuster commented 2 years ago

Ahh yes, two changes:

--previous-persona-type none --memory-key personas

That should bring this closer to what the model was evaluated on.
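
For example, adding those two flags to the command above (a sketch that just combines the flags already discussed in this thread):

parlai eval_model -mf zoo:blenderbot2/blenderbot2_3B/model -t msc --knowledge-access-method memory_only --include-last-session True --rag-retriever-type dpr --previous-persona-type none --memory-key personas --batchsize 32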

github-actions[bot] commented 1 year ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.