yes, you can do exactly as you've described. it remains to be seen how effective this would be, depending on the model used
Thanks for the response!
Which model would be suitable for this? I am currently using BB2_400M.
any conversational model would be a reasonable choice, bb2_400m is good
Thanks!
When I train the BB2 model on my custom dataset, it learns the custom dialogues perfectly. However, training on the custom dataset affects the BB2 model's general responses.
For example: given an open-domain dialogue that is not present in my dataset, the stock BB2 model can generate a well-generalized response and also makes use of the prepended persona information. The BB2 model fine-tuned on the custom dataset, however, does not generalize well and can no longer make use of the prepended persona information.
Here is an example of the dataset format I use for training the BB2 model:
{"dialog": [[{"id": "partner1", "text": "your persona: I am John\nyour persona: I live in Ohio.\ntell me a joke."}, {"id": "partner2", "text": "One time, I put strawberry jam on my burger. I thought it was ketchup!"}]]}
It seems like the model overfits too heavily on the current task after fine-tuning and forgets the default BB2 blended_skill_talk capabilities.
Following is the command I use for fine-tuning the model. Do I need to update some parameters so that the model can learn from the custom dataset while preserving the capabilities of the pre-trained BB2 model (perhaps by skipping the generation or retrieval part, e.g. --mutators skip_retrieval; see the sketch after the command)?
parlai train_model -dp data \
--model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
--task jsonfile --jsonfile-datapath dialogue_dialog_json_task.json \
--num-epochs 16 --batchsize 1 \
--memory-decoder-model-file "" --memory-key full_text \
--search-query-generator-model-file zoo:blenderbot2/query_generator/model --search-query-generator-beam-min-length 2 \
--save-every-n-secs 600 --validation-every-n-secs 600 --log-every-n-secs 60 \
--init-model zoo:blenderbot2/blenderbot2_400M/model --dict-file zoo:blenderbot2/blenderbot2_400M/model.dict \
--datatype train:stream \
--embeddings-scale True --variant prelayernorm --split-lines True --learn-positional-embeddings True \
--n-layers 12 --embedding-size 1024 --ffn-size 4096 --n-heads 16 --n-decoder-layers 12 \
--dict-tokenizer gpt2 --generation-model bart \
--query-model bert_from_parlai_rag \
--rag-model-type token --rag-retriever-type dpr \
--dpr-model-file zoo:hallucination/bart_rag_token/model \
--gold-document-titles-key select-docs-titles --insert-gold-docs True \
--beam-min-length 5 --beam-context-block-ngram 3 --beam-block-ngram 3 --beam-block-full-context False --beam-size 3 \
--inference beam --optimizer mem_eff_adam --learningrate 1e-05 --lr-scheduler-patience 1 --model-parallel True \
--knowledge-access-method memory_only \
--truncate 512 --text-truncate 512 --label-truncate 128 \
--dropout 0.0 --attention-dropout 0.0 \
--min-doc-token-length 64 --max-doc-token-length 256 \
--fp16 True --fp16-impl mem_efficient --force-fp16-tokens True \
--model-file model/odkg_model
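(For reference, a sketch of the retrieval-skipping variant mentioned above, assuming the skip_retrieval mutator from the BlenderBot2 project also applies to the jsonfile task, would change the task line to:)
--task jsonfile --jsonfile-datapath dialogue_dialog_json_task.json --mutators skip_retrieval \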
Thanks!
fine-tuning on new data while preserving knowledge of older capabilities is an open problem in language modeling. One approach would be to mix some of the original training data into fine-tuning to mitigate this
Would it be helpful to add the blended_skill_talk task alongside the new task (--task jsonfile --jsonfile-datapath dialogue_dialog_json_task.json) when training the model, to preserve that knowledge?
possibly!
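For reference, a minimal sketch of the multitask setup (this assumes ParlAI's comma-separated --task syntax and the --multitask-weights flag; the weights are illustrative, not tuned) would replace the task line in the command above with:
--task jsonfile,blended_skill_talk --jsonfile-datapath dialogue_dialog_json_task.json \
--multitask-weights 1,1 \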
Hi @klshuster ,
Thanks for always helping with your feedback!
I have another question, related to the search server (--search_server 0.0.0.0:8080): is it possible to search from documents instead of a search engine?
Yes, we offer several ways of doing this via the --rag-retriever-type flag, such as using neural retrieval over document embeddings, or TFIDF over document sets. Please see the relevant README for more details
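As a rough sketch of those two options (flag names taken from the ParlAI RAG setup; the paths are placeholders, so verify the exact arguments against the README):
# TFIDF retrieval over a fixed set of documents
--rag-retriever-type tfidf
# dense (DPR + FAISS) retrieval over pre-built document embeddings
--rag-retriever-type dpr \
--path-to-index <your_faiss_index> \
--path-to-dpr-passages <your_passages.tsv>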
Hi @klshuster
Are there any length constraints on the input when adding personas and context as shown below?
input = "your persona:{persona}\n..\n{user_input} {bot_output}\n...\n{actual input text}"
Only to the extent the bot can fit the tokenized input into its context; that is determined by the --truncate / --text-truncate flags
if you're curious about bb2, that is 128 tokens (for the 3B model) and 1024 tokens (for the 400m model)
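For example, with the 400M model one could raise the truncation flags up to that limit (a sketch; 1024 is just the position-embedding bound noted above):
--truncate 1024 --text-truncate 1024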
How come the 3B model gets 128 tokens while the 400M gets 1024? I would have expected the opposite.
The real factor is the number of position embeddings the pre-trained models used; the 3B model just so happened to use 128 positions while the 400M used 1024 (the 3B model is based on BlenderBot 1, the 400M on BART)
Hi @klshuster ,
I understand how the token length is defined for the mentioned models.
I am curious: a few days back you told me that there are no length restrictions in terms of defining personas (mentioned here).
Ahh, when I said there were no length restrictions I meant that ParlAI itself could handle any length; the models are bound by their truncation length
Thank you so much for the clarification.
Hi,
Can we prepend a dialogue script between the user and the bot to the text as seed conversation context? How effective would this be at generating relevant responses?
Input pattern:
input = "your persona:{persona}\n..\n{user_input} {bot_output}\n...\n{actual input text}"
Thanks!