facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai

Training BB2 on custom dataset #4739

Closed geo47 closed 2 years ago

geo47 commented 2 years ago

I want to train the BB2 model on a custom dataset with bot personas. I have the dataset in question-answer dialog format.

I prepared a ParlAI-format dataset using the guidance given here.

I am able to generate the task from the JSON file by following the instructions. However, I am curious how I can add persona context, as the format described in the guide only contains input-response pairs.
For example: {"id": "partner1", "text": "hello how are you today?"}, {"id": "partner2", "text": "i'm great thanks! what are you doing?"}

PS. I want to add persona information because each bot is supposed to generate different responses to the same question based on its persona context.

Thanks.

klshuster commented 2 years ago

you can add the personas directly to the message, e.g. it would look like the following:

{"id": "partner1", "text": "hello how are you today?", "personas": ['your persona: bot_persona1', 'your persona: bot_persona2']}
geo47 commented 2 years ago

Hi,

Thanks for your response.

I have a few more queries.

The problem looks like this:

Creating 2 bots, each having a different persona.

Bot1: {"id": "partner1", "text": "tell me a joke.", "label": "One time, I put strawberry jam on my burger. I thought it was ketchup!", "personas": ["your persona: I am John", "your persona: I live in Ohio."]}
Bot2: {"id": "partner1", "text": "tell me a joke.", "label": "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!", "personas": ["your persona: I am Ellie", "your persona: I live in New York."]}

Given user input prefixed with the bot persona, e.g. 'your persona: Bot 1 persona\n user_input', the model should generate the response according to the given bot persona.

input: "your persona: I am Ellie\n I live in New York.\n tell me a joke."
output: "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!"
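For reference, a minimal sketch of this inference flow in Python (assuming a model has already been trained to a model file; create_agent_from_model_file and the observe/act loop are ParlAI's standard interface, while the path is hypothetical):

from parlai.core.agents import create_agent_from_model_file

# Load a trained model (hypothetical path) and query it with the persona
# lines prepended to the user input, as described above.
agent = create_agent_from_model_file('/projects/ParlAI/models/odkg_model')
agent.observe({
    'text': "your persona: I am Ellie.\nyour persona: I live in New York.\ntell me a joke.",
    'episode_done': True,
})
print(agent.act()['text'])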
  1. Are there any length restrictions for defining the personas in the message you mentioned above?

  2. For every input-output sample, do we have to add the same personas?

    {"id": "partner1", "text": "hello how are you today?", "personas": ['your persona: bot_persona1', 'your persona: bot_persona2']}
  3. Is the following JSON format correct for building the training dataset?

{
    "dialog": [
        {
            "id": "partner1",
            "text": "tell me a joke.",
            "label": "One time, I put strawberry jam on my burger. I thought it was ketchup!",
            "personas": [
                "your persona:  I am John",
                "your persona: I live in Ohio."
            ],
            "label_candidates": [
                "One time, I put strawberry jam on my burger. I thought it was ketchup!",
                "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!"
            ],
            "episode_done": true
        },
        {
            "id": "partner1",
            "text": "tell me a joke.",
            "label": "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!",
            "personas": [
                "your persona:  I am Ellie",
                "your persona: I live in New York."
            ],
            "label_candidates": [
                "One time, I put strawberry jam on my burger. I thought it was ketchup!",
                "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!"
            ],
            "episode_done": true
        }
    ]
}

Also, I have currently trained the BB2 model successfully on blended_skill_talk, convai2, and my custom dataset (custom_dataset.txt), as follows:

parlai train_model --model transformer/generator \
--task blended_skill_talk,convai2,fromfile:parlaiformat --fromfile_datapath parlai_dataset/archi_test.txt \
--multitask-weights 1,3,3,3 --num-epochs 5 \
--init-model zoo:blender/blender_90M/model \
--dict-file zoo:blender/blender_90M/model.dict \
--embedding-size 512 --n-layers 8 --ffn-size 2048 --dropout 0.1 --n-heads 16 \
--learn-positional-embeddings True --n-positions 512 --variant xlm --activation gelu --fp16 True \
--text-truncate 512 --label-truncate 128 --dict-tokenizer bpe --dict-lower True -lr 1e-06 \
--optimizer adamax --lr-scheduler reduceonplateau --gradient-clip 0.1 -veps 0.25 --betas 0.9,0.999 \
--update-freq 1 --attention-dropout 0.0 --relu-dropout 0.0 --skip-generation True -vp 15 -stim 60 \
--vme 20000 -bs 16 -vmt ppl -vmm min --save-after-valid True --model-file /projects/ParlAI/models/odkg_model
  1. Could you please check whether any parameters are missing here, or suggest any that would improve the learning and training process?
    • The currently trained model is able to learn and generate the custom responses, but since the current dataset does not contain the bot personas, it could not generate the appropriate bot responses. I hope to see it working after adding the persona information.

Thanks...

klshuster commented 2 years ago
  1. Length restrictions: No, I don't believe there are any length restrictions
  2. yes, the personas should stay consistent across 1 episode of dialogue, but for multiple episodes in a dataset they can change
  3. yes, that looks good for the dataset
  4. your parameters look good for training a BB1 model; however, you'll want to specify -m/--model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent. See e.g. the sample command in #4347 for more parameter details (however, you'll want to keep the architectural specifics you already have here if using the 90M model); a sketch of this change follows below
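For reference, a minimal sketch of that change using ParlAI's Python entry point (ParlAI scripts expose a .main() that takes the CLI flags as keyword arguments; only a few of the flags from the command above are repeated here):

from parlai.scripts.train_model import TrainModel

# Same tasks, data, and 90M architectural settings as the command above, but
# with the model switched to the BlenderBot2 FiD agent as suggested.
TrainModel.main(
    model='projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent',
    task='blended_skill_talk,convai2,fromfile:parlaiformat',
    fromfile_datapath='parlai_dataset/archi_test.txt',
    init_model='zoo:blender/blender_90M/model',
    dict_file='zoo:blender/blender_90M/model.dict',
    embedding_size=512, n_layers=8, ffn_size=2048, n_heads=16,
    model_file='/projects/ParlAI/models/odkg_model',
    # further BB2-specific flags turned out to be needed for the 90M model;
    # see the follow-up discussion below.
)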
geo47 commented 2 years ago

Hi @klshuster , Thanks for your response.

After adding personas to my dataset, when I train the model using the command I ran before, it does not seem to learn the persona context.

Then I replaced the model transformer/generator with projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent and added the --memory-key full_text option, with the same settings as before:

parlai train_model --model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
--task blended_skill_talk,convai2,fromfile:parlaiformat --fromfile_datapath parlai_dataset/archi_test.txt \
--multitask-weights 1,3,3,3 --num-epochs 5 \
--init-model zoo:blender/blender_90M/model \
--dict-file zoo:blender/blender_90M/model.dict \
--memory-key full_text \
--embedding-size 512 --n-layers 8 --ffn-size 2048 --dropout 0.1 --n-heads 16 \
--learn-positional-embeddings True --n-positions 512 --variant xlm --activation gelu --fp16 True \
--text-truncate 512 --label-truncate 128 --dict-tokenizer bpe --dict-lower True -lr 1e-06 \
--optimizer adamax --lr-scheduler reduceonplateau --gradient-clip 0.1 -veps 0.25 --betas 0.9,0.999 \
--update-freq 1 --attention-dropout 0.0 --relu-dropout 0.0 --skip-generation True -vp 15 -stim 60 \
--vme 20000 -bs 16 -vmt ppl -vmm min --save-after-valid True --model-file /projects/ParlAI/models/odkg_model

It gives the following error:

 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BlenderBot2FidModel:
        Missing key(s) in state_dict: "seq2seq_encoder.embeddings.weight"

When using the command from the issue above with --init-model zoo:blenderbot2/blenderbot2_90M/model --dict-file zoo:blenderbot2/blenderbot2_90M/model.dict:

parlai train_model -dp data \
--model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
--task fromfile:parlaiformat --fromfile_datapath dataset/archi_parlai.txt \
--num_epochs 20 \
--memory-decoder-model-file "" --memory-key full_text \
--search-query-generator-model-file zoo:blenderbot2/query_generator/model --search-query-generator-beam-min-length 2 \
--save-every-n-secs 600 --validation_every_n_secs 600 --log_every_n_secs 60 \
--init-model zoo:blenderbot2/blenderbot2_90M/model --dict-file zoo:blenderbot2/blenderbot2_90M/model.dict \
--datatype train:stream  \
--embeddings-scale True --variant prelayernorm --split-lines True --learn-positional-embeddings True \
--n-layers 12 --embedding-size 1024 --ffn-size 4096 --n-heads 16 --n-decoder-layers 12 \
--dict-tokenizer gpt2 --generation-model bart \
--query-model bert_from_parlai_rag \
--rag-model-type token --rag-retriever-type search_engine --search_server None \
--dpr-model-file zoo:hallucination/bart_rag_token/model \
--gold-document-titles-key select-docs-titles --insert-gold-docs True \
--beam-min-length 5 --beam-context-block-ngram 3 --beam-block-ngram 3 --beam-block-full-context False --beam-size 3 \
--inference beam --optimizer mem_eff_adam --learningrate 1e-05 --lr-scheduler-patience 1 --model-parallel True \
--knowledge-access-method memory_only --batchsize 16 \
--truncate 512 --text-truncate 512 --label-truncate 128 \
--dropout 0.0 --attention-dropout 0.0 \
--min-doc-token-length 64 --max-doc-token-length 256 \
--fp16 True --fp16-impl mem_efficient --force-fp16-tokens True \
--model-file /projects/ParlAI/models/odkg_model

It gives the following error:

ImportError: Could not find pretrained model in parlai.zoo.blenderbot2.blenderbot2_90M or parlai.zoo.blenderbot2.build. Please check your spelling and make sure you've pulled from master.

When I changed to --init-model zoo:blender/blender_90M/model --dict-file zoo:blender/blender_90M/model.dict, it gave me the following error:

size mismatch for embeddings.weight: copying a param with shape torch.Size([54944, 512]) from checkpoint, the shape in current model is torch.Size([99240, 1024]).
-----------------
Could not load the model due to a size mismatch in the embeddings. A common reason for this is trying to load a model trained with fp16 but loaded without fp16. Try adding --fp16 true or --force-fp16-tokens true.

The only way I was able to run the model was by using the exact same script with zoo:blenderbot2/blenderbot2_400M/model. The following is the full script I ran to train the model:

parlai train_model -dp data \
--model projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent \
--task fromfile:parlaiformat --fromfile_datapath dataset/archi_parlai.txt \
--num_epochs 20 \
--memory-decoder-model-file "" --memory-key full_text \
--search-query-generator-model-file zoo:blenderbot2/query_generator/model --search-query-generator-beam-min-length 2 \
--save-every-n-secs 600 --validation_every_n_secs 600 --log_every_n_secs 60 \
--init-model zoo:blenderbot2/blenderbot2_400M/model --dict-file zoo:blenderbot2/blenderbot2_400M/model.dict \
--datatype train:stream  \
--embeddings-scale True --variant prelayernorm --split-lines True --learn-positional-embeddings True \
--n-layers 12 --embedding-size 1024 --ffn-size 4096 --n-heads 16 --n-decoder-layers 12 \
--dict-tokenizer gpt2 --generation-model bart \
--query-model bert_from_parlai_rag \
--rag-model-type token --rag-retriever-type search_engine --search_server None \
--dpr-model-file zoo:hallucination/bart_rag_token/model \
--gold-document-titles-key select-docs-titles --insert-gold-docs True \
--beam-min-length 5 --beam-context-block-ngram 3 --beam-block-ngram 3 --beam-block-full-context False --beam-size 3 \
--inference beam --optimizer mem_eff_adam --learningrate 1e-05 --lr-scheduler-patience 1 --model-parallel True \
--knowledge-access-method memory_only --batchsize 1 \
--truncate 512 --text-truncate 512 --label-truncate 128 \
--dropout 0.0 --attention-dropout 0.0 \
--min-doc-token-length 64 --max-doc-token-length 256 \
--fp16 True --fp16-impl mem_efficient --force-fp16-tokens True \
--model-file projects/ParlAI/odkg_model

and the data format was:

text: Tell me a joke.[TAB]labels: Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine![TAB]episode_done: True[TAB]personas: your persona: I am John.|your persona: I live in Ohio.[NEW_LINE]
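A minimal Python sketch that emits this flattened format (the file path and example data are placeholders):

# Write one single-turn episode per line in the fromfile "text/labels" format
# shown above: tab-separated fields, persona lines joined by "|".
examples = [
    {
        "text": "Tell me a joke.",
        "labels": "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!",
        "personas": ["your persona: I am John.", "your persona: I live in Ohio."],
    },
]

with open("dataset/archi_parlai.txt", "w") as f:
    for ex in examples:
        fields = [
            "text:" + ex["text"],
            "labels:" + ex["labels"],
            "episode_done:True",
            "personas:" + "|".join(ex["personas"]),
        ]
        f.write("\t".join(fields) + "\n")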

The model trained successfully. However, it was not able to respond based on the bot personas. In my case (single-turn conversation), the same question should get different responses from different bots based on their personas. The confusion I have here is that, whichever persona is given, I get, for example:

response: Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!

So here are a few more questions:
1. Please check the above command and let me know the best way to train the model with appropriate parameters.
2. Do you think the model is able to learn to respond differently to the same question based on personas? If so, could you please guide me on how to achieve this?

Thanks...
klshuster commented 2 years ago
  1. If you set --memory-key full_text, then you'll want to prepend the persona lines to the text in your text field (see the sketch after this list); if you're multitasking with BST and convai2, I would recommend doing that so that it's consistent across datasets
  2. To get your first command to work (with the 90m model), you'll need to add the following options:
--memory-decoder-model-file '' --query-generator-model-file '' --generation-model transformer/generator --knowledge-access-method memory_only 
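A minimal sketch of the prepending from point 1 (the helper name is hypothetical):

# Fold the persona lines into the text field itself, so that
# --memory-key full_text will expose them to the memory mechanism.
def prepend_personas(text, personas):
    return "\n".join(list(personas) + [text])

print(prepend_personas(
    "tell me a joke.",
    ["your persona: I am John.", "your persona: I live in Ohio."],
))
# your persona: I am John.
# your persona: I live in Ohio.
# tell me a joke.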
geo47 commented 2 years ago

Hi,

Thanks for your responses and guidance. I am able to train the BB2 model on the custom dataset with persona context by adding --memory-key full_text. However, in my case the 90M model doesn't work (maybe some parameter issues).

Thanks!

geo47 commented 2 years ago

Hi,

I have another question regarding the dataset. Previously, I added personas for response generation. Apart from personas, how can we add dialog history to the dataset?

For example, in my current dataset given below:

  1. Can we add a history key with a few previous dialogs in the conversation sequence? And
  2. Will the projects.blenderbot2.agents.blenderbot2:BlenderBot2FidAgent model also learn the history information?
    {"dialog": [[{"id": "partner1", "text": "your persona:  I am John.\nyour persona: I live in Ohio.\ntell me a joke."}, {"id": "partner2", "text": "One time, I put strawberry jam on my burger. I thought it was ketchup!"}]]}
    {"dialog": [[{"id": "partner1", "text": "your persona:  I am Ellie.\nyour persona: I live in New York.\ntell me a joke."}, {"id": "partner2", "text": "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!"}]]}

Thanks

klshuster commented 2 years ago

So the dialog key is a list of messages within a conversational "episode", as we call it in ParlAI. The agent will automatically remember all previous conversation turns within a given episode, and will reset that history after the final example within the episode is shown.

Alternatively, you could also have each entry in your dataset be a "flattened" version of the episode, where you include all the previous utterances as delimited messages in the "text" field.

Adding a history key will not do anything at the moment.
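As an illustration of the two options above (hypothetical data), the same conversation as one multi-turn episode versus a single flattened example:

# Option 1: one episode with several turns; the agent accumulates the history
# automatically and resets it once the episode is done.
episode = {"dialog": [[
    {"id": "partner1", "text": "tell me a joke."},
    {"id": "partner2", "text": "Why was six afraid of seven? Because seven ate nine!"},
    {"id": "partner1", "text": "haha, tell me another one."},
    {"id": "partner2", "text": "One time, I put strawberry jam on my burger. I thought it was ketchup!"},
]]}

# Option 2: a flattened, self-contained example; the previous utterances are
# joined into the "text" field (ParlAI's default delimiter is a newline).
flattened = {
    "id": "partner1",
    "text": "tell me a joke.\nWhy was six afraid of seven? Because seven ate nine!\nhaha, tell me another one.",
    "labels": ["One time, I put strawberry jam on my burger. I thought it was ketchup!"],
    "episode_done": True,
}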

github-actions[bot] commented 2 years ago

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

geo47 commented 2 years ago

Hi @klshuster ,

I have another query regarding the dataset. With the above dataset, how can we add the context of the conversation or episode?

For example, the context could be a conversation topic: talk about movies, talk about work, talk about travel, etc., so that the conversation would strictly follow the topic and generate topic-related dialog responses.

Here is an example work: personaGPT

One way I found is making the dataset similar to blended_skill_talk, but with this setting I have a few queries:

  1. Can I add the topic in the Wizard of Wikipedia topic: key?
  2. If so, then how can we query the model at inference time: with a personas key, as in blender_agent.observe({'text': bot_input, "personas": npc_agent_list[npc], 'episode_done': True}), or with the persona prepended in bot_input, as in blender_agent.observe({'text': bot_input, 'episode_done': True})?

Should we use the same Wizard of Wikipedia topic: key to add the topic to the query during inference?

Thanks!

klshuster commented 2 years ago

If you add specialized keys beyond text, or the ones that BB2 normally uses, they will be ignored by the model. Your best bet is to either prepend the topic to the given text, or override the agent's observe function to specially process any keys you add.
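A minimal sketch of the second option (the subclass name and the 'topic' key are hypothetical; Message.force_set is ParlAI's way of modifying otherwise read-only message fields):

from parlai.core.message import Message
from projects.blenderbot2.agents.blenderbot2 import BlenderBot2FidAgent

class TopicAwareBB2Agent(BlenderBot2FidAgent):
    # Hypothetical subclass that folds a custom "topic" key into the text
    # before the standard observe logic runs, since unknown keys are ignored.
    def observe(self, observation):
        if isinstance(observation, Message) and observation.get('topic') and observation.get('text'):
            observation.force_set(
                'text',
                'topic: ' + observation['topic'] + '\n' + observation['text'],
            )
        return super().observe(observation)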

geo47 commented 2 years ago

Hi @klshuster

Thank you so much for the response. I have prepended the topic tag in the text, similar to the your persona: tag, just before the input query.

Do you think the model will be able to learn in this way...?

{"dialog": [[{"id": "partner1", "text": "your persona:  I am John.\nyour persona: I live in Ohio.\ntopic: talk about movies\ntell me a joke."}, {"id": "partner2", "text": "One time, I put strawberry jam on my burger. I thought it was ketchup!"}]]}
{"dialog": [[{"id": "partner1", "text": "your persona:  I am Ellie.\nyour persona: I live in New York.\ntopic: talk about movies\ntell me a joke."}, {"id": "partner2", "text": "Let me tell you my favorite joke. Why was six afraid of seven? Because seven ate nine!"}]]}

Thanks!

klshuster commented 2 years ago

Do you think the model will be able to learn in this way...?

if you have suitable training data, I would guess it'll learn something. Only one way to find out!

geo47 commented 2 years ago

if you have suitable training data, I would guess it'll learn something. Only one way to find out!

I tried this approach. The model seems to learn from the context and topics. However, it fails to handle responses for any random negative input query.

For example, the model recognizes the topic contexts, but for a given input, whatever the PC input is (positive or negative, i.e. random), it always generates the output response based on the script.

Since the data does not have negative samples, the model seems to learn only positive responses.