facebookresearch / ParlAI

A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
https://parl.ai
MIT License
10.49k stars 2.1k forks

BB3x Multi-turn Conversation #5063

Closed by conceptofmind 1 year ago

conceptofmind commented 1 year ago

Hi all,

What is the recommended way to export the data from BB3x in a format where each multi-turn conversation is combined into a single fluid example, like this one?

Are there any new games you've been playing on your phone? I play among us on my phone I can not believe it is almost thanksgiving. I am ready to eat! Are you? What are your favorite foods? Thanksgiving isn’t really that soon What kind of food do you like? I know a lot of people don't like turkey because it is dry.

I need to export the data, which contains ~261k conversations, as JSON files.

Thank you,

Enrico

mojtaba-komeili commented 1 year ago

I think you can find what you are looking for in our flattening mutator.
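
For readers unfamiliar with the idea, flattening folds the earlier turns of an episode into the text of each later turn, so that every example carries its full history. The sketch below only illustrates that idea in plain Python. It is not ParlAI's implementation: the function name flatten_episode and the "text"/"label" keys of the toy turns are assumptions made for this example.

```python
def flatten_episode(turns):
    """Yield one single-turn example per turn, with all earlier turns
    folded into its 'text' field (newline-separated)."""
    history = []
    for turn in turns:
        history.append(turn["text"])
        yield {
            "text": "\n".join(history),  # full history up to this turn
            "label": turn["label"],
            "episode_done": True,        # each flattened example stands alone
        }
        history.append(turn["label"])    # the reply joins the history too


# Toy two-turn episode (hypothetical structure, short strings).
episode = [
    {"text": "I play among us on my phone",
     "label": "I can not believe it is almost thanksgiving."},
    {"text": "Thanksgiving isn't really that soon",
     "label": "What kind of food do you like?"},
]

flat = list(flatten_episode(episode))
for example in flat:
    print(repr(example["text"]), "->", repr(example["label"]))
```

One consequence of this scheme is that a conversation with N turns yields N examples, each with a longer context than the one before.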

conceptofmind commented 1 year ago

Hi @mojtaba-komeili,

When I run parlai convert_to_json --task projects.bb3x.tasks.agents:BB3DataBotTeacher --datatype train:evalmode --mutators flatten

I get the output:

{"dialog": [[{"user_pseudo_id": "69e0129e9e584e22e191750b250d4c94", "convo_id": "f46f92643b8944ce89b0a6316de2156c", "label_info": {"text": "I can not believe it is almost thanksgiving. I am ready to eat! Are you? What are your favorite foods?", "sender": "bot", "memories": ["Person 2's Persona: I like Ready Player One.", "Person 2's Persona: I have mixed feelings about Facebook.", "Person 1's Persona: I know about the company Meta.", "Person 2's Persona: I am concerned about privacy.", "Person 1's Persona: I have mixed feelings.", "Person 2's Persona: I am concerned about the environment.", "Person 1's Persona: I am aware of the privacy issues with meta.", "Person 2's Persona: I am concerned about Google Drive.", "Person 1's Persona: I am concerned about Google's privacy policies.", "Person 2's Persona: I use Facebook.", "Person 1's Persona: I avoid Google products and services.", "Person 2's Persona: I know what Reddit is.", "Person 1's Persona: I don't use social media. I read Reddit.", "Person 2's Persona: I use Reddit.", "Person 1's Persona: I like the movie.", "Person 1's Persona: I think privacy is a luxury.", "Person 2's Persona: I know about Google.", "Person 1's Persona: I don't know if Google owns Reddit.", "Person 1's Persona: I like video games and reading.", "Person 2's Persona: I worry about my privacy online.", "Person 1's Persona: I don't like to be hostile.", "Person 2's Persona: I like to argue and fight online.", "Person 1's Persona: I don't use social media. I value privacy.", "Person 2's Persona: I have kids. I am unemployed.", "Person 1's Persona: I am in school. 
I want to work for NASA.", "Person 2's Persona: I have kids.", "Person 2's Persona: I play video games.", "Person 2's Persona: I have school work.", "Person 1's Persona: I am busy.", "Person 2's Persona: I have friends and family.", "Person 1's Persona: I am concerned with privacy.", "Person 2's Persona: I like board games.", "Person 1's Persona: I play games on my phone.", "Person 2's Persona: I like to eat.", "Person 1's Persona: I play Among Us on my phone."], "search_decision": "do not search", "memory_decision": "access memory", "memory_knowledge": "Person 2's Persona: I have mixed feelings about Facebook.", "search_query": "", "search_knowledge": "", "search_knowledge_doc_titles": [""], "search_knowledge_doc_content": [""], "search_knowledge_doc_urls": [""], "is_liked": false, "is_safety_controlled_response": false, "safety": {"duo_safety": "__ok__", "string_matcher": "__ok__"}, "reward": "Predicted class: __notok__\nwith probability: 0.9384"}, "adversarial_human": true, "text": "Are there any new games you've been playing on your phone?\nI play among us on my phone", "episode_done": true, "id": "BaseBB3DataTeacher", "eval_labels": ["I can not believe it is almost thanksgiving. I am ready to eat! Are you? What are your favorite foods?"]}, {"id": "RepeatLabelAgent", "text": "I can not believe it is almost thanksgiving. I am ready to eat! Are you? What are your favorite foods?", "episode_done": false, "metrics": {"exs": 1, "accuracy": 1.0, "precision": 1.0, "recall": 1.0, "f1": 1.0, "bleu-4": 1.0}}]], "context": [], "metadata_path": "tmp.metadata"}

Should I be extracting the strings from memories? It is not clear to me which fields should be used. There are also 2.6 million examples in this set. How should one recover exactly the 260k conversations?

The particular example above is not very coherent.

Thank you,

Enrico

mojtaba-komeili commented 1 year ago

The field you are looking for is text. That said, I am not sure why you are converting the dataset to JSON: if you are working within ParlAI, you should be able to use the teacher directly. For example, parlai display_data --task projects.bb3x.tasks.agents:BB3DataBotTeacher --datatype train:evalmode is a command similar to the one you used here, for displaying the flattened conversations. You can use similar arguments to train or evaluate a model on the flattened conversations. I hope this helps, and do not hesitate to ask if you need further clarification.
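
If a JSON export is still needed, the text and eval_labels fields can be combined after the fact. The following is a minimal sketch, assuming the export contains one JSON object per line with the structure shown in the output above. The helper names (episode_to_string, read_records, longest_per_conversation) are hypothetical and not part of ParlAI. Keeping only the longest example per convo_id is an untested guess at how to reduce the 2.6 million examples to one entry per conversation, not a confirmed recipe.

```python
import json


def episode_to_string(record, sep=" "):
    """Join the flattened context and the first label of one exported record."""
    teacher_msg = record["dialog"][0][0]        # first message is the teacher's
    context = teacher_msg["text"].split("\n")   # one utterance per line
    labels = teacher_msg.get("labels") or teacher_msg.get("eval_labels") or []
    return sep.join(context + list(labels[:1]))


def read_records(path):
    """Yield records from an export file (assumed: one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def longest_per_conversation(records):
    """Keep the longest joined string per convo_id (assumption: the longest
    example of a conversation contains the whole conversation)."""
    best = {}
    for record in records:
        convo_id = record["dialog"][0][0].get("convo_id")
        text = episode_to_string(record)
        if convo_id not in best or len(text) > len(best[convo_id]):
            best[convo_id] = text
    return best


# Record trimmed from the output posted above.
sample = {
    "dialog": [[
        {
            "convo_id": "f46f92643b8944ce89b0a6316de2156c",
            "text": "Are there any new games you've been playing on your phone?\n"
                    "I play among us on my phone",
            "eval_labels": ["I can not believe it is almost thanksgiving. "
                            "I am ready to eat! Are you? What are your favorite foods?"],
        },
        {"id": "RepeatLabelAgent"},
    ]]
}

print(episode_to_string(sample))
```

For a full export, longest_per_conversation(read_records(path)) would return a dictionary that maps each convo_id to a single string, which can then be written out with json.dump.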
