allenai / open-instruct


about tulu v2 conversion #88

Closed: nuoma closed 7 months ago

nuoma commented 7 months ago

Hi, I have two questions regarding the tulu v2 dataset.

Q1. If I'm understanding correctly, reformat_datasets.py is used to reproduce the tulu dataset: it converts a bunch of different datasets and concatenates them into tulu v2. I can see random subsampling in there; have you considered a more curated approach to subsampling from these datasets, and would it further improve performance? (Or perhaps that research question belongs in a different study at this granularity.)

Q2. I am trying to convert the tulu v2 dataset into the sharegpt multi-turn format to fit my existing code. I first download the 3 parquet files, turn them into jsonl (a sketch of that step follows the snippet below), and then convert the result into sharegpt format. However, I find the conversion extremely hard to perform, whether by regex or even manually: somehow there are always JSON formatting issues that hf.dataset fails to deal with. Here is a code snippet that I tried to use:

import re

def process_jsonl_file(input_file_path, output_file_path):
    # Regex patterns and replacements applied to each raw line of the dump
    replacements = [
        (r'^"', ''),  # remove the leading quotation mark at the start of each line
        (r'\}\]"', '"}]}'),  # close the conversations list and the wrapping object
        (r"\[{'role': 'user', 'content': \\", '{"conversations": [{"from": "human", "value": "'),  # escaped opening quote
        (r"\[{'role': 'user', 'content': '", '{"conversations": [{"from": "human", "value": "'),
        (r'\}\n \{', '}, {'),  # join adjacent message dicts
        (r"\[{'role': 'system', 'content': '", '{"conversations": [{"from": "human", "value": "'),  # 'system' role variant
    ]
    with open(input_file_path, encoding='utf-8') as fin, open(output_file_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            for pattern, replacement in replacements:
                line = re.sub(pattern, replacement, line)
            fout.write(line)

def correct_json_format(json_string):
    # Patch the specific role-transition patterns that show up mid-string,
    # mapping 'user' turns to "human" and 'assistant' turns to "gpt"
    corrected_string = json_string.replace("\"}\n {'role': 'user', 'content': '", "\"}, {\"from\": \"human\", \"value\": \"")
    corrected_string = corrected_string.replace("'}\n {'role': 'user', 'content': '", "\"}, {\"from\": \"human\", \"value\": \"")
    corrected_string = corrected_string.replace("'}\n {'role': 'user', 'content': \"", "\"}, {\"from\": \"human\", \"value\": \"")

    corrected_string = corrected_string.replace("'}\n {'role': 'assistant', 'content': \"", "\"}, {\"from\": \"gpt\", \"value\": \"")
    corrected_string = corrected_string.replace("'}\n {'role': 'assistant', 'content': '", "\"}, {\"from\": \"gpt\", \"value\": \"")
    corrected_string = corrected_string.replace("\"}\n {'role': 'assistant', 'content': '", "\"}, {\"from\": \"gpt\", \"value\": \"")

    # Return the corrected string
    return corrected_string
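
(For reference, the parquet-to-jsonl step mentioned above was roughly the following; a minimal sketch, where the shard names are placeholders for the three files I downloaded:)

from datasets import load_dataset

# Placeholder shard names; substitute the actual parquet files from the hub
files = ['train-00000.parquet', 'train-00001.parquet', 'train-00002.parquet']
raw = load_dataset('parquet', data_files=files)
raw['train'].to_json('tulu_v2_raw.jsonl', lines=True, force_ascii=False)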

This is the target format (from UltraChat):

{"conversations": [{"from": "human", "value": "Are there any X?"}, {"from": "gpt", "value": "Yes, there are X"}, {"from": "human", "value": "That sounds great! Can you Y?"}, {"from": "gpt", "value": "Sure, here are Y"}]}
{"conversations": [{"from": "human", "value": "What percentage A?"}, {"from": "gpt", "value": "About 71%."}, {"from": "human", "value": "Wow, that's B"}, {"from": "gpt", "value": "Yes, it certainly is! "}]}

Like, am I doing this correctly? I'm completely lost at this point.

Thank you for your help!

hamishivi commented 7 months ago

Hi!

1) Yes, we randomly subsampled many of the datasets to try and reduce the overall dataset size. More careful curation (as opposed to random sampling) would probably help improve performance, and there is lots of interesting research being done (and to be done) in this direction :)

2) Hugging Face will automatically load the JSON into Python-native formats, so you shouldn't need regex at all. I think what you want to do instead is something like this:

from datasets import load_dataset

ds = load_dataset('allenai/tulu-v2-sft-mixture')
new_data = []
# load_dataset returns a DatasetDict, so index into the 'train' split to iterate samples
for sample in ds['train']:
    messages = sample['messages']
    conversations = []
    for message in messages:
        conversations.append({"from": message['role'], 'value': message['content']})
    new_data.append(conversations)  # one sample is now converted
# save new_data
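
If you do need to salvage a jsonl dump you already made, note that the message lists in it are Python reprs rather than JSON, which is why regex keeps breaking. A rough sketch using ast.literal_eval instead, assuming each dumped line is a stringified list of role/content dicts:

import ast

def parse_dumped_line(line):
    # Lines look like "[{'role': 'user', 'content': '...'}, ...]" (Python repr, not JSON)
    messages = ast.literal_eval(line.strip())
    return [{"from": m['role'], "value": m['content']} for m in messages]

But re-exporting from the hub dataset as above is much less fragile.
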
nuoma commented 7 months ago

That works! Leaving the code here for reference. Thank you for your kind response, and best wishes to you.

from datasets import load_dataset
import json

ds = load_dataset('allenai/tulu-v2-sft-mixture')

# Iterate over the samples in the 'train' split
new_data = []
for sample in ds['train']:
    messages = sample['messages']
    conversation = []  # holds the converted turns for this sample
    for message in messages:
        # Map 'user' to 'human' and everything else (including 'system') to 'gpt'
        role = 'human' if message['role'] == 'user' else 'gpt'
        conversation.append({"from": role, 'value': message['content']})
    new_data.append({"conversations": conversation})  # wrap in the sharegpt-style dict

# Write to a JSONL file
with open('/content/drive/MyDrive/tulu_Sharegpt.jsonl', 'w', encoding='utf-8') as jsonl_file:
    for record in new_data:
        jsonl_file.write(json.dumps(record, ensure_ascii=False) + '\n')
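
As a quick sanity check, one can read the file back and confirm every line parses into the expected shape:

# Sanity check: each line should be valid JSON with a 'conversations' list
with open('/content/drive/MyDrive/tulu_Sharegpt.jsonl', encoding='utf-8') as f:
    for line in f:
        record = json.loads(line)
        assert isinstance(record['conversations'], list)
print('all lines parse cleanly')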