Closed nuoma closed 7 months ago
Hi!
1) Yes, we randomly subsampled many of the datasets to reduce the overall dataset size. More careful curation (as opposed to random sampling) would probably help improve performance, and there is a lot of interesting research being done (and to be done) in this direction :)
2) Huggingface will automatically load the JSON into Python-native formats, so you shouldn't need regex at all. I think what you want is something like this:
```python
from datasets import load_dataset

ds = load_dataset('allenai/tulu-v2-sft-mixture')
new_data = []
for sample in ds['train']:  # load_dataset returns a DatasetDict, so iterate a split
    messages = sample['messages']
    conversations = []
    for message in messages:
        conversations.append({"from": message['role'], 'value': message['content']})
    new_data.append(conversations)  # one sample is now converted
# save new_data
```
That works! Leaving the code here for reference. Thank you for your kind response, and best wishes to you.
```python
from datasets import load_dataset
import json

ds = load_dataset('allenai/tulu-v2-sft-mixture')

# Access the 'train' split of the dataset
train_dataset = ds['train']

# Iterate over the samples in the train split
new_data = []
for sample in train_dataset:
    messages = sample['messages']
    conversation = []  # holds the converted messages for this sample
    for message in messages:
        # Map 'user' and 'assistant' to ShareGPT's 'human' and 'gpt'
        role = 'human' if message['role'] == 'user' else 'gpt'
        conversation.append({"from": role, 'value': message['content']})
    new_data.append({"conversations": conversation})  # wrap the conversation in another dict

# Write to a JSONL file
with open('/content/drive/MyDrive/tulu_Sharegpt.jsonl', 'w', encoding='utf-8') as jsonl_file:
    for item in new_data:
        jsonl_file.write(json.dumps(item, ensure_ascii=False) + '\n')
```
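As a quick sanity check, the JSONL file can be read back line by line and the roles verified. Here is a minimal, self-contained sketch (the file path and sample conversation are illustrative, not from the actual dataset):

```python
import json

# Illustrative sample already in ShareGPT format
sample = {"conversations": [
    {"from": "human", "value": "What is Tulu v2?"},
    {"from": "gpt", "value": "An instruction-tuning dataset mixture."},
]}

path = 'tulu_sharegpt_check.jsonl'  # hypothetical path for the check
with open(path, 'w', encoding='utf-8') as f:
    f.write(json.dumps(sample, ensure_ascii=False) + '\n')

# Each JSONL line is an independent JSON document, so parse line by line
with open(path, encoding='utf-8') as f:
    rows = [json.loads(line) for line in f]

assert rows == [sample]
assert all(m['from'] in ('human', 'gpt')
           for row in rows for m in row['conversations'])
```

This also demonstrates why JSONL sidesteps the formatting issues mentioned above: a bad record breaks only its own line, not the whole file.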
Hi, I have two questions regarding the Tulu v2 dataset.
Q1. If I'm understanding correctly, reformat_datasets.py is used to reproduce the Tulu dataset: it converts a bunch of different datasets and concatenates them into Tulu v2. I can see random subsampling in the script; have you considered a more curated approach to subsampling from these datasets? Would it help further increase performance? (Or perhaps that research question belongs in a different study.)
Q2. I am trying to convert the Tulu v2 dataset into the ShareGPT multi-turn format to adapt it to my existing code. I first downloaded the 3 parquet files, turned them into JSONL, and then tried to convert that into the ShareGPT format. However, I find the conversion extremely hard to perform, either by regex or even manually; somehow there are always some JSON formatting issues that hf.dataset fails to deal with. Here is a code snippet that I tried to use:
This is the target format (from UltraChat):
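A minimal sample of this format, consistent with the role mapping in the conversion code above (the conversation content itself is illustrative):

```json
{
  "conversations": [
    {"from": "human", "value": "first user turn"},
    {"from": "gpt", "value": "first assistant turn"},
    {"from": "human", "value": "second user turn"},
    {"from": "gpt", "value": "second assistant turn"}
  ]
}
```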
Like, am I doing this correctly? I'm completely lost at this point.
Thank you for your help!