allenai / open-instruct

Apache License 2.0
1.21k stars 166 forks source link

Empty message content in tulu-v2-sft-mixture hosted on HF #161

Closed Am1n3e closed 4 months ago

Am1n3e commented 5 months ago

Hello,

First, thanks for sharing your work!

While using the published tulu-v2-sft-mixture hosted on HF, I noticed that some examples have empty content. While that is filtered during the loading/tokenization process, I believe that it will be worth removing these examples if you plan to release a new version on HF.

I'm using the latest dataset revision (3c6b232f7fe8f8325143eb19554111562423d768).

The following are the stats for the empty examples per dataset:

Below is an example of the empty content (notice {'content': '', 'role': 'assistant'})

{'dataset': 'flan_v2',
 'empty': True,
 'id': 'flan_v2_40491',
 'messages': [{'content': 'In this task, you are given an input list A. You '
                          'need to find all the elements of the list that are '
                          'numbers in the same order as they appear in the '
                          'list A.\n'
                          '\n'
                          "Let me give you an example: ['238', 'h', '92', 'U', "
                          "'2799']\n"
                          'The answer to this example can be: 238, 92, 2799\n'
                          'Here is why: Here, the numbers in the input list in '
                          "order are '238', '92' and '2799'.\n"
                          '\n'
                          'OK. solve this:\n'
                          "['s', 'F', 'W']\n"
                          'Answer:',
               'role': 'user'},
              {'content': '', 'role': 'assistant'}]}

For `sharegpt, the problematic examples have an empty user turn before the assistant turn. This is not as problematic as Flan but it will generate useless chat template tokens.

{'dataset': 'sharegpt',
 'empty': True,
 'id': 'sharegpt_FeEfnVj_8',
 'messages': [{'content': 'continue chat', 'role': 'user'},
              {'content': '### Adding a column\n'
                          '\n'
                          'You can add a new column to a Pandas DataFrame by '
                          'assigning a Series or list to a new column name:\n'
                          '```python\n'
                          "data = [['Alice', 25], ['Bob', 30], ['Charlie', "
                          '35]]\n'
                          "columns = ['name', 'age']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          '# Add a new column with a constant value\n'
                          "df['gender'] = 'F'\n"
                          '\n'
                          '# Add a new column with a Series\n'
                          "s = pd.Series(['F', 'M', 'M'], index=['A', 'B', "
                          "'C'])\n"
                          "df['gender'] = s\n"
                          '```\n'
                          '### Adding row indices\n'
                          '\n'
                          'You can add row indices to a Pandas DataFrame using '
                          'the `index` parameter:\n'
                          '```python\n'
                          "data = [['Alice', 25], ['Bob', 30], ['Charlie', "
                          '35]]\n'
                          "columns = ['name', 'age']\n"
                          "index = ['A', 'B', 'C']\n"
                          'df = pd.DataFrame(data, columns=columns, '
                          'index=index)\n'
                          '```\n'
                          '### Convert one of the columns to an index\n'
                          '\n'
                          'You can convert one of the columns in a Pandas '
                          'DataFrame to the index using the `set_index` '
                          'method:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df.set_index('name', inplace=True)\n"
                          '```\n'
                          '### Recode an index\n'
                          '\n'
                          'You can recode the values of an index in a Pandas '
                          'DataFrame using the `rename` method:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df.set_index('name', inplace=True)\n"
                          "df.rename(index={'Alice': 'A', 'Bob': 'B', "
                          "'Charlie': 'C'}, inplace=True)\n"
                          '```\n'
                          '### Use multiple columns for indexing\n'
                          '\n'
                          'You can use multiple columns as the index of a '
                          'Pandas DataFrame by passing a list of column names '
                          'to the `set_index` method:\n'
                          '```python\n'
                          "data = [['Alice', 'Smith', 25], ['Bob', 'Johnson', "
                          "30], ['Charlie', 'Brown', 35]]\n"
                          "columns = ['first_name', 'last_name', 'age']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df.set_index(['first_name', 'last_name'], "
                          'inplace=True)\n'
                          '```\n'
                          '### Accessing values in a DataFrame\n'
                          '\n'
                          'You can access values in a Pandas DataFrame using '
                          'various methods:\n'
                          '```python\n'
                          "data = [['Alice', 25], ['Bob', 30], ['Charlie', "
                          '35]]\n'
                          "columns = ['name', 'age']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df['name']  # Access column by name\n"
                          'df.loc[0]  # Access row by index label\n'
                          'df.iloc[0]  # Access row by index position\n'
                          "df.loc[0, 'name']  # Access element by index label "
                          'and column name\n'
                          'df.iloc[0, 0]  # Access element by index position '
                          'and column position\n'
                          '```\n'
                          '### Reading CSV files\n'
                          '\n'
                          'You can read a CSV file into a Pandas DataFrame '
                          'using the `read_csv` function:\n'
                          '```python\n'
                          "df = pd.read_csv('filename.csv')\n"
                          '```\n'
                          '### Filtering data\n'
                          '\n'
                          'You can filter rows in a Pandas DataFrame using '
                          'boolean indexing:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data\n'
                          '```',
               'role': 'assistant'},
              {'content': 'Continue chat', 'role': 'user'},
              {'content': 'columns=columns)\n'
                          '\n'
                          'Filter rows where age is greater than 30\n'
                          '========================================\n'
                          '\n'
                          "df[df['age'] > 30]\n"
                          '\n'
                          "Filter rows where gender is 'F'\n"
                          '===============================\n'
                          '\n'
                          "df[df['gender'] == 'F']\n"
                          '```bash\n'
                          '\n'
                          '### Sorting\n'
                          '\n'
                          'You can sort a Pandas DataFrame using the '
                          '`sort_values` method:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          '# Sort by age in ascending order\n'
                          "df.sort_values('age')\n"
                          '\n'
                          '# Sort by age in descending order\n'
                          "df.sort_values('age', ascending=False)\n"
                          '\n'
                          '```### Summaries and descriptive statistics\n'
                          '\n'
                          'You can compute summaries and descriptive '
                          'statistics of a Pandas DataFrame using various '
                          'methods:\n'
                          '```\n'
                          'python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          'df.describe()  # Compute summary statistics\n'
                          "df['age'].sum()  # Compute sum of column\n"
                          "df['age'].mean()  # Compute mean of column\n"
                          "df['age'].median()  # Compute median of column\n"
                          "df['age'].max()  # Compute maximum of column\n"
                          "df['age'].min()  # Compute minimum of column\n"
                          "df['gender'].value_counts()  # Compute counts of "
                          'unique values in column\n'
                          '```\n'
                          '### Descriptive statistics for categorical data\n'
                          '\n'
                          'You can compute descriptive statistics for '
                          'categorical data in a Pandas DataFrame using the '
                          '`crosstab` function:\n'
                          '```python\n'
                          "data = [['Alice', 'F', 'NY'], ['Bob', 'M', 'LA'], "
                          "['Charlie', 'M', 'NY']]\n"
                          "columns = ['name', 'gender', 'city']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "pd.crosstab(df['gender'], df['city'])\n"
                          '```',
               'role': 'assistant'},
              {'content': '', 'role': 'user'},
              {'content': 'Do you have any questions or is there anything you '
                          'would like me to explain further?',
               'role': 'assistant'}]}

Below is the code I used

from datasets import load_dataset

def flag_empty(e):
    e["empty"] = False
    for m in e["messages"]:
        if m["content"].strip() == "" and m["role"] != "system":
            e["empty"] = True
    return e

dataset = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
_dataset = dataset.map(flag_empty, num_proc=8)
_empty_ds = _dataset.filter(lambda x: x["empty"], num_proc=8)

Thanks,

hamishivi commented 4 months ago

Thanks for pointing this out! For reproducibility reasons, we will probably not remove these from the mixture release but we will definitely make sure to filter out these cases for future releases. I'll add a note on the huggingface page.