Empty message content in tulu-v2-sft-mixture hosted on HF

Hello,

First, thanks for sharing your work!

While using the published tulu-v2-sft-mixture hosted on HF, I noticed that some examples have empty content. While that is filtered during the loading/tokenization process, I believe that it will be worth removing these examples if you plan to release a new version on HF.

I'm using the latest dataset revision (3c6b232f7fe8f8325143eb19554111562423d768).

The following are the stats for the empty examples per dataset:

code_alpaca: 6
flan_v2: 13
sharegpt: 202
wizardlm: 49

Below is an example of the empty content (notice {'content': '', 'role': 'assistant'})

{'dataset': 'flan_v2',
 'empty': True,
 'id': 'flan_v2_40491',
 'messages': [{'content': 'In this task, you are given an input list A. You '
                          'need to find all the elements of the list that are '
                          'numbers in the same order as they appear in the '
                          'list A.\n'
                          '\n'
                          "Let me give you an example: ['238', 'h', '92', 'U', "
                          "'2799']\n"
                          'The answer to this example can be: 238, 92, 2799\n'
                          'Here is why: Here, the numbers in the input list in '
                          "order are '238', '92' and '2799'.\n"
                          '\n'
                          'OK. solve this:\n'
                          "['s', 'F', 'W']\n"
                          'Answer:',
               'role': 'user'},
              {'content': '', 'role': 'assistant'}]}

For `sharegpt, the problematic examples have an empty user turn before the assistant turn. This is not as problematic as Flan but it will generate useless chat template tokens.

{'dataset': 'sharegpt',
 'empty': True,
 'id': 'sharegpt_FeEfnVj_8',
 'messages': [{'content': 'continue chat', 'role': 'user'},
              {'content': '### Adding a column\n'
                          '\n'
                          'You can add a new column to a Pandas DataFrame by '
                          'assigning a Series or list to a new column name:\n'
                          '```python\n'
                          "data = [['Alice', 25], ['Bob', 30], ['Charlie', "
                          '35]]\n'
                          "columns = ['name', 'age']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          '# Add a new column with a constant value\n'
                          "df['gender'] = 'F'\n"
                          '\n'
                          '# Add a new column with a Series\n'
                          "s = pd.Series(['F', 'M', 'M'], index=['A', 'B', "
                          "'C'])\n"
                          "df['gender'] = s\n"
                          '```\n'
                          '### Adding row indices\n'
                          '\n'
                          'You can add row indices to a Pandas DataFrame using '
                          'the `index` parameter:\n'
                          '```python\n'
                          "data = [['Alice', 25], ['Bob', 30], ['Charlie', "
                          '35]]\n'
                          "columns = ['name', 'age']\n"
                          "index = ['A', 'B', 'C']\n"
                          'df = pd.DataFrame(data, columns=columns, '
                          'index=index)\n'
                          '```\n'
                          '### Convert one of the columns to an index\n'
                          '\n'
                          'You can convert one of the columns in a Pandas '
                          'DataFrame to the index using the `set_index` '
                          'method:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df.set_index('name', inplace=True)\n"
                          '```\n'
                          '### Recode an index\n'
                          '\n'
                          'You can recode the values of an index in a Pandas '
                          'DataFrame using the `rename` method:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df.set_index('name', inplace=True)\n"
                          "df.rename(index={'Alice': 'A', 'Bob': 'B', "
                          "'Charlie': 'C'}, inplace=True)\n"
                          '```\n'
                          '### Use multiple columns for indexing\n'
                          '\n'
                          'You can use multiple columns as the index of a '
                          'Pandas DataFrame by passing a list of column names '
                          'to the `set_index` method:\n'
                          '```python\n'
                          "data = [['Alice', 'Smith', 25], ['Bob', 'Johnson', "
                          "30], ['Charlie', 'Brown', 35]]\n"
                          "columns = ['first_name', 'last_name', 'age']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df.set_index(['first_name', 'last_name'], "
                          'inplace=True)\n'
                          '```\n'
                          '### Accessing values in a DataFrame\n'
                          '\n'
                          'You can access values in a Pandas DataFrame using '
                          'various methods:\n'
                          '```python\n'
                          "data = [['Alice', 25], ['Bob', 30], ['Charlie', "
                          '35]]\n'
                          "columns = ['name', 'age']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "df['name']  # Access column by name\n"
                          'df.loc[0]  # Access row by index label\n'
                          'df.iloc[0]  # Access row by index position\n'
                          "df.loc[0, 'name']  # Access element by index label "
                          'and column name\n'
                          'df.iloc[0, 0]  # Access element by index position '
                          'and column position\n'
                          '```\n'
                          '### Reading CSV files\n'
                          '\n'
                          'You can read a CSV file into a Pandas DataFrame '
                          'using the `read_csv` function:\n'
                          '```python\n'
                          "df = pd.read_csv('filename.csv')\n"
                          '```\n'
                          '### Filtering data\n'
                          '\n'
                          'You can filter rows in a Pandas DataFrame using '
                          'boolean indexing:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data\n'
                          '```',
               'role': 'assistant'},
              {'content': 'Continue chat', 'role': 'user'},
              {'content': 'columns=columns)\n'
                          '\n'
                          'Filter rows where age is greater than 30\n'
                          '========================================\n'
                          '\n'
                          "df[df['age'] > 30]\n"
                          '\n'
                          "Filter rows where gender is 'F'\n"
                          '===============================\n'
                          '\n'
                          "df[df['gender'] == 'F']\n"
                          '```bash\n'
                          '\n'
                          '### Sorting\n'
                          '\n'
                          'You can sort a Pandas DataFrame using the '
                          '`sort_values` method:\n'
                          '```python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          '# Sort by age in ascending order\n'
                          "df.sort_values('age')\n"
                          '\n'
                          '# Sort by age in descending order\n'
                          "df.sort_values('age', ascending=False)\n"
                          '\n'
                          '```### Summaries and descriptive statistics\n'
                          '\n'
                          'You can compute summaries and descriptive '
                          'statistics of a Pandas DataFrame using various '
                          'methods:\n'
                          '```\n'
                          'python\n'
                          "data = [['Alice', 25, 'F'], ['Bob', 30, 'M'], "
                          "['Charlie', 35, 'M']]\n"
                          "columns = ['name', 'age', 'gender']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          'df.describe()  # Compute summary statistics\n'
                          "df['age'].sum()  # Compute sum of column\n"
                          "df['age'].mean()  # Compute mean of column\n"
                          "df['age'].median()  # Compute median of column\n"
                          "df['age'].max()  # Compute maximum of column\n"
                          "df['age'].min()  # Compute minimum of column\n"
                          "df['gender'].value_counts()  # Compute counts of "
                          'unique values in column\n'
                          '```\n'
                          '### Descriptive statistics for categorical data\n'
                          '\n'
                          'You can compute descriptive statistics for '
                          'categorical data in a Pandas DataFrame using the '
                          '`crosstab` function:\n'
                          '```python\n'
                          "data = [['Alice', 'F', 'NY'], ['Bob', 'M', 'LA'], "
                          "['Charlie', 'M', 'NY']]\n"
                          "columns = ['name', 'gender', 'city']\n"
                          'df = pd.DataFrame(data, columns=columns)\n'
                          '\n'
                          "pd.crosstab(df['gender'], df['city'])\n"
                          '```',
               'role': 'assistant'},
              {'content': '', 'role': 'user'},
              {'content': 'Do you have any questions or is there anything you '
                          'would like me to explain further?',
               'role': 'assistant'}]}

Below is the code I used

from datasets import load_dataset

def flag_empty(e):
    e["empty"] = False
    for m in e["messages"]:
        if m["content"].strip() == "" and m["role"] != "system":
            e["empty"] = True
    return e

dataset = load_dataset("allenai/tulu-v2-sft-mixture", split="train")
_dataset = dataset.map(flag_empty, num_proc=8)
_empty_ds = _dataset.filter(lambda x: x["empty"], num_proc=8)

Thanks,

allenai / open-instruct

Empty message content in tulu-v2-sft-mixture hosted on HF #161