Proposal: Use OA compatible jsonl message format for multi-turn conversations

andreaskoepf commented 1 year ago

We need a dataset file format that allows multi-turn conversations. Currently we ask people to contribute datasets as parquet files with a simple column structure: INSTRUCTION, RESPONSE, SOURCE, METADATA, see datasets/README.md.

In the Open-Assistant HF collection backend we use jsonl (or jsonl.gz) as import/export file format. We could use a thread variant of this format to store multi-turn conversations and use it as our official OA conversation dataset format. The core structure would look as follows (here shown formatted with indentation, in the jsonl files it would be encoded as one json object per line):

{
    "thread": [
        {
            "text": "Hola, \u00bfqu\u00e9 eres?",
            "role": "prompter"
        },
        {
            "text": "Soy una inteligencia Artificial (..)",
            "role": "assistant"
        }
    ],
    "source": "wikipedia",
    "meta": { "value": 123 },
}

This format would be "compatible" to the full oasst import/export format. A full thread export from oaast looks as follows (again shown indented for readability):

{
    "thread": [
        {
            "message_id": "77b151ac-e001-4b19-9afd-eb9cabf5cfbc",
            "text": "What are some of the pro's and con's of social media?",
            "role": "prompter",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "synthetic": false,
            "emojis": {
                "+1": 6,
                "_skip_reply": 1,
                "_skip_ranking": 1
            }
        },
        {
            "message_id": "d80c6b1b-4c50-4d07-a20e-56476fc6e4ce",
            "parent_id": "77b151ac-e001-4b19-9afd-eb9cabf5cfbc",
            "text": "Here are some potential pros and cons of social media: (..)",
            "role": "assistant",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "rank": 0,
            "synthetic": false,
            "emojis": {
                "+1": 6
            }
        },
        {
            "message_id": "3f458cb6-4b61-40cd-96fe-b6d7c06a2c53",
            "parent_id": "d80c6b1b-4c50-4d07-a20e-56476fc6e4ce",
            "text": "Why does it affect mental health?",
            "role": "prompter",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "synthetic": false,
            "emojis": {
                "+1": 2,
                "_skip_reply": 1,
                "_skip_ranking": 1,
                "_skip_labeling": 1
            }
        },
        {
            "message_id": "fa12350e-8899-49b8-842b-f82cd6bc8676",
            "parent_id": "3f458cb6-4b61-40cd-96fe-b6d7c06a2c53",
            "text": "Social media can affect mental health in many ways(..)",
            "role": "assistant",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "rank": 0,
            "synthetic": false,
            "emojis": {
                "+1": 2,
                "_skip_labeling": 2
            },
            "labels": {
                "spam": {
                    "value": 0.0,
                    "count": 3
                },
                "fails_task": {
                    "value": 0.0,
                    "count": 2
                },
                "lang_mismatch": {
                    "value": 0.0,
                    "count": 3
                },
                "pii": {
                    "value": 0.0,
                    "count": 2
                },
                "not_appropriate": {
                    "value": 0.0,
                    "count": 2
                },
                "hate_speech": {
                    "value": 0.0,
                    "count": 2
                },
                "sexual_content": {
                    "value": 0.0,
                    "count": 2
                },
                "quality": {
                    "value": 0.5,
                    "count": 3
                },
                "toxicity": {
                    "value": 0.0,
                    "count": 2
                },
                "humor": {
                    "value": 0.0,
                    "count": 2
                },
                "helpfulness": {
                    "value": 0.5,
                    "count": 2
                },
                "creativity": {
                    "value": 0.25,
                    "count": 2
                },
                "violence": {
                    "value": 0.0,
                    "count": 2
                }
            }
        }
    ]
}

The additional properties shown here are optional, only the "text" field would really be mandatory (and maybe "role") for each message. The "lang" field could be added for multi-lingual datasets. Additional properties could be added in custom fields.

Handling jsonl

In many languages jsonl data can be generated and parsed easily (i.e only with standard libraries in very few lines). The following is an example of loading jsonl (used in model_training/custom_datasets/oasst_dataset.py):

The main loading code for jsonl/jsonl.gz in python is as simple as:

    if input_file_path.suffix == ".gz":
        file_in = gzip.open(str(input_file_path), mode="tr", encoding="UTF-8")
    else:
        file_in = input_file_path.open("r", encoding="UTF-8")

    with file_in:
        # read one message tree per line
        for line in file_in:
            dict_tree = json.loads(line)

            # validate data
            tree: ExportMessageTree = pydantic.parse_obj_as(ExportMessageTree, dict_tree)

parquet vs. jsonl

Here are some (very subjective and not complete) cons for parquet & jsonl:

parquet:

tabular data, makes it hard to store nested things or additional properties
harder to process than json (requires non-standard libs)
not specifically optimized for variable size data like text data
has row-group binary structure which seems optimal for columns with similar values or to load subset of column, but features are not required or used in OA code (to my knowledge)
advantage of compression feature compared to jsonl.gz (or other compressors) questionable, parquet theoretically allows partial loading with correct row-group size, but also rarely used

jsonl:

repeats property names for each element (requires compression for efficient storage)
string become a bit longer due to standard encoding/escaping of special characters
seeking in file natively not possible, normally read begin to end
allows more freedom of the structure, e.g. arbitrary complex json (can make it harder to handle)

Alternative multi-turn tabular format

(In case we cannot agree on jsonl/oa format an alternative would be to define a tabular multi-turn conversations format that is close to the current one by adding two columns like CONVERSATION_ID and ROUND:

CONVERSATION_ID (string)
ROUND (int32)
INSTRUCTION (string): Instruction text
RESPONSE (string): Expected response to the instruction
SOURCE (string): Original data source short name, e.g. "wikipedia"
METADATA (JSON string, optional): Any other useful information stored in JSON

This would be similar to other datasets like the empathetic_dialogues dataset.)

There many different dataset formats for multi-turn conversations, same examples:

empathetic_dialogues (CSV)
conv_ai_2 (json)
blended_skill_talk (json)
daily_dialog (custom zip compressed text with markers like __eou__)
personachat (CSV)

umbra-scientia commented 1 year ago

If we could have a column for "URLs and references used" in each turn (and ideally a matching input field in the front-end), it would be useful for fine-tuning information retrieval later on. I think something very simple is enough. Example:

    "urls": ["https://arxiv.org/abs/1706.03762"],

The rest can be done later when processing records into training data, as long as we have the URLs.

(If this is the wrong place for this suggestion, I apologize and will move/delete the comment.)

dctanner commented 1 year ago

I like the jsonl format proposed. It seems sensible to have a consistent format for all our data, and json in general is more flexible if we do want to extend it later on. If we confirm this change we should make sure the docs are updated at the same time /cc @Vechtomov

Vechtomov commented 1 year ago

Obviously jsonl is easier for storing and processing dialogs and especially multi-turn dialogs. I think we can use it for these types of datasets. /cc @christophschuhmann

huu4ontocord commented 1 year ago

I also like jsonl.

totuta commented 1 year ago

I would also add +1 to the suggested jsonl format.

totuta commented 1 year ago

btw, if we find a consensus on the format, would there be any action item from this proposal?

Vechtomov commented 1 year ago

I'll make a PR. But I found a little confusing behavior: when you upload a jsonl file via Dataset("dataset.jsonl").push_to_hub(...) it is converted into parquet. Also even if you upload the file manually or via git lfs it will still be converted internally to parquet on your machine when you download it via load_dataset.

andreaskoepf commented 1 year ago

it is converted into parquet. Also even if you upload the file manually or via git lfs it will still be converted internally to parquet on your machine when you download it via load_dataset

OK, we probably have to look which ways are available to customize the Huggingface datasets library loading-mechanism. Research could start here: https://huggingface.co/docs/datasets/dataset_script

LAION-AI / Open-Assistant