LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0

Proposal: Use OA compatible jsonl message format for multi-turn conversations #1911

Closed andreaskoepf closed 1 year ago

andreaskoepf commented 1 year ago

We need a dataset file format that allows multi-turn conversations. Currently we ask people to contribute datasets as parquet files with a simple column structure: INSTRUCTION, RESPONSE, SOURCE, METADATA (see datasets/README.md).

In the Open-Assistant HF collection backend we use jsonl (or jsonl.gz) as the import/export file format. We could use a thread variant of this format to store multi-turn conversations and use it as our official OA conversation dataset format. The core structure would look as follows (shown here with indentation for readability; in the jsonl files each object would be encoded on a single line):

{
    "thread": [
        {
            "text": "Hola, \u00bfqu\u00e9 eres?",
            "role": "prompter"
        },
        {
            "text": "Soy una inteligencia Artificial (..)",
            "role": "assistant"
        }
    ],
    "source": "wikipedia",
    "meta": { "value": 123 },
}
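
As a sanity check that the format is trivial to produce, here is a minimal Python sketch for writing such a record as jsonl (file name and record content are illustrative):

    import json

    thread_record = {
        "thread": [
            {"text": "Hola, ¿qué eres?", "role": "prompter"},
            {"text": "Soy una inteligencia Artificial (..)", "role": "assistant"},
        ],
        "source": "wikipedia",
        "meta": {"value": 123},
    }

    # one json object per line; ensure_ascii=False keeps non-ASCII text readable
    with open("threads.jsonl", "w", encoding="UTF-8") as f:
        f.write(json.dumps(thread_record, ensure_ascii=False) + "\n")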

This format would be "compatible" with the full oasst import/export format. A full thread export from oasst looks as follows (again shown indented for readability):

{
    "thread": [
        {
            "message_id": "77b151ac-e001-4b19-9afd-eb9cabf5cfbc",
            "text": "What are some of the pro's and con's of social media?",
            "role": "prompter",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "synthetic": false,
            "emojis": {
                "+1": 6,
                "_skip_reply": 1,
                "_skip_ranking": 1
            }
        },
        {
            "message_id": "d80c6b1b-4c50-4d07-a20e-56476fc6e4ce",
            "parent_id": "77b151ac-e001-4b19-9afd-eb9cabf5cfbc",
            "text": "Here are some potential pros and cons of social media: (..)",
            "role": "assistant",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "rank": 0,
            "synthetic": false,
            "emojis": {
                "+1": 6
            }
        },
        {
            "message_id": "3f458cb6-4b61-40cd-96fe-b6d7c06a2c53",
            "parent_id": "d80c6b1b-4c50-4d07-a20e-56476fc6e4ce",
            "text": "Why does it affect mental health?",
            "role": "prompter",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "synthetic": false,
            "emojis": {
                "+1": 2,
                "_skip_reply": 1,
                "_skip_ranking": 1,
                "_skip_labeling": 1
            }
        },
        {
            "message_id": "fa12350e-8899-49b8-842b-f82cd6bc8676",
            "parent_id": "3f458cb6-4b61-40cd-96fe-b6d7c06a2c53",
            "text": "Social media can affect mental health in many ways(..)",
            "role": "assistant",
            "lang": "en",
            "review_count": 3,
            "review_result": true,
            "deleted": false,
            "rank": 0,
            "synthetic": false,
            "emojis": {
                "+1": 2,
                "_skip_labeling": 2
            },
            "labels": {
                "spam": {
                    "value": 0.0,
                    "count": 3
                },
                "fails_task": {
                    "value": 0.0,
                    "count": 2
                },
                "lang_mismatch": {
                    "value": 0.0,
                    "count": 3
                },
                "pii": {
                    "value": 0.0,
                    "count": 2
                },
                "not_appropriate": {
                    "value": 0.0,
                    "count": 2
                },
                "hate_speech": {
                    "value": 0.0,
                    "count": 2
                },
                "sexual_content": {
                    "value": 0.0,
                    "count": 2
                },
                "quality": {
                    "value": 0.5,
                    "count": 3
                },
                "toxicity": {
                    "value": 0.0,
                    "count": 2
                },
                "humor": {
                    "value": 0.0,
                    "count": 2
                },
                "helpfulness": {
                    "value": 0.5,
                    "count": 2
                },
                "creativity": {
                    "value": 0.25,
                    "count": 2
                },
                "violence": {
                    "value": 0.0,
                    "count": 2
                }
            }
        }
    ]
}

The additional properties shown here are optional; only the "text" field (and perhaps "role") would be mandatory for each message. The "lang" field could be added for multi-lingual datasets. Additional properties could be added in custom fields.
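
To make the mandatory/optional split concrete, here is a minimal pydantic sketch; the model and field names are hypothetical, not the actual oasst_data schemas:

    from typing import List, Literal, Optional

    import pydantic

    # hypothetical minimal models; the real export schemas live in the oasst-data package
    class ThreadMessage(pydantic.BaseModel):
        text: str  # the only truly mandatory field
        role: Optional[Literal["prompter", "assistant"]] = None
        lang: Optional[str] = None  # e.g. "en", for multi-lingual datasets

    class ThreadRecord(pydantic.BaseModel):
        thread: List[ThreadMessage]
        source: Optional[str] = None
        meta: Optional[dict] = None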

Handling jsonl

In many languages, jsonl data can be generated and parsed easily (e.g. with only standard libraries, in a few lines of code). The following is an example of loading jsonl/jsonl.gz in Python (adapted from model_training/custom_datasets/oasst_dataset.py):

    import gzip
    import json

    import pydantic
    from oasst_data import ExportMessageTree  # pydantic export schema from the oasst-data package

    # input_file_path is a pathlib.Path pointing to a .jsonl or .jsonl.gz file
    if input_file_path.suffix == ".gz":
        file_in = gzip.open(str(input_file_path), mode="tr", encoding="UTF-8")
    else:
        file_in = input_file_path.open("r", encoding="UTF-8")

    with file_in:
        # read one message tree per line
        for line in file_in:
            dict_tree = json.loads(line)

            # validate data
            tree: ExportMessageTree = pydantic.parse_obj_as(ExportMessageTree, dict_tree)

parquet vs. jsonl

Here are some (admittedly subjective and incomplete) cons for parquet & jsonl:

parquet:

jsonl:

Alternative multi-turn tabular format

In case we cannot agree on the jsonl/OA format, an alternative would be to define a tabular multi-turn conversation format close to the current one by adding two columns, CONVERSATION_ID and ROUND:

  1. CONVERSATION_ID (string)
  2. ROUND (int32)
  3. INSTRUCTION (string): Instruction text
  4. RESPONSE (string): Expected response to the instruction
  5. SOURCE (string): Original data source short name, e.g. "wikipedia"
  6. METADATA (JSON string, optional): Any other useful information stored in JSON

This would be similar to other datasets like the empathetic_dialogues dataset; a sketch of this tabular layout follows below.
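
A minimal sketch of a couple of rows in this tabular layout, assuming pandas (all values illustrative):

    import pandas as pd

    # two rounds of a single conversation in the proposed tabular format
    df = pd.DataFrame(
        {
            "CONVERSATION_ID": ["c-001", "c-001"],
            "ROUND": [0, 1],
            "INSTRUCTION": [
                "What are some of the pro's and con's of social media?",
                "Why does it affect mental health?",
            ],
            "RESPONSE": [
                "Here are some potential pros and cons of social media: (..)",
                "Social media can affect mental health in many ways (..)",
            ],
            "SOURCE": ["wikipedia", "wikipedia"],
            "METADATA": ['{"lang": "en"}', '{"lang": "en"}'],
        }
    )
    df.to_parquet("threads.parquet")  # hypothetical output file; requires pyarrow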

There are many different dataset formats for multi-turn conversations.

umbra-scientia commented 1 year ago

If we could have a column for "URLs and references used" in each turn (and ideally a matching input field in the front-end), it would be useful for fine-tuning information retrieval later on. I think something very simple is enough. Example:

    "urls": ["https://arxiv.org/abs/1706.03762"],

The rest can be done later when processing records into training data, as long as we have the URLs.
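
For illustration, a message object in the thread format could then look like this (content illustrative):

{
    "text": "The Transformer architecture was introduced in (..)",
    "role": "assistant",
    "urls": ["https://arxiv.org/abs/1706.03762"]
}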

(If this is the wrong place for this suggestion, I apologize and will move/delete the comment.)

dctanner commented 1 year ago

I like the proposed jsonl format. It seems sensible to have a consistent format for all our data, and json in general is more flexible if we want to extend it later on. If we confirm this change, we should make sure the docs are updated at the same time. /cc @Vechtomov

Vechtomov commented 1 year ago

Obviously jsonl is easier for storing and processing dialogs, especially multi-turn dialogs. I think we can use it for these types of datasets. /cc @christophschuhmann

huu4ontocord commented 1 year ago

I also like jsonl.

totuta commented 1 year ago

I would also add +1 to the suggested jsonl format.

totuta commented 1 year ago

btw, if we reach a consensus on the format, would there be any action item from this proposal?

Vechtomov commented 1 year ago

I'll make a PR. But I found some slightly confusing behavior: when you upload a jsonl file via Dataset.from_json("dataset.jsonl").push_to_hub(...) it is converted into parquet. And even if you upload the file manually or via git lfs, it will still be converted internally to parquet on your machine when you download it via load_dataset.
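
A minimal sketch of the round-trip described above, assuming the datasets library and a hypothetical repo id:

    from datasets import Dataset, load_dataset

    # push a local jsonl file to the Hub; the library stores it as parquet internally
    ds = Dataset.from_json("dataset.jsonl")
    ds.push_to_hub("my-org/oa-threads")  # hypothetical repo id

    # downloading it again goes through the converted representation,
    # not the original jsonl file
    ds2 = load_dataset("my-org/oa-threads", split="train")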

andreaskoepf commented 1 year ago

> it is converted into parquet. And even if you upload the file manually or via git lfs, it will still be converted internally to parquet on your machine when you download it via load_dataset

OK, we should probably look into what options are available to customize the Hugging Face datasets library's loading mechanism. Research could start here: https://huggingface.co/docs/datasets/dataset_script
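
For reference, a minimal sketch of such a dataset loading script (class name and data file path are illustrative, not an actual OA script):

    import json

    import datasets

    class OasstThreads(datasets.GeneratorBasedBuilder):
        """Loads OA-format jsonl threads directly, bypassing the automatic parquet conversion."""

        def _info(self):
            return datasets.DatasetInfo(description="OA multi-turn conversation threads")

        def _split_generators(self, dl_manager):
            path = dl_manager.download("threads.jsonl")  # hypothetical data file
            return [
                datasets.SplitGenerator(
                    name=datasets.Split.TRAIN, gen_kwargs={"path": path}
                )
            ]

        def _generate_examples(self, path):
            with open(path, encoding="UTF-8") as f:
                # one message tree per line, keyed by line index
                for idx, line in enumerate(f):
                    yield idx, json.loads(line)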