If we could have a column for "URLs and references used" in each turn (and ideally a matching input field in the front-end), it would be useful for fine-tuning information retrieval later on. I think something very simple is enough. Example:
"urls": ["https://arxiv.org/abs/1706.03762"],
The rest can be done later when processing records into training data, as long as we have the URLs.
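For instance, a complete turn record carrying such a field could look like this (a sketch only; everything besides `urls` is illustrative):

```json
{"role": "assistant", "text": "The transformer architecture was introduced in the paper linked below.", "urls": ["https://arxiv.org/abs/1706.03762"]}
```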
(If this is the wrong place for this suggestion, I apologize and will move/delete the comment.)
I like the jsonl format proposed. It seems sensible to have a consistent format for all our data, and json in general is more flexible if we do want to extend it later on. If we confirm this change we should make sure the docs are updated at the same time /cc @Vechtomov
Obviously jsonl is easier for storing and processing dialogs and especially multi-turn dialogs. I think we can use it for these types of datasets. /cc @christophschuhmann
I also like jsonl.
I would also add +1 to the suggested `jsonl` format.
btw, if we find a consensus on the format, would there be any action item from this proposal?
I'll make a PR. But I found some confusing behavior: when you upload a jsonl file via `Dataset.from_json("dataset.jsonl").push_to_hub(...)`, it is converted into parquet. Also, even if you upload the file manually or via `git lfs`, it will still be converted internally to parquet on your machine when you download it via `load_dataset`.
OK, we probably have to look into which ways are available to customize the Hugging Face datasets library's loading mechanism. Research could start here: https://huggingface.co/docs/datasets/dataset_script
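For example, a minimal loading script could keep the raw data as jsonl on the Hub and parse it on load. This is only a sketch; the file name `data.jsonl`, the class name, and the features are illustrative assumptions, not the actual OA setup:

```python
import json

import datasets


class OasstThreads(datasets.GeneratorBasedBuilder):
    """Illustrative dataset script that serves jsonl directly instead of parquet."""

    def _info(self):
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "text": datasets.Value("string"),
                    "role": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # "data.jsonl" is a hypothetical file shipped alongside this script
        path = dl_manager.download("data.jsonl")
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN, gen_kwargs={"path": path}
            )
        ]

    def _generate_examples(self, path):
        # one json object per line, keyed by line index
        with open(path, encoding="utf-8") as f:
            for idx, line in enumerate(f):
                yield idx, json.loads(line)
```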
We need a dataset file format that allows multi-turn conversations. Currently we ask people to contribute datasets as parquet files with a simple column structure (`INSTRUCTION`, `RESPONSE`, `SOURCE`, `METADATA`), see datasets/README.md.
In the Open-Assistant HF collection backend we use `jsonl` (or `jsonl.gz`) as import/export file format. We could use a thread variant of this format to store multi-turn conversations and use it as our official OA conversation dataset format. The core structure would look as follows (here shown formatted with indentation; in the jsonl files it would be encoded as one `json` object per line):
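(A sketch of such a thread record; the field names follow the discussion below, the concrete values are made up:)

```json
{
  "thread": {
    "text": "Can you explain what attention is in a transformer?",
    "role": "prompter",
    "lang": "en",
    "replies": [
      {
        "text": "Attention lets the model weigh other positions in the sequence when encoding a token ...",
        "role": "assistant",
        "lang": "en",
        "replies": []
      }
    ]
  }
}
```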
This format would be "compatible" with the full oasst import/export format. A full thread export from oasst looks as follows (again shown indented for readability):
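(Again only a sketch; the exact set of metadata fields in the real export may differ:)

```json
{
  "thread": {
    "message_id": "de2d0f1e-...",
    "parent_id": null,
    "text": "Can you explain what attention is in a transformer?",
    "role": "prompter",
    "lang": "en",
    "review_count": 3,
    "rank": 0,
    "replies": [
      {
        "message_id": "a41f23c7-...",
        "parent_id": "de2d0f1e-...",
        "text": "Attention lets the model weigh other positions in the sequence when encoding a token ...",
        "role": "assistant",
        "lang": "en",
        "review_count": 3,
        "rank": 0,
        "replies": []
      }
    ]
  }
}
```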
The additional properties shown here are optional; only the `"text"` field would really be mandatory (and maybe `"role"`) for each message. The `"lang"` field could be added for multi-lingual datasets. Additional properties could be added in custom fields.

### Handling jsonl
In many languages `jsonl` data can be generated and parsed easily (i.e. with only standard libraries in very few lines). The following is an example of loading jsonl (used in model_training/custom_datasets/oasst_dataset.py):
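(The repo snippet is not reproduced here; the following is a sketch of such a loader that flattens the nested reply trees into linear threads. The structure and names are illustrative, not the exact code from oasst_dataset.py:)

```python
import gzip
import json


def load_threads(path: str):
    """Read a jsonl(.gz) export and yield each thread as a list of (role, text) turns."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            turns = []
            node = record["thread"]
            # follow the first reply at each level to obtain a linear conversation
            while node is not None:
                turns.append((node.get("role"), node["text"]))
                replies = node.get("replies") or []
                node = replies[0] if replies else None
            yield turns
```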
The main loading code for `jsonl`/`jsonl.gz` in Python is as simple as:
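(A minimal sketch, using only the standard library:)

```python
import gzip
import json


def read_jsonl(path: str) -> list:
    """Return the list of json objects stored one-per-line in a .jsonl or .jsonl.gz file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```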
### parquet vs. jsonl

Here are some (very subjective and not complete) cons for parquet & jsonl:

parquet:

- binary format: files are not human-readable and can't be inspected or diffed with standard text tools
- reading and writing requires an extra library (e.g. pyarrow)
- appending records or editing a file by hand is awkward

jsonl:

- larger files than a compressed columnar format (mitigated by gzip)
- no enforced schema; inconsistent records only surface at parse time
- records are parsed row by row, so column-wise access is slower
### Alternative multi-turn tabular format
(In case we cannot agree on the jsonl/oa format, an alternative would be to define a tabular multi-turn conversation format that is close to the current one by adding two columns like `CONVERSATION_ID` and `ROUND`:
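One possible shape, with made-up values:

| CONVERSATION_ID | ROUND | INSTRUCTION | RESPONSE | SOURCE | METADATA |
|---|---|---|---|---|---|
| 42 | 0 | What is attention? | Attention lets the model weigh ... | example | {} |
| 42 | 1 | Can you give an example? | Sure, consider ... | example | {} |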
This would be similar to other datasets like the empathetic_dialogues dataset.)
There are many different dataset formats for multi-turn conversations; some examples:

- DailyDialog (utterances separated by the `__eou__` token)