LAION-AI / Open-Assistant

OpenAssistant is a chat-based assistant that understands tasks, can interact with third-party systems, and can retrieve information dynamically to do so.
https://open-assistant.io
Apache License 2.0
36.94k stars 3.22k forks

Add chunking of pretrain text modeling datasets #3586

Closed: andreaskoepf closed this 1 year ago

andreaskoepf commented 1 year ago

Text datasets like fanfics contain long entries. This PR splits dataset entries that exceed the specified max_chunk_size into multiple smaller entries.
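
The splitting described above can be illustrated with a minimal sketch. It is not the code from this PR: the function name `chunk_entries` is hypothetical, and it assumes that max_chunk_size is measured in characters and that entries are cut at fixed offsets (the actual change may count tokens or split on other boundaries).

```python
from typing import Iterable, Iterator


def chunk_entries(entries: Iterable[str], max_chunk_size: int) -> Iterator[str]:
    """Yield short entries unchanged; split longer ones into consecutive slices.

    Hypothetical helper: assumes max_chunk_size is a character count.
    """
    if max_chunk_size <= 0:
        raise ValueError("max_chunk_size must be positive")
    for text in entries:
        if len(text) <= max_chunk_size:
            # Entry already fits, so pass it through as a single entry.
            yield text
            continue
        # Entry is too long: emit slices of at most max_chunk_size characters.
        for start in range(0, len(text), max_chunk_size):
            yield text[start:start + max_chunk_size]


if __name__ == "__main__":
    dataset = ["a short entry", "x" * 25]
    for chunk in chunk_entries(dataset, max_chunk_size=10):
        print(len(chunk), repr(chunk))
```

Cutting at fixed offsets can split words or sentences in the middle; a variant that backs up to the nearest whitespace or paragraph break would avoid that at the cost of uneven chunk lengths.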