allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0
V1.0 candidate; new deduper options, new taggers
#100
Closed
soldni closed 7 months ago
soldni commented 8 months ago
Version 1.0 candidate
Added --dedupe.min_words and --dedupe.min_length to the deduplication tool to filter documents/paragraphs under a specific length
Added taggers to detect repetitions
Added support for custom BOS/EOS/PAD tokens in tokenize command
Improved speed of tokenizer when using Llama
Preliminary integration of WIMBD tools in Dolma
New tests
Added configurations to recreate Dolma v1.5 and v1.6
Added Dolma manuscript
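
To illustrate what a tagger that detects repetitions might look for, here is a minimal sketch. It is not the tagger added in this PR: the function name find_repeated_spans, the regular expression, and the thresholds (a unit of at least 2 characters, repeated at least 3 times in a row) are all assumptions chosen for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical pattern: a unit of 2+ characters followed immediately by
# at least two more copies of itself (three consecutive occurrences).
_REPEAT_RE = re.compile(r"(.{2,}?)\1{2,}", re.DOTALL)


def find_repeated_spans(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, unit) for each run of consecutive repetitions.

    This is an illustrative helper, not part of the Dolma API.
    """
    return [(m.start(), m.end(), m.group(1)) for m in _REPEAT_RE.finditer(text)]


if __name__ == "__main__":
    for start, end, unit in find_repeated_spans("abcabcabc xyz"):
        print(f"chars {start}-{end} repeat the unit {unit!r}")
```

A span-based return value is used here because a tagger typically annotates regions of a document rather than rewriting it; a downstream filter could then drop documents whose repeated spans cover too much of the text.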
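
The effect of the two new length options can be sketched in a few lines of Python. This is only an illustration of the idea, not the deduper's implementation: the helper names are hypothetical, and it assumes that length is counted in characters and words are split on whitespace, which may differ from what the tool actually does.

```python
from typing import List


def passes_length_filter(text: str, min_length: int = 0, min_words: int = 0) -> bool:
    """True if text meets both thresholds (hypothetical helper).

    Assumption: min_length counts characters and min_words counts
    whitespace-separated tokens.
    """
    return len(text) >= min_length and len(text.split()) >= min_words


def eligible_paragraphs(document: str, min_length: int = 0, min_words: int = 0) -> List[str]:
    """Return the paragraphs (newline-separated) long enough to be considered."""
    return [
        paragraph
        for paragraph in document.split("\n")
        if paragraph and passes_length_filter(paragraph, min_length, min_words)
    ]


if __name__ == "__main__":
    doc = "short\nthis paragraph has enough words to keep\n"
    print(eligible_paragraphs(doc, min_length=10, min_words=3))
```

Skipping very short paragraphs in this way avoids flagging boilerplate fragments such as single-word lines, which would otherwise match each other across many unrelated documents.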