allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0
854 stars
82 forks
issues
Is there explicitly instruction-following data in the version of Dolma used to train Olmo v1?
#177
john-hewitt
opened
3 days ago
1
added option for tokenizer to split on special tokens
#176
soldni
closed
5 days ago
0
Issue with ring tokenizer
#175
davidbrandfonbrener
closed
1 week ago
1
Deduplication / Decontamination
#174
chschroeder
opened
3 weeks ago
0
Need help in customizing python/dolma/taggers/c4.py
#173
mihara-bot
opened
4 weeks ago
0
Need clarification of Gopher in Step 2
#172
mihara-bot
opened
1 month ago
0
Better Filters Error Handling
#171
soldni
opened
1 month ago
0
Adds ZST support in Deduper and Mixer
#170
soldni
closed
1 month ago
0
Workaround to fix memory leak in HuggingFace tokenizer
#169
soldni
closed
1 month ago
1
Need help on accessing the raw reddit data
#168
Jianxin-MNM
closed
3 weeks ago
1
PyPI release
#167
baberabb
opened
1 month ago
0
dedupe.documents.attribute_name does not work
#166
mathCrazyy
opened
1 month ago
1
New Progress Bar, Backoff, Batching
#165
soldni
opened
1 month ago
0
New Progress Bar code
#164
soldni
closed
1 month ago
0
Adding Quality Classifier from Dolma 1.7
#163
soldni
closed
1 month ago
0
Clarification Needed on "C4 NoPunc" in Data Processing
#162
codefly13
closed
1 month ago
1
Adding partition logic
#161
Whattabatt
closed
1 week ago
0
Warc Backoff
#160
soldni
opened
2 months ago
0
[EXPERIMENT ONLY, NOT FOR MERGING] Exporting First 200 Text
#159
power10dan
opened
2 months ago
0
Need help for installing dolma
#158
mihara-bot
closed
1 month ago
3
Duplicate ids in Dolma v1.7
#157
Vedaad-Shakib
opened
2 months ago
0
Reducing hash calls
#156
Whattabatt
closed
2 months ago
0
Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory
#155
dependabot[bot]
closed
2 months ago
0
Fixing dtype option not being correctly propagated
#154
soldni
closed
2 months ago
0
Add support for parsing WARC
#153
soldni
closed
2 months ago
0
dtype option is not working as expected
#152
tokenizer-decode
closed
2 months ago
1
Inquiry about Web Pipeline Availability
#151
codefly13
opened
2 months ago
2
Running paragraph level deduplication on c4
#150
andrewhojel
opened
2 months ago
2
Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory
#149
dependabot[bot]
closed
2 months ago
0
fix divide by 0 in gopher tagger
#148
peterbjorgensen
closed
2 months ago
0
Bump s3 client lib and parameterize region in s3 tests + devcontainer
#147
undfined
closed
3 months ago
0
Bump h2 from 0.3.24 to 0.3.26 in the cargo group across 1 directory
#146
dependabot[bot]
closed
3 months ago
1
Add extra tests for multi-byte unicode spans in deduper.
#145
soldni
closed
3 months ago
0
Optionally add total/sum to output of analyzer
#144
soldni
closed
3 months ago
0
S3 mixer doesn't start
#143
marcopasqua
opened
3 months ago
2
Data out of bounds when using ‘dolma tokens --dtype uint32’
#142
Jackwaterveg
opened
3 months ago
1
Add an option to improve tokenization shuffling
#141
soldni
closed
3 months ago
0
Fix local shuffling failure
#140
soldni
closed
3 months ago
0
Possible bug in `local_shuffle`?
#139
hwijeen
closed
3 months ago
2
Some race condition in url taggers
#138
peterbjorgensen
opened
3 months ago
1
use precompiled regex when loading url blocklists
#137
peterbjorgensen
closed
3 months ago
0
Is there a way to integrate the Dolma toolkit with Spark?
#136
DangoWang
opened
4 months ago
1
Improves tool to compute statistics; adds deduplication options.
#135
soldni
closed
4 months ago
0
A Question about the meaning of dolma_v1.6_cc_en
#134
aleien95
closed
3 months ago
1
Added JQ syntax for replacements + added minimum score.
#133
soldni
closed
4 months ago
0
Bump the cargo group group with 1 update
#132
dependabot[bot]
closed
4 months ago
0
Added Support for JQ syntax in include/exclude mixer config
#131
soldni
closed
4 months ago
0
Support providing streams into mixer via CLI
#130
soldni
opened
4 months ago
0
Tagger modules import (fix for #128)
#129
soldni
closed
4 months ago
0
tagger_modules do not work in current git version
#128
peterbjorgensen
closed
4 months ago
1