allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0
972 stars
107 forks
issues
Adding support for Classifiers and Search tools
#219
soldni
opened
1 week ago
0
dedupe_paragraphs didn't working
#218
wannaphong
closed
1 week ago
2
PII didn't working
#217
wannaphong
closed
1 week ago
3
Fixed issues and improved documentation in getting-started.md
#216
aman-17
closed
2 weeks ago
0
Mixer validator
#215
mariia-iureva
closed
1 week ago
4
DCLM Style Deduplications
#214
revbucket
opened
1 month ago
0
'File partition' option and 'document' directory specification
#213
Whattabatt
closed
3 weeks ago
0
Mattj/requirements
#212
revbucket
opened
1 month ago
0
Add requirements.txt?
#211
revbucket
closed
1 month ago
1
DNM: Patch FT Tagger
#210
undfined
opened
1 month ago
0
FastText training data
#209
msaebi1993
opened
1 month ago
0
Also pin maturin in action
#208
undfined
closed
1 month ago
0
Pin maturin in CI
#207
undfined
closed
1 month ago
0
Bump version with postfix for PyPI
#206
undfined
closed
1 month ago
0
Adds dclm fasttext classifier
#205
undfined
closed
1 month ago
0
Undfined/runner v3
#204
undfined
closed
1 month ago
0
Revert upload/download to v3 for now
#203
undfined
closed
1 month ago
0
Dependabot fail, match upload/download action versions
#202
undfined
closed
1 month ago
0
after pip install dolma, use the dolma--help, return zipfile.BadZipFile: File is not a zip file
#201
XiaozhuLove
closed
1 month ago
3
Polymorphic span replacement
#200
undfined
closed
1 month ago
0
Mixer progress tracking
#199
aetting
opened
1 month ago
0
Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory
#198
dependabot[bot]
closed
1 month ago
0
Fix bug in length filtering for deduping
#197
soldni
closed
1 month ago
0
[Json fooramt error in line 133] Update getting-started.md
#196
yushengsu-thu
closed
1 month ago
0
Use Numpy v1.x instead of 2.x
#195
soldni
closed
2 months ago
0
[Issue: `dolma.core.errors.DolmaFatalError` in Step 1: Run Taggers]
#194
yushengsu-thu
closed
2 months ago
2
Update getting-started.md
#193
yushengsu-thu
closed
2 months ago
1
[Dolma Tutorial (https://allenai.github.io/docs): 707NotFound]
#192
yushengsu-thu
opened
2 months ago
2
Changing Formatter
#191
soldni
closed
1 month ago
0
Allow specifying different bins for visualization and computation.
#190
soldni
opened
2 months ago
0
Added tokenizers for length
#189
soldni
closed
2 months ago
0
Changed entrypoint to increase IntelliJ compatibility
#188
soldni
closed
2 months ago
0
Bump nltk from 3.8.1 to 3.9 in the pip group across 1 directory
#187
dependabot[bot]
closed
2 months ago
0
Count Bytes and Docs
#186
soldni
closed
2 months ago
0
Fix local installation on MacOS
#185
epwalsh
closed
2 months ago
0
Fix Tests to pass with new mixer behavior
#184
soldni
closed
2 months ago
0
Always use inferred extension
#183
undfined
closed
2 months ago
0
Bump to 1.0.7
#182
undfined
closed
2 months ago
0
V2 of Gopher tagger
#181
soldni
closed
3 months ago
0
Cherry pick zstd compressor
#180
undfined
closed
3 months ago
0
Version bump for new release (1.0.4)
#179
soldni
closed
3 months ago
0
Bump openssl from 0.10.64 to 0.10.66 in the cargo group
#178
dependabot[bot]
closed
2 months ago
0
Is there explicitly instruction-following data in the version of Dolma used to train Olmo v1?
#177
john-hewitt
opened
3 months ago
3
added option for tokenizer to split on special tokens
#176
soldni
closed
3 months ago
0
Issue with ring tokenizer
#175
davidbrandfonbrener
closed
3 months ago
1
Deduplication / Decontamination
#174
chschroeder
closed
2 days ago
3
Need help in customizing python/dolma/taggers/c4.py
#173
mihara-bot
opened
4 months ago
0
Need clarification of Gopher in Step 2
#172
mihara-bot
opened
4 months ago
0
Better Filters Error Handling
#171
soldni
closed
2 months ago
0
Adds ZST support in Deduper and Mixer
#170
soldni
closed
4 months ago
0