allenai/dolma
Data and tools for generating and inspecting OLMo pre-training data.
https://allenai.github.io/dolma/
Apache License 2.0
909 stars
94 forks
issues
Add robust median to gopher filter
#98
KennethEnevoldsen
closed
7 months ago
7
ff
#97
soldni
closed
8 months ago
0
deduplication examples does not work
#96
TTTTao725
closed
5 months ago
5
update readme
#95
kyleclo
closed
9 months ago
0
code/reasoning evaluation script
#94
benbogin
closed
8 months ago
0
When will a version that can add tagger be released?
#93
suolyer
closed
9 months ago
0
Add The Stack statistics
#92
Muennighoff
closed
8 months ago
0
ff
#91
soldni
closed
9 months ago
0
Disable cache in CI to prevent build failures
#90
soldni
closed
9 months ago
0
The Law School Admission Council | LSAC
#89
hannahzacharski55
closed
9 months ago
0
Child Support Division | LouisvilleKY.gov
#88
hannahzacharski55
closed
9 months ago
0
AllenAI
#87
hannahzacharski55
closed
9 months ago
0
Porting missing code filtering rules to dolma repo
#86
soldni
closed
9 months ago
0
Hells Angels infinite loop
#85
hannahzacharski55
closed
10 months ago
0
Hells Angels infinite loop save view open
#84
hannahzacharski55
closed
10 months ago
0
New Albany Business, Family Law and Criminal Defense Lawyer Aaron Johnson
#83
hannahzacharski55
closed
10 months ago
0
$open create sudo port forward import for https://www.mattoxandwilson.com>>all
#82
hannahzacharski55
closed
10 months ago
0
911
#81
hannahzacharski55
closed
10 months ago
0
$open Allen Wolf (infinite_loop)
#80
hannahzacharski55
closed
10 months ago
0
Git - gitk Documentation
#79
hannahzacharski55
closed
10 months ago
0
Latest version is not on PyPi
#78
KennethEnevoldsen
closed
7 months ago
3
Progress Bar may use more resources than necessary
#77
soldni
closed
7 months ago
0
ZIP Extractor - Free App for Opening and Creating ZIP Files
#76
hannahzacharski55
closed
10 months ago
1
Terminal20141030.zip - Google Drive
#75
hannahzacharski55
closed
10 months ago
1
Reddit processing code
#74
drschwenk
closed
9 months ago
0
Fix a few issues of the FixedBucketsValTracker
#73
peterbjorgensen
closed
9 months ago
1
feature to get the compliment of a hash sample
#72
IanMagnusson
closed
10 months ago
0
Fix Hardcoded Tokenizer
#71
soldni
closed
10 months ago
0
Dolma stat crashes because number of bins overflows python integer
#70
peterbjorgensen
closed
9 months ago
0
Add tagger_modules option to tagger cli
#69
peterbjorgensen
closed
10 months ago
1
Add attribute correlations
#68
Muennighoff
closed
9 months ago
0
Remove unnecessary spawn in tokenizer, fix config with multiple paths
#67
soldni
closed
10 months ago
0
Fix Accidental Override of Boolean Value
#66
soldni
closed
10 months ago
0
Fix hardcoded URL
#65
soldni
closed
10 months ago
0
Fix spawn method for multiprocessing
#64
soldni
closed
10 months ago
0
Add download instruction
#63
Muennighoff
closed
10 months ago
0
Documentation on BaseParallelProcessor
#62
soldni
closed
10 months ago
0
Baseline data
#61
IanMagnusson
opened
11 months ago
14
Text modification config
#60
rodneykinney
opened
11 months ago
1
Bump rustix from 0.37.20 to 0.37.25
#59
dependabot[bot]
closed
10 months ago
0
make_wikipedia.py fails on linux
#58
peterbjorgensen
opened
11 months ago
10
make_wikipedia.py hardcoded to simple
#57
peterbjorgensen
closed
10 months ago
0
Adding Citation text back to README
#56
soldni
closed
11 months ago
0
Fix Jekyll Docs Build
#55
soldni
closed
11 months ago
0
Adding Tokenizer, Writing Documentation, Misc Bugs & CLI improvements
#54
soldni
closed
11 months ago
0
Tokenizer and Documentation
#53
soldni
closed
11 months ago
0
Bump webpki from 0.22.0 to 0.22.2
#52
dependabot[bot]
closed
11 months ago
0
Add option to exclude empty documents in the mixer output
#51
soldni
closed
6 months ago
0
Simplify how rules in the mixer are provided
#50
soldni
opened
11 months ago
4
ff
#49
soldni
closed
11 months ago
0