issues
NVIDIA/NeMo-Curator
Scalable data preprocessing and curation toolkit for LLMs
Apache License 2.0
478 stars
57 forks
Newest
Number | Title | Author | Status | Comments
#268 | Add `enable_cudf_spill` to LocalCudaCluster | praateekmahajan | opened 2 days ago | 0
#267 | Add RPV2 Pre-training Data Curation Tutorial | yyu22 | opened 3 days ago | 0
#266 | Expand RMM options for Python API | sarahyurick | opened 4 days ago | 0
#265 | Fix 264. Incorrect affinity parameter | miguelusque | closed 4 days ago | 0
#264 | Unknown affinity parameter in SDG-Retriever_Eval_Tutorial notebook | miguelusque | closed 4 days ago | 0
#263 | Fixing Model Address | chrisalexiuk-nvidia | closed 4 days ago | 0
#262 | Added example notebook for translation with ct2 model. | uahmed93 | opened 4 days ago | 0
#261 | Add Curator and Spark example in docs | ronjer30 | opened 5 days ago | 0
#260 | Expand RMM options for Python API | sarahyurick | opened 5 days ago | 1
#259 | GitHub workflows improvements | sarahyurick | opened 5 days ago | 3
#258 | Update deduplication docs | ayushdg | opened 6 days ago | 0
#257 | Japanese support for get_word_splitter | oumugai | opened 1 week ago | 0
#256 | Fix 255 - Improve separate_by_metadata performance for jsonl files | miguelusque | opened 1 week ago | 0
#255 | [FEA] Improve separate_by_metadata performance when dealing with jsonl files | miguelusque | opened 1 week ago | 0
#254 | Add Image Curation Tutorial | ryantwolf | closed 1 week ago | 4
#253 | Add GPU CI/CD | sarahyurick | opened 1 week ago | 1
#252 | Exact/fuzzy deduplication bug | yyu22 | closed 1 week ago | 1
#251 | Add profiling / time in different stages of fuzzy / exact / semantic dedup | praateekmahajan | opened 1 week ago | 0
#250 | Fix id field | ryantwolf | closed 1 week ago | 0
#249 | Update SDG Documentation | ryantwolf | closed 1 week ago | 0
#248 | [DOC] Add documentation for fineweb edu classifier | VibhuJawa | opened 1 week ago | 0
#247 | Make id field optional for filtering | ryantwolf | closed 1 week ago | 0
#246 | Translation example with ctranslate2's Translator. | uahmed93 | opened 1 week ago | 2
#245 | Added a translation pipeline for ctranslate2 inference | uahmed93 | opened 1 week ago | 0
#244 | Add OOM section to Best Practices | sarahyurick | opened 2 weeks ago | 3
#243 | Updating Hello World Example with README and try/except | chrisalexiuk-nvidia | closed 1 week ago | 2
#242 | Update syntheticdata.rst | jgerh | closed 1 week ago | 1
#241 | [Tutorials] Update tutorial documentations | Maghoumi | closed 2 weeks ago | 0
#240 | Retire text_bytes_aware_shuffle and directly use shuffle | VibhuJawa | opened 2 weeks ago | 1
#239 | Fixing missing `await` in Code Cell. | chrisalexiuk-nvidia | closed 2 weeks ago | 0
#238 | Add image documentation | ryantwolf | opened 2 weeks ago | 0
#237 | Enabled nightly build using RAPIDS nightly | praateekmahajan | closed 1 week ago | 0
#236 | Re-add `test_uneven_common_crawl_range` PyTest | sarahyurick | opened 2 weeks ago | 0
#235 | Remove flaky PyTest | sarahyurick | closed 2 weeks ago | 0
#234 | Upgrade crossfit version | VibhuJawa | closed 3 weeks ago | 0
#233 | Make `max_text_bytes_per_part` configurable | ayushdg | opened 3 weeks ago | 0
#232 | Update SemDedup docs | sarahyurick | closed 1 week ago | 1
#231 | Add ability to build using RAPIDS nightly | praateekmahajan | closed 2 weeks ago | 0
#230 | Fix Bug #210 | nicoleeeluo | closed 1 week ago | 0
#229 | Clean up ID logic | sarahyurick | closed 2 weeks ago | 0
#228 | Grammar and punctuation nits in Jupyter Notebooks | sarahyurick | opened 3 weeks ago | 3
#227 | Add text-image pair curation methods | ryantwolf | closed 2 weeks ago | 8
#226 | Add SDG to sidebar | ryantwolf | closed 3 weeks ago | 0
#225 | Add semededup documentation to index.html | VibhuJawa | closed 3 weeks ago | 2
#224 | [Tutorials] Fix API key missing errors in the SDG tutorial | Maghoumi | closed 3 weeks ago | 0
#223 | Update Readme instruction to install cython | praateekmahajan | closed 3 weeks ago | 2
#222 | Minhash typo | sarahyurick | closed 3 weeks ago | 0
#221 | Sort labels | ryantwolf | closed 1 month ago | 0
#220 | Added Curator and Spark example | ronjer30 | closed 3 days ago | 1
#219 | SDG Hello World Example | chrisalexiuk-nvidia | closed 3 weeks ago | 1