NVIDIA / NeMo-Curator
Scalable toolkit for data curation
Apache License 2.0
326 stars
32 forks
issues
Update index.rst
#129
aschilling-nv
closed
1 day ago
0
[REVIEW] Add Resiliparse option for text extraction
#128
sarahyurick
opened
1 day ago
0
[DO NOT MERGE] Test
#127
sarahyurick
closed
1 day ago
0
Update README.md
#126
Maghoumi
opened
2 days ago
0
Uahmed/indic translation crossfit
#125
uahmed93
opened
2 days ago
0
Add Resiliparse option for text extraction
#124
sarahyurick
closed
1 week ago
1
Fix nemo_curator import in CPU only environment when GPU packages are installed.
#123
ayushdg
opened
1 week ago
0
Hardcode labels for domain and quality classifiers
#122
sarahyurick
closed
1 day ago
1
FastTextQualityFilter model file release
#121
simplew2011
opened
1 week ago
1
Remove Numpy<2.0 pin
#120
ayushdg
opened
1 week ago
0
Pin to numpy<2 to avoid spacy compat issues
#119
ayushdg
closed
1 week ago
1
Semdedup
#118
avem-nv
opened
1 week ago
0
Fix #116. Fix task-decontamination broken links
#117
miguelusque
closed
1 week ago
2
README.md broken links
#116
miguelusque
closed
1 week ago
0
shuffle datasets
#115
simplew2011
opened
1 week ago
3
first commit
#114
avem-nv
closed
1 week ago
0
Added tutorials to index.rst
#113
jgerh
closed
1 week ago
1
exact_deduplication.py out_of_memory
#112
simplew2011
closed
1 week ago
6
Fix bug #105
#111
nicoleeeluo
opened
1 week ago
3
Shuffle CC result on group before writing out
#110
ayushdg
closed
1 week ago
0
Import fails on cpu
#109
yyu22
opened
1 week ago
0
Update tutorial & Slurm documentation
#108
ayushdg
opened
1 week ago
0
Stricter check for query planning.
#107
ayushdg
opened
1 week ago
3
Url flattening
#106
jgerh
closed
1 week ago
0
Broken Tutorial: single_gpu_tutorial.ipynb
#105
axelmagn
opened
2 weeks ago
2
Applying SEO Best Practices
#104
aschilling-nv
closed
2 weeks ago
1
Added Nemo Curator Tutorials link
#103
jgerh
closed
1 week ago
0
Fix 69 - Refactor how arguments are added to scripts
#102
miguelusque
opened
2 weeks ago
1
fuzzy dedup in cpu
#101
simplew2011
opened
2 weeks ago
0
Update download documentation to include client creation
#100
moutasemalakkad
opened
2 weeks ago
3
Allow multiple filenames per partition when separating by metadata
#99
ayushdg
opened
2 weeks ago
0
Update requirements documentation.
#98
ayushdg
closed
3 weeks ago
0
Make sure query-planning is disabled for now
#97
rjzamora
closed
2 weeks ago
1
[REVIEW] Add Translation Module Example
#96
VibhuJawa
opened
3 weeks ago
6
Hardcode labels for domain and quality classifiers
#95
sarahyurick
closed
1 week ago
0
[WIP][Experiment] Use multi-threading for multi-file json reads
#94
rjzamora
opened
3 weeks ago
1
Update readme
#93
ayushdg
closed
3 weeks ago
2
Fix #91 - Incorrect reference to domain_classifier_example.py
#92
miguelusque
closed
3 weeks ago
1
[DOC] Incorrect reference to domain_classifier_example.py
#91
miguelusque
closed
3 weeks ago
0
Add Resiliparse option for text extraction
#90
sarahyurick
closed
1 day ago
3
ValueError: More than one filename found in partition
#89
aasthajh
opened
4 weeks ago
1
DAPT codegen data curation tutorial
#88
jvamaraju
opened
4 weeks ago
1
[DOC] Add rapids nightly install instructions for curator
#87
VibhuJawa
opened
4 weeks ago
0
Where to obtain domain and quality model files?
#86
randerzander
opened
4 weeks ago
2
find_pii_and_deidentify example fails
#85
randerzander
opened
4 weeks ago
1
[FEA] Raise a warning when creating FuzzyDuplicatesConfig with non-empty cache_dir
#84
randerzander
opened
1 month ago
1
Update documentation for new version
#83
ryantwolf
closed
3 weeks ago
0
Improve Common Crawl download
#82
ryantwolf
opened
1 month ago
0
Update issue templates
#81
ryantwolf
closed
1 month ago
1
[FEA] Improve download and extract utility
#80
ryantwolf
opened
1 month ago
0