NVIDIA/NeMo-Curator
Scalable toolkit for data curation
Apache License 2.0
327 stars
32 forks
issues
Improve Common Crawl download
#82
ryantwolf
closed
14 hours ago
0
Update issue templates
#81
ryantwolf
closed
1 month ago
1
[FEA] Improve download and extract utility
#80
ryantwolf
closed
14 hours ago
0
[BUG] Better error/checks around input types being CPU/GPU
#79
ayushdg
opened
1 month ago
0
Add pull request template
#78
ayushdg
closed
1 month ago
1
[Draft] Update Fuzzy dedup params for long strings support.
#77
ayushdg
opened
1 month ago
0
[BUG] Unable to install
#76
mike2463
opened
1 month ago
3
Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes
#75
miguelusque
closed
4 weeks ago
3
Fuzzy Dedup: Use text_field instead of hardcoded text column
#74
ayushdg
closed
1 month ago
1
[FEA] Support dask query planning
#73
ayushdg
opened
1 month ago
1
[FEA] Make classifiers work with HuggingFace directly
#72
VibhuJawa
opened
1 month ago
5
[IMP] Hard code labels for the quality and domain classifier
#71
VibhuJawa
opened
1 month ago
0
[FEA] Add support for Multiple Model Quality Classification
#70
sarahyurick
opened
1 month ago
0
[FEA] Refactor how arguments are added to scripts
#69
miguelusque
opened
1 month ago
0
R0.3.0
#68
ryantwolf
closed
1 month ago
0
[BUG] Fuzzy deduplication fails on datasets with no duplicates
#67
Maghoumi
opened
1 month ago
1
[FEA] Update read_json to work with s3 paths.
#66
ayushdg
opened
1 month ago
0
[FEA] Add examples showing how to use both CPU & GPU modules together
#65
ayushdg
opened
1 month ago
1
[BUG] Importing spacy before cluster creation leads to only 1 GPU being used.
#64
ayushdg
opened
1 month ago
0
[FEA] Add --meta parameter to explicitly specify the jsonl field dtypes
#63
miguelusque
closed
1 month ago
2
NeMo Curator ReadMe Updates
#62
jgerh
closed
4 weeks ago
3
Only import PII constants during Curator import
#61
ayushdg
closed
1 month ago
1
Align `extract_partitioning_index` logic with upstream shuffling
#60
rjzamora
closed
1 month ago
5
[BUG] OOM errors while running Deduplication
#59
vipulraheja
closed
1 month ago
1
[REVIEW] Switch Models to use Crossfit
#58
VibhuJawa
closed
1 month ago
6
Fix issue #43 (empty files creation) and improve reading/writing speed
#57
miguelusque
closed
1 month ago
0
Disable string conversion globally
#56
ryantwolf
closed
1 month ago
0
Fix indexing in PII Modifier
#55
ryantwolf
closed
1 month ago
0
Fix #53 - Add batched files reading support to separate_by_metadata script
#54
miguelusque
opened
1 month ago
1
[FEA] Add batched files reading to separate_by_metadata.py
#53
miguelusque
opened
1 month ago
1
[BUG] NonAlphaNumericFilter is implemented incorrectly
#52
zahramahani
closed
1 week ago
2
[BUG] Jaccard Shuffle error if shuffled_docs.parquet data already exists and has been written.
#51
ayushdg
opened
1 month ago
0
Better mimic DocumentDataset's `read_*` functions to Dask's `read_*` functions
#50
sarahyurick
opened
1 month ago
1
[BUG] Jaccard Shuffle error if merge result is empty
#49
ayushdg
opened
1 month ago
0
Fuzzy dedup error if partition wise indices do not start from 0
#48
ayushdg
opened
1 month ago
0
[BUG] Read_json errors when working with empty/missing files.
#47
ayushdg
opened
1 month ago
0
High level fuzzy duplicates module
#46
ayushdg
closed
1 month ago
1
[Tutorials] Add a tutorial for PEFT data curation
#45
Maghoumi
closed
1 month ago
7
Fix #43 (empty files creation) and improve reading/writing speed
#44
miguelusque
closed
1 month ago
0
[BUG] Empty files created when invoking reshard_jsonl method at nemo_curator.utils.file_utils.py
#43
miguelusque
closed
1 month ago
1
Move common dedup utils and remove unused code
#42
ayushdg
closed
1 month ago
0
Fix failing GPU tests with latest pandas bump
#41
ayushdg
closed
2 months ago
0
Adds Nemo Curator K8s example
#40
terrykong
closed
2 months ago
0
Adds Nemo Curator K8s example
#39
terrykong
closed
2 months ago
1
[BUG] download process has memory leak during extraction to jsonl
#38
zahramahani
opened
2 months ago
0
Fix lang id example
#37
ryantwolf
closed
1 month ago
1
Improve speed of AddId module
#36
ryantwolf
closed
2 months ago
0
Fix metadata inference with pandas and dask
#35
ryantwolf
closed
2 months ago
1
Disable PyTorch Compile Multiprocessing
#34
ryantwolf
closed
2 months ago
0
[BUG] FastTextLangId doesn't truly return a list
#33
zahramahani
closed
2 months ago
4